Add `.nlp.tokenize()` expression

samster25 opened 1 year ago
We could explore introducing tokenization using the https://docs.rs/tokenizers/latest/tokenizers/ crate! Here's a proposal (cc @arnavgrg, who might be interested in taking a stab at it):
```python
import daft

df = daft.read_parquet("s3://foo/myfile.parquet")
df = df.with_column("tokenized_data", df["post_titles"].nlp.tokenize("bert-base-cased"))
df.collect()
```
The above snippet will tokenize strings in the `"post_titles"` column, producing a new `Tensor` column.
1. Add a `.tokenize()` function:
   a. `fn to_field` should take a field with a `DataType::Utf8` datatype and produce a field with a `DataType::Tensor` datatype (see the schema sketch after this list).
   b. `fn evaluate` should call into the appropriate method on `Series` to perform tokenization on a Series of strings.
2. Add `Series.tokenize()`, which would just downcast the series into a `Utf8Array`, stick each element into the tokenizers crate, and finally wrap the results into a new `TensorArray` (see the kernel sketch below).
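To make step 1a concrete, here is a minimal, self-contained sketch of the schema rule. The `Field`/`DataType` definitions below are toy stand-ins for Daft's real Rust types, and the `UInt32` inner dtype for the token-id tensor is an assumption:

```rust
// Toy stand-ins for Daft's Field/DataType; the real types live in Daft's Rust core.
#[derive(Clone, Debug, PartialEq)]
enum DataType {
    Utf8,
    UInt32,
    // A tensor column parameterized by its element dtype.
    Tensor(Box<DataType>),
}

#[derive(Clone, Debug)]
struct Field {
    name: String,
    dtype: DataType,
}

// Step 1a: a Utf8 input field maps to a Tensor output field; anything else is an error.
fn to_field(input: &Field) -> Result<Field, String> {
    match input.dtype {
        DataType::Utf8 => Ok(Field {
            name: input.name.clone(),
            // Assumed: token ids are stored as a UInt32 tensor.
            dtype: DataType::Tensor(Box::new(DataType::UInt32)),
        }),
        ref other => Err(format!("tokenize expects Utf8 input, got {:?}", other)),
    }
}
```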
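Similarly, here is a rough sketch of the `Series.tokenize()` kernel from step 2, built directly on the tokenizers crate. The `tokenize_utf8` helper and its `&[Option<String>]` input are hypothetical stand-ins for the downcast `Utf8Array`; a real implementation would pack the per-row id buffers into a `TensorArray` instead of returning `Vec`s. Note that `Tokenizer::from_pretrained` is gated behind the crate's `http` feature:

```rust
use tokenizers::Tokenizer;

/// Tokenize each string in a column into token ids, preserving nulls.
/// `model` is a Hugging Face Hub identifier such as "bert-base-cased".
fn tokenize_utf8(
    values: &[Option<String>],
    model: &str,
) -> tokenizers::Result<Vec<Option<Vec<u32>>>> {
    // Load the tokenizer once per column, not once per row.
    let tokenizer = Tokenizer::from_pretrained(model, None)?;
    values
        .iter()
        .map(|v| -> tokenizers::Result<Option<Vec<u32>>> {
            match v {
                // `encode` returns an Encoding carrying ids, offsets, masks, etc.;
                // we keep only the token ids here.
                Some(s) => Ok(Some(tokenizer.encode(s.as_str(), false)?.get_ids().to_vec())),
                // Nulls pass through unchanged.
                None => Ok(None),
            }
        })
        .collect()
}
```

For throughput, the crate also exposes `Tokenizer::encode_batch`, which encodes a whole batch of inputs at once and would map more naturally onto a columnar kernel than the per-row loop above.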