Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

NLP tokenization kernel #889

Open samster25 opened 1 year ago

samster25 commented 1 year ago

Add .nlp.tokenize() expression

jaychia commented 10 months ago

We could explore introducing tokenization using the https://docs.rs/tokenizers/latest/tokenizers/ crate! Here's a proposal (cc @arnavgrg who might be interested in taking a stab at it)

API

import daft

df = daft.read_parquet("s3://foo/myfile.parquet")
df = df.with_column("tokenized_data", df["post_titles"].nlp.tokenize("bert-base-cased"))
df.collect()

The above snippet tokenizes the strings in the "post_titles" column, producing a new Tensor column of token IDs.
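For reference, here is a minimal sketch of the per-string work that would be delegated to the tokenizers crate. This is an illustration rather than Daft code, and it assumes the crate's optional http feature is enabled so that Tokenizer::from_pretrained can fetch the "bert-base-cased" config from the Hugging Face Hub:

use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // Fetch (or load from cache) the pretrained tokenizer config.
    // Requires the `http` feature of the tokenizers crate.
    let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None)?;

    // Encode one string into token IDs -- the per-row transformation
    // that turns a Utf8 value into a tensor of integers.
    let encoding = tokenizer.encode("the quick brown fox", false)?;
    println!("{:?}", encoding.get_ids()); // &[u32] of token IDs
    Ok(())
}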

Implementation

  1. Add a new NLP "namespace" for expressions (see: ExpressionStringNamespace for an example) and add a .tokenize() function
  2. Forward the call from this function to Rust through the Expression pyo3 layer (see: Expression pyo3 bindings)
  3. Create a new FuncExpr to represent the tokenization expression (see: existing String expressions)
     a. fn to_field should take a field with a DataType::Utf8 datatype and produce a field with a DataType::Tensor datatype
     b. fn evaluate should call into the appropriate method on Series to perform tokenization on a Series of strings
  4. Implement Series.tokenize(), which would downcast the series into a Utf8Array, run each element through the tokenizers crate, and wrap the results into a new TensorArray (see the sketch after this list).
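To make steps 3 and 4 concrete, here is a simplified sketch of the kernel. It is not Daft's actual internals: plain Rust types stand in for Utf8Array and TensorArray so the snippet stays self-contained, and it again assumes the tokenizers crate with its http feature enabled.

use tokenizers::Tokenizer;

/// Simplified stand-in for Series.tokenize() (step 4): tokenize a nullable
/// column of strings into per-row buffers of token IDs. In Daft proper the
/// input would be a downcasted Utf8Array and the output a TensorArray.
fn tokenize_column(
    values: &[Option<&str>],
    model: &str,
) -> tokenizers::Result<Vec<Option<Vec<u32>>>> {
    // Load the pretrained tokenizer once per column, not once per row.
    let tokenizer = Tokenizer::from_pretrained(model, None)?;

    values
        .iter()
        .map(|maybe_text| match maybe_text {
            // Nulls pass through as nulls.
            None => Ok(None),
            // Each valid row is run through the tokenizers crate and its
            // token IDs collected (the fn evaluate path from step 3b).
            Some(text) => {
                let encoding = tokenizer.encode(*text, false)?;
                Ok(Some(encoding.get_ids().to_vec()))
            }
        })
        .collect()
}

fn main() -> tokenizers::Result<()> {
    let column = [Some("hello world"), None, Some("daft is a dataframe")];
    let token_ids = tokenize_column(&column, "bert-base-cased")?;
    // Each row now holds a variable-length buffer of token IDs; a real
    // TensorArray would store these buffers alongside per-row shapes.
    println!("{token_ids:?}");
    Ok(())
}

One thing worth noting for the to_field mapping in step 3a: encodings are variable-length, so the resulting Tensor column is effectively ragged and the per-row shapes would need to be tracked.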