Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

NLP tokenization kernel #889

Open samster25 opened 1 year ago

samster25 commented 1 year ago

Add .nlp.tokenize() expression

jaychia commented 10 months ago

We could explore introducing tokenization using the https://docs.rs/tokenizers/latest/tokenizers/ crate! Here's a proposal (cc @arnavgrg who might be interested in taking a stab at it)

API

import daft

df = daft.read_parquet("s3://foo/myfile.parquet")
df = df.with_column("tokenized_data", df["post_titles"].nlp.tokenize("bert-base-cased"))
df.collect()

The above snippet tokenizes the strings in the "post_titles" column, producing a new Tensor column of token IDs.
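For reference, here is a minimal sketch of the per-string work that would be delegated to the tokenizers crate. This is an illustration rather than Daft code, and it assumes the crate's optional http feature is enabled so that Tokenizer::from_pretrained can fetch the "bert-base-cased" config from the Hugging Face Hub:

use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // Fetch (or load from cache) the pretrained tokenizer config.
    // Requires the `http` feature of the tokenizers crate.
    let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None)?;

    // Encode one string into token IDs -- the per-row transformation
    // that turns a Utf8 value into a tensor of integers.
    let encoding = tokenizer.encode("the quick brown fox", false)?;
    println!("{:?}", encoding.get_ids()); // &[u32] of token IDs
    Ok(())
}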

Implementation

  1. Add a new NLP "namespace" for expressions (see: ExpressionStringNamespace for an example) and add a .tokenize() function
  2. Forward the call from this function to Rust through the Expression pyo3 layer (see: Expression pyo3 bindings)
  3. Create a new FuncExpr to represent the tokenization expression (see: existing String expressions)
     a. fn to_field should take a field with a DataType::Utf8 datatype and produce a field with a DataType::Tensor datatype
     b. fn evaluate should call into the appropriate method on Series to perform tokenization on a Series of strings
  4. Implement Series.tokenize(), which would downcast the series into a Utf8Array, run each element through the tokenizers crate, and wrap the results into a new TensorArray (see the sketch after this list).
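To make steps 3 and 4 concrete, here is a simplified sketch of the kernel. It is not Daft's actual internals: plain Rust types stand in for Utf8Array and TensorArray so the snippet stays self-contained, and it again assumes the tokenizers crate with its http feature enabled.

use tokenizers::Tokenizer;

/// Simplified stand-in for Series.tokenize() (step 4): tokenize a nullable
/// column of strings into per-row buffers of token IDs. In Daft proper the
/// input would be a downcasted Utf8Array and the output a TensorArray.
fn tokenize_column(
    values: &[Option<&str>],
    model: &str,
) -> tokenizers::Result<Vec<Option<Vec<u32>>>> {
    // Load the pretrained tokenizer once per column, not once per row.
    let tokenizer = Tokenizer::from_pretrained(model, None)?;

    values
        .iter()
        .map(|maybe_text| match maybe_text {
            // Nulls pass through as nulls.
            None => Ok(None),
            // Each valid row is run through the tokenizers crate and its
            // token IDs collected (the fn evaluate path from step 3b).
            Some(text) => {
                let encoding = tokenizer.encode(*text, false)?;
                Ok(Some(encoding.get_ids().to_vec()))
            }
        })
        .collect()
}

fn main() -> tokenizers::Result<()> {
    let column = [Some("hello world"), None, Some("daft is a dataframe")];
    let token_ids = tokenize_column(&column, "bert-base-cased")?;
    // Each row now holds a variable-length buffer of token IDs; a real
    // TensorArray would store these buffers alongside per-row shapes.
    println!("{token_ids:?}");
    Ok(())
}

One thing worth noting for the to_field mapping in step 3a: encodings are variable-length, so the resulting Tensor column is effectively ragged and the per-row shapes would need to be tracked.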