eto-ai / rikai

Parquet-based ML data format optimized for working with unstructured data
https://rikai.readthedocs.io/en/latest/
Apache License 2.0

DRAFT: Add the ability to write a pandas dataframe to disk in Rikai format without needing a live spark session #653

Closed changhiskhan closed 2 years ago

changhiskhan commented 2 years ago

Open issues:

  1. I had to change how Box2d is serialized/deserialized to make it work (otherwise the elements are written out of order)
  2. This probably means the other shapes are also wrong right now, but I wanted to solicit feedback before testing/changing those
  3. Right now you still need to pass an explicit StructType schema, so that the resulting dataset is readable by Spark and the writer knows when to convert the Rikai types into native types. It would be nice to add an option to infer the types from a sample of rows in the DataFrame.
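
To make the inference idea in item 3 concrete, here is a minimal sketch of what sampling rows to guess column types could look like. It is not Rikai's implementation: `infer_field_types` and the `Box2d` class below are hypothetical stand-ins, the function works on plain dict rows rather than a real DataFrame, and it returns Python type names instead of a Spark StructType. Mapping those names to Spark or Rikai types is left out on purpose.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple


class Box2d:
    """Hypothetical stand-in for a Rikai shape type (NOT rikai.types.Box2d)."""

    def __init__(self, xmin: float, ymin: float, xmax: float, ymax: float):
        self.xmin, self.ymin, self.xmax, self.ymax = xmin, ymin, xmax, ymax


def infer_field_types(
    rows: Iterable[Dict[str, Any]], sample_size: int = 100
) -> List[Tuple[str, str]]:
    """Guess one type name per column from the first `sample_size` rows.

    Columns are returned in first-seen order, so the field order is
    deterministic. None values are skipped; a column that only ever
    holds None is reported as "null".
    """
    seen: Dict[str, Set[str]] = {}
    for i, row in enumerate(rows):
        if i >= sample_size:
            break
        for name, value in row.items():
            types = seen.setdefault(name, set())
            if value is not None:
                types.add(type(value).__name__)

    fields: List[Tuple[str, str]] = []
    for name, types in seen.items():
        if len(types) > 1:
            # Ambiguous column: fall back to asking for an explicit schema.
            raise TypeError(f"column {name!r} has mixed types: {sorted(types)}")
        fields.append((name, next(iter(types)) if types else "null"))
    return fields


sample = [
    {"id": 1, "box": Box2d(0, 0, 10, 10)},
    {"id": 2, "box": None},
]
inferred = infer_field_types(sample)
```

With pandas, the sample rows could come from something like `df.head(100).to_dict("records")`; the type names would then reflect whatever objects pandas returns for each cell. Raising on mixed types keeps the explicit-schema path as the fallback when a sample is ambiguous.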