eto-ai / rikai

Parquet-based ML data format optimized for working with unstructured data
https://rikai.readthedocs.io/en/latest/
Apache License 2.0

DRAFT: pandas and arrow extensions for type inference #657

Open changhiskhan opened 2 years ago

changhiskhan commented 2 years ago

Pandas/Arrow extensions may make it possible to easily read/write the Rikai types to/from parquet datasets.

One major design issue is that we currently box and unbox the individual elements into Rikai's extension types. This causes the following problems:

  1. Rikai types (e.g., Image) must be a valid storage type. Here I've made Image subclass dict, but how to actually represent the Rikai types clearly needs more thought.
  2. Performance takes a huge hit.

Instead, one possible route is to implement the Rikai extension methods (e.g., Image.crop) as vectorized methods on the ImageArray pandas extension array. If we can make this work consistently, we would essentially only need to box to Image when picking a particular element out of the array.
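
To make the idea concrete, here is a rough sketch in plain Python. It is not the code in this PR and it does not implement the real pandas ExtensionArray interface; the field layout (a URI plus a crop region) and the lazy `crop` semantics are assumptions made purely for illustration. The point is only that the array keeps its data in columns, applies `crop` to the whole array at once, and builds an `Image` object only when a single element is requested.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# Hypothetical region layout: (left, top, right, bottom).
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Image:
    """Hypothetical boxed element: a URI plus an optional crop region."""

    uri: str
    region: Optional[Box] = None


class ImageArray:
    """Column-oriented stand-in for a pandas extension array.

    Illustration only: this does NOT subclass or implement
    pandas.api.extensions.ExtensionArray.
    """

    def __init__(self, uris: Sequence[str],
                 regions: Optional[Sequence[Optional[Box]]] = None):
        self._uris: List[str] = list(uris)
        if regions is None:
            regions = [None] * len(self._uris)
        self._regions: List[Optional[Box]] = list(regions)
        if len(self._regions) != len(self._uris):
            raise ValueError("uris and regions must have the same length")

    def __len__(self) -> int:
        return len(self._uris)

    def crop(self, box: Box) -> "ImageArray":
        """Array-level crop: record the region for every element.

        No Image object is created here; the columns are reused as-is.
        """
        return ImageArray(self._uris, [box] * len(self))

    def __getitem__(self, i: int) -> Image:
        """Box into Image only when one element is picked out."""
        return Image(self._uris[i], self._regions[i])


arr = ImageArray(["a.jpg", "b.jpg"])
cropped = arr.crop((0, 0, 10, 10))  # operates on the whole array
one = cropped[1]                    # the only place an Image is built
```

In a real implementation the columns would presumably be Arrow or NumPy buffers rather than Python lists, and `crop` would need to compose with any existing region, but the boxing boundary would sit in the same place.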

As always, the problem is nested data. Currently I have it "working" using some pandas customizations plus manual boxing/unboxing, which would definitely be a problem for larger datasets due to performance issues.

Remaining major issues: