Pandas/Arrow extensions may make it possible to easily read/write the Rikai types to/from parquet datasets.
For usage, check out the new tests/types/test_pandas.py file.
For a quick note on the proposed design, see the top of rikai/types/pandas.py.
One major design issue is that currently we're boxing/unboxing the
individual elements into Rikai's extension types. This has the following problems:
Rikai types (e.g., Image) must be a valid storage type. Here I've
made it subclass dict. This clearly needs more thought on how to
actually represent the Rikai types.
Performance takes a huge hit, since every element is boxed and unboxed individually.
Instead, one possible route is to implement the Rikai extension methods (e.g., Image.crop) as vectorized methods on the ImageArray pandas extension array. If we can make this work consistently, we would only need to box to Image when picking a particular element out of the array.
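To make the idea concrete, here is a minimal sketch in plain Python. It is not the code in rikai/types/pandas.py and not a real pandas ExtensionArray: the `Image` and `ImageArray` classes below are toy stand-ins, and the `uri`/`box` fields and the `crop(box)` signature are assumptions made only for illustration. It shows the array-level method working on the column storage directly, with boxing deferred to scalar access.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# Hypothetical crop box: (left, top, right, bottom).
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Image:
    """Toy stand-in for a Rikai Image: a URI plus an optional crop box."""

    uri: str
    box: Optional[Box] = None


class ImageArray:
    """Toy columnar container (assumption: NOT the real extension array)."""

    def __init__(self, uris: Sequence[str], boxes: Optional[Sequence[Optional[Box]]] = None):
        # Storage is kept as plain columns; no Image objects are created here.
        self._uris: List[str] = list(uris)
        self._boxes: List[Optional[Box]] = (
            list(boxes) if boxes is not None else [None] * len(self._uris)
        )

    def __len__(self) -> int:
        return len(self._uris)

    def crop(self, box: Box) -> "ImageArray":
        """Array-level crop: updates the box column without boxing elements."""
        return ImageArray(self._uris, [box] * len(self._uris))

    def __getitem__(self, i: int) -> Image:
        """Box a single element into an Image only on scalar access."""
        return Image(self._uris[i], self._boxes[i])


arr = ImageArray(["a.png", "b.png"])
cropped = arr.crop((0, 0, 10, 10))  # operates on the columns, no Image created
first = cropped[0]                  # boxing happens only here
```

In a real extension array the same split would apply: array methods operate on the underlying storage, and `__getitem__` with a scalar index is the only place an `Image` is constructed.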
As always, the problem is nested data. Currently I have it "working" using some pandas customizations + manual boxing/unboxing. It would definitely be a problem for larger datasets due to performance issues.
Remaining major issues:
[ ] Decide on storage types for Rikai types
[ ] Handle NAs
[ ] Decide on whether to explicitly box to Rikai types eagerly
[ ] Handle other nesting patterns (e.g., struct, list of list of struct, struct of list of struct, etc.)
[ ] Rikai extension arrays are missing some methods like from_factorized