eto-ai / rikai

Parquet-based ML data format optimized for working with unstructured data
https://rikai.readthedocs.io/en/latest/
Apache License 2.0

Support filter push down in rikai.dataset #639

Open eddyxu opened 2 years ago

eddyxu commented 2 years ago

When loading a dataset for training, we usually want to split it into train, test, and eval splits. It should be easy for a user to load just one of these splits, for example:


from rikai.pytorch.data import Dataset

train_dataset = Dataset("foo.bar", filters=["split = 'train'"])
eval_dataset = Dataset("foo.bar", filters=["split = 'eval'"])

We could look into pyarrow's filters option for reading Parquet datasets to see whether we can use it to push the filter down.