Open alex-hh opened 2 days ago
The index will be a Parquet file (with no file extension) that maps each example id to the shard containing it. With it, we can download a single shard and retrieve the example.
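A rough sketch of the lookup this enables, assuming the index has already been loaded into a plain dict from id to shard filename. The names `index`, `shards_for_ids`, and the shard filenames are hypothetical, chosen only for illustration:

```python
from collections import defaultdict


def shards_for_ids(index, wanted_ids):
    """Group the requested ids by the shard that holds them.

    `index` is assumed to be a dict {example_id: shard_filename};
    in practice it would be read from the index parquet file.
    """
    by_shard = defaultdict(list)
    for example_id in wanted_ids:
        by_shard[index[example_id]].append(example_id)
    return dict(by_shard)


# Toy index: three examples spread over two shards (made-up names).
index = {
    "a1": "shard-0.parquet",
    "a2": "shard-0.parquet",
    "b1": "shard-1.parquet",
}

# Only shard-1 needs to be downloaded to retrieve example "b1".
plan = shards_for_ids(index, ["b1"])
```

The keys of `plan` are the shards to download and the values are the ids to keep from each one.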
What we need:
- A split generator that looks for config- and split-specific index files (train_index or train/index).
- Index files let us subset both the parquet shards and the examples within them.
- We then add a ds.filter before returning the dataset. There might be an efficient Arrow way to implement the filter.
(This could also go directly into the YAML, but the index file solution is more modular.)
This assumes the dataset has an id field and an index.