bioml-tools / bio-datasets

Bringing bio (molecules and more) to the Hugging Face Datasets library
Apache License 2.0

write a load_example method #63

Open alex-hh opened 2 days ago

alex-hh commented 2 days ago

This assumes the dataset has an id field and an index.

alex-hh commented 2 days ago

The index will be a parquet file (stored without a file extension) that maps each id to its shard. We can then download just that one shard and retrieve the example from it.
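A minimal sketch of that lookup flow, under stated assumptions: the index is already loaded as an id-to-shard mapping, and `fetch_shard` is a hypothetical callable that downloads and decodes one shard. `load_example`'s signature, the shard names and the toy data are all made up for illustration and are not bio-datasets code.

```python
from typing import Any, Callable, Dict, List, Mapping

# Hypothetical types for illustration; none of these names come from bio-datasets.
Example = Dict[str, Any]
ShardFetcher = Callable[[str], List[Example]]


def load_example(
    example_id: str,
    index: Mapping[str, str],
    fetch_shard: ShardFetcher,
    id_field: str = "id",
) -> Example:
    """Return the example with the given id, fetching only one shard.

    `index` maps example id -> shard file name (in practice it would be read
    from the parquet index file); `fetch_shard` downloads and decodes a shard.
    """
    try:
        shard_name = index[example_id]
    except KeyError:
        raise KeyError(f"id {example_id!r} is not in the index") from None

    # Only the shard named by the index is fetched; all others are untouched.
    for example in fetch_shard(shard_name):
        if example[id_field] == example_id:
            return example
    raise LookupError(f"{example_id!r} not found in shard {shard_name!r}")


# Toy in-memory stand-ins for the index and the remote shards (assumed data).
shards = {
    "train-00000.parquet": [{"id": "1abc", "sequence": "MKT"}],
    "train-00001.parquet": [{"id": "2xyz", "sequence": "GAV"}],
}
index = {"1abc": "train-00000.parquet", "2xyz": "train-00001.parquet"}
fetched: List[str] = []


def fetch_shard(name: str) -> List[Example]:
    fetched.append(name)  # record which shards were actually touched
    return shards[name]


example = load_example("2xyz", index, fetch_shard)
```

In a real implementation the mapping would come from the parquet index and `fetch_shard` would wrap whatever download mechanism the library uses; the point here is only that a single shard gets fetched per lookup.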

alex-hh commented 2 days ago

What we need:

- A split generator that looks for config- and split-specific index files (train_index or train/index).
- Index files let us subset both the parquet shards and the examples within them.
- We then add a ds.filter before returning the dataset. There might be an efficient Arrow way to implement the filter.

(This could also go directly into the YAML, but the index file solution is more modular.)