Add support for streaming datasets in training

OpenBioML / chemnlp

ChemNLP project

MIT License


Add support for streaming datasets in training #329

Open jackapbutler opened 1 year ago

jackapbutler commented 1 year ago

Add the ability to pass a Hugging Face dataset loaded with streaming=True into the training pipeline so we can train on very large datasets. Also understand how much slower streaming is than load_from_disk.
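One way to put a number on the slowdown is a small throughput helper that works on any iterable, so the same function can time a streamed dataset and one loaded from disk. This is only a rough sketch: `measure_throughput` is a hypothetical helper, and the dataset name and path in the trailing comments are placeholders.

```python
import time
from itertools import islice
from typing import Any, Iterable


def measure_throughput(samples: Iterable[Any], max_samples: int = 10_000) -> float:
    """Return samples per second when iterating over at most `max_samples` items."""
    start = time.perf_counter()
    count = 0
    for _ in islice(samples, max_samples):
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very small inputs.
    return count / elapsed if elapsed > 0 else float("inf")


# Hypothetical usage (dataset name and path are placeholders):
#   from datasets import load_dataset, load_from_disk
#   streamed = load_dataset("org/dataset", split="train", streaming=True)
#   on_disk = load_from_disk("/path/to/tokenised")
#   print(measure_throughput(streamed), measure_throughput(on_disk))
```

Iterating raw samples only measures data loading; timing a few real training steps would show whether the difference matters once the model is the bottleneck.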

jackapbutler commented 1 year ago

Alternatively, consider other streaming options, such as a generator wrapped in an IterableDataset or Mosaic's streaming library. Compare these with the default Arrow format and check whether the to_X methods of Dataset can be used to enable other formats.
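A minimal sketch of the generator option, assuming the pre-tokenised samples are stored one JSON object per line in a set of shard files (an assumption, not the project's current layout). `iter_tokenised_samples` and `as_iterable_dataset` are hypothetical names; `IterableDataset.from_generator` is the Hugging Face `datasets` call and is imported lazily so the generator can be used without it.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator, List


def iter_tokenised_samples(shard_paths: Iterable[str]) -> Iterator[Dict[str, List[int]]]:
    """Lazily yield one pre-tokenised sample per non-empty JSON line."""
    for path in shard_paths:
        with Path(path).open("r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


def as_iterable_dataset(shard_paths: List[str]):
    """Wrap the generator in a Hugging Face IterableDataset (imported lazily)."""
    from datasets import IterableDataset

    # Passing the shard list through gen_kwargs lets the library split
    # the list across data loader workers.
    return IterableDataset.from_generator(
        iter_tokenised_samples, gen_kwargs={"shard_paths": shard_paths}
    )
```

Only one line is held in memory at a time, so the full dataset never needs to fit on local storage or in RAM, provided the shard files themselves are readable from wherever training runs.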

jackapbutler commented 1 year ago

We can read from S3 through the Dataset.load_from_disk method, but this actually downloads the dataset to a tmp folder and then loads it into memory. We would prefer not to download the full dataset onto the EFS storage.

jackapbutler commented 1 year ago

Integrating MosaicML's streaming dataset package is blocked by https://github.com/mosaicml/streaming/issues/208, as we also require lists of integers to represent our tokenised samples. We could do tokenisation on the fly, but given we've already pre-tokenised the datasets this seems wasteful.

For now, it seems we'll be better off using a Hugging Face workaround, such as converting the datasets to JSON and uploading them to the Hub if possible.
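As a rough sketch of the JSON route, the helper below writes pre-tokenised samples (lists of integers) to JSON Lines shards using only the standard library. `write_jsonl_shards`, the `input_ids` field name, and the shard naming scheme are assumptions for illustration; for a dataset that is already loaded, `Dataset.to_json` should produce JSON Lines as well.

```python
import json
from pathlib import Path
from typing import Iterable, List


def write_jsonl_shards(
    samples: Iterable[List[int]], out_dir: str, shard_size: int = 1000
) -> List[str]:
    """Write token-id lists to JSON Lines shards and return the shard paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[str] = []
    handle = None
    try:
        for index, input_ids in enumerate(samples):
            # Start a new shard every `shard_size` samples.
            if index % shard_size == 0:
                if handle is not None:
                    handle.close()
                shard_path = out / f"shard-{index // shard_size:05d}.jsonl"
                paths.append(str(shard_path))
                handle = shard_path.open("w", encoding="utf-8")
            handle.write(json.dumps({"input_ids": list(input_ids)}) + "\n")
    finally:
        if handle is not None:
            handle.close()
    return paths


# Hypothetical usage: the shards can then be read lazily with
#   from datasets import load_dataset
#   ds = load_dataset("json", data_files=paths, split="train", streaming=True)
```

JSON keeps the lists of integers intact, which sidesteps the blocker above, at the cost of larger files and slower parsing than Arrow.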