Open jackapbutler opened 1 year ago
Also consider other streaming options, such as using a generator with `IterableDataset` or using Mosaic's `streaming` library. Compare these to the default Arrow format, and check whether the `to_X` methods of `Dataset` can be used to enable other formats.
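A minimal sketch of the generator option mentioned above, assuming the pre-tokenised samples are stored as JSON Lines shards with an `input_ids` list per row. The shard layout, the `input_ids` field name, and the helper names `iter_tokenised_samples` and `write_demo_shard` are hypothetical and not taken from the project:

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterator, List


def iter_tokenised_samples(shard_paths: List[str]) -> Iterator[Dict[str, List[int]]]:
    """Lazily yield one pre-tokenised sample at a time from JSON Lines shards."""
    for path in shard_paths:
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)


def write_demo_shard(directory: str) -> str:
    """Write a tiny hypothetical shard so the sketch can be run locally."""
    shard = Path(directory) / "shard-00000.jsonl"
    rows = [{"input_ids": [101, 7592, 102]}, {"input_ids": [101, 2088, 102]}]
    shard.write_text(
        "\n".join(json.dumps(row) for row in rows) + "\n", encoding="utf-8"
    )
    return str(shard)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shard_path = write_demo_shard(tmp)
        # Only one line is parsed at a time; the shard is never fully loaded.
        for sample in iter_tokenised_samples([shard_path]):
            print(sample["input_ids"])
```

With `datasets` installed, a generator like this can likely be wrapped via `IterableDataset.from_generator(iter_tokenised_samples, gen_kwargs={"shard_paths": [...]})`. Whether that fits the training pipeline as-is is untested here.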
We can use S3 through the `Dataset.load_from_disk` method, but this actually downloads the dataset to a tmp folder and then loads it into memory. We would prefer not to require downloading the full dataset to the EFS storage.
Integrating MosaicML's streaming dataset package is blocked by https://github.com/mosaicml/streaming/issues/208, as we also require lists of integers to represent our tokenised samples. We could do tokenisation on the fly, but given we've already pre-tokenised the datasets this seems wasteful.
Seems we'll be better off using a Hugging Face workaround for now, like converting the datasets to JSON and uploading them to the Hub if possible.
Add the ability to pass a HF dataset with `streaming=True` and run it inside the training pipeline so we can run on very large datasets. Also understand the slowdown of using streaming over `load_from_disk`.
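One rough way to quantify the slowdown of streaming relative to `load_from_disk` is to time iteration over the same number of samples from each source. The sketch below works on any iterable; the helper names `time_iteration` and `slowdown` are hypothetical, and the slow generator in the demo only stands in for a streamed dataset:

```python
import time
from itertools import islice
from typing import Any, Iterable, Iterator, Tuple


def time_iteration(samples: Iterable[Any], limit: int = 1000) -> Tuple[int, float]:
    """Consume at most `limit` samples; return (samples seen, elapsed seconds)."""
    start = time.perf_counter()
    count = sum(1 for _ in islice(samples, limit))
    return count, time.perf_counter() - start


def slowdown(baseline: Iterable[Any], candidate: Iterable[Any], limit: int = 1000) -> float:
    """Return the per-sample time of `candidate` divided by that of `baseline`.

    A value above 1 means `candidate` is slower than `baseline`.
    """
    base_count, base_elapsed = time_iteration(baseline, limit)
    cand_count, cand_elapsed = time_iteration(candidate, limit)
    if base_count == 0 or cand_count == 0:
        raise ValueError("both sources must yield at least one sample")
    base_per_sample = base_elapsed / base_count
    cand_per_sample = cand_elapsed / cand_count
    if base_per_sample == 0:
        return float("inf")  # baseline was too fast for the timer to resolve
    return cand_per_sample / base_per_sample


def simulated_stream(n: int, delay: float = 0.001) -> Iterator[int]:
    """Stand-in for a streamed dataset: each sample costs `delay` seconds."""
    for i in range(n):
        time.sleep(delay)
        yield i


if __name__ == "__main__":
    in_memory = list(range(200))
    ratio = slowdown(in_memory, simulated_stream(200), limit=200)
    print(f"simulated stream is about {ratio:.0f}x slower per sample")
```

In the real comparison, the baseline would be the dataset returned by `load_from_disk` and the candidate the one returned by `load_dataset(..., streaming=True)`. The measured ratio will depend on network conditions, caching, and how many samples are drawn, so it should be treated as indicative only.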