Open bely66 opened 3 months ago
Hello!
My recommendation is to look at the datasets documentation. To my knowledge, it lets you load large datasets without incurring large memory costs. You can load data in various formats and save it as Arrow files.
Hi @tomaarsen
Yes, iterable datasets solve the issue, but there's no direct support for them in Sentence Transformers.
That's what I was asking about: any idea how to use an iterable dataset inside my training loop?
IterableDatasets support is being added in #2792. You can already experiment with it via:
pip install git+https://github.com/tomaarsen/sentence-transformers.git@feat/streaming_datasets
Then you can load the dataset with streaming=True, e.g.

from datasets import load_dataset
train_dataset = load_dataset("sentence-transformers/gooaq", split="train", streaming=True)

Note that this loads the dataset from the Hugging Face Hub rather than from disk.
Otherwise, you can use load_from_disk, which by default doesn't load the dataset into memory, I believe: https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_from_disk
@bely66 @tomaarsen
When just iterating over a dataset, the Hugging Face datasets library does not load all the data into memory. https://huggingface.co/learn/nlp-course/en/chapter5/4
But currently the batch samplers (https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/trainer.py#L484) use SubsetRandomSampler, which tries to load random parts of the data into memory.
So in my case, shuffling the dataset before saving it to disk and using SequentialSampler instead let me train on a large dataset that couldn't be handled with the released version.
Hi Everyone,
I'm trying to run model pre-training on a large dataset (150+ GB).
I looked around for any reference on how to do that with the library's APIs, but sadly found nothing.
Any idea how I can do that, or is there an upcoming release that will resolve this?
Thank you so much for the continuous support