UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Loading a large dataset in Batches from Disk #2801

Open bely66 opened 3 months ago

bely66 commented 3 months ago

Hi Everyone,

I'm trying to run model pre-training using a large dataset (150+ GB).

I looked around for any reference on how to do that using the library's APIs, but sadly found nothing.

Any idea how I can do that, or whether there's an upcoming release that will resolve this?

Thank you so much for the continuous support

tomaarsen commented 3 months ago

Hello!

My recommendation is to look at the datasets documentation. To my knowledge, it allows you to load large datasets without incurring large memory costs. You can load data of various formats and save them as Arrow files.
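
As a rough sketch (the file pattern and paths below are just placeholders), converting raw files into an on-disk Arrow dataset looks roughly like this:

from datasets import load_dataset

# load_dataset can read csv/json/parquet/text files; the conversion happens
# in chunks and the result is written to Arrow files on disk.
raw = load_dataset("json", data_files="corpus/*.jsonl", split="train")

# Save as Arrow so it can later be memory-mapped with load_from_disk
# instead of being held fully in RAM.
raw.save_to_disk("my_arrow_dataset")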

bely66 commented 3 months ago

Hi @tomaarsen Yes, iterable datasets solve the issue, but there's no direct support for them in Sentence Transformers.

That's what I was asking about: any idea how to use an iterable dataset inside my training loop?

tomaarsen commented 3 months ago

IterableDatasets support is being added in #2792. You can already experiment with it via:

pip install git+https://github.com/tomaarsen/sentence-transformers.git@feat/streaming_datasets

Then you can load the dataset with streaming=True, e.g.

from datasets import load_dataset

train_dataset = load_dataset("sentence-transformers/gooaq", split="train", streaming=True)

Note that this example streams a dataset from the Hugging Face Hub rather than loading one from disk.
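
For reference, here's a rough sketch of how a streaming dataset could then be passed to the trainer on that experimental branch. The model, loss, and training arguments below are just placeholders, and since this is an in-progress feature the exact behavior may still change; with an IterableDataset the number of training steps has to be set explicitly.

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")

# streaming=True returns an IterableDataset instead of a map-style Dataset
train_dataset = load_dataset("sentence-transformers/gooaq", split="train", streaming=True)

args = SentenceTransformerTrainingArguments(
    output_dir="output/gooaq-streaming",
    per_device_train_batch_size=64,
    max_steps=10_000,  # required: an IterableDataset has no known length
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()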

Otherwise you can use load_from_disk; by default I believe this doesn't load the dataset into memory: https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_from_disk
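
As a quick sketch (the path is a placeholder):

from datasets import load_from_disk

# Returns a Dataset backed by memory-mapped Arrow files, so iterating
# over it does not require reading everything into RAM.
train_dataset = load_from_disk("my_arrow_dataset")
print(train_dataset)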

zxzl commented 2 months ago

@bely66 @tomaarsen

When just iterating over a dataset, Hugging Face datasets does not load all the data into memory: https://huggingface.co/learn/nlp-course/en/chapter5/4

But the current batch samplers (https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/trainer.py#L484) use SubsetRandomSampler, which tries to load random parts of the data into memory.

So in my case, shuffling the dataset before saving it to disk and using a SequentialSampler instead helped me train on a large dataset that couldn't be trained with the released version.
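
For anyone trying the same workaround, the shuffle-before-saving part is roughly this (paths are placeholders; swapping the SubsetRandomSampler for a SequentialSampler in trainer.py is a separate code change and not shown here):

from datasets import load_from_disk

dataset = load_from_disk("my_arrow_dataset")
dataset = dataset.shuffle(seed=42)     # one-time shuffle of the example order
dataset = dataset.flatten_indices()    # rewrite the data in shuffled order for fast sequential reads
dataset.save_to_disk("my_arrow_dataset_shuffled")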