foundation-model-stack / fms-fsdp

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
https://pytorch.org/docs/stable/fsdp.html
Apache License 2.0

tokenization on-the-fly for long documents #106

Open dangxuanhong opened 1 month ago

dangxuanhong commented 1 month ago

As we may have to deal with very long documents, up to millions of characters or tokens, the dataloader should be tested, and revised as needed, to ensure it can tokenize such long documents on the fly.

One approach worth considering is splitting a long document into chunks, as in the example here.
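As a rough sketch of that chunking idea (illustrative only, not the fms-fsdp dataloader; `tokenize` stands in for any callable that returns a list of token ids), a generator could tokenize the document in character windows and yield fixed-size token chunks, so a multi-million-character document never has to be tokenized in one call or held fully in memory:

```python
def chunk_document(text, tokenize, chunk_size=4096):
    """Yield lists of at most `chunk_size` tokens from `text`.

    Hypothetical sketch: tokenizes the document window by window so
    very long documents are processed incrementally.
    """
    # A rough character window; subword tokenizers average a few
    # characters per token, so this over-covers one chunk of tokens.
    char_window = chunk_size * 4
    buffer = []
    for start in range(0, len(text), char_window):
        # Tokenize one window at a time and accumulate tokens.
        buffer.extend(tokenize(text[start:start + char_window]))
        # Emit full chunks as soon as they are available.
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]
    if buffer:
        yield buffer
```

Note that tokenizing at arbitrary character boundaries can split a word across windows; a real implementation would want to cut on whitespace or use a tokenizer that supports streaming.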

thinkahead commented 1 month ago

The problem is not with long documents; I tried splitting the long documents into chunks and the issue persisted.

Removing the SamplingDataSet that is used for multi-dataset handling allows us to bypass the problem.

The SamplingDataSet provides more heterogeneity than iterating through one entire file after another, and we do want document mixing between datasets. Although the SamplingDataSet should open only one file from each dataset rather than every file, it appears to be opening all parquet files, causing the node to go out of memory.
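For the expected behavior described above, here is a minimal sketch (all names hypothetical; this is not the fms-fsdp implementation) of weighted sampling across datasets that keeps at most one shard open per dataset, because each shard is opened lazily only when the iterator reaches it:

```python
import random

def lazy_shards(paths, open_fn):
    # Open each shard only when the iterator actually reaches it,
    # so unread files stay closed.
    for path in paths:
        yield from open_fn(path)

def sampling_iterator(dataset_iters, weights, rng):
    # Draw the next document from one weight-chosen dataset at a time;
    # datasets whose shards are exhausted drop out of the rotation.
    names = list(dataset_iters)
    while names:
        name = rng.choices(names, weights=[weights[n] for n in names])[0]
        try:
            yield next(dataset_iters[name])
        except StopIteration:
            names.remove(name)
```

With this structure, only the shard currently being read from each dataset is ever open; the out-of-memory behavior reported here suggests the shards were instead being opened eagerly up front.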

daviswer commented 2 weeks ago

Checking in on the status of this: IIRC, the memory consumption ended up being related to how the legal-file detection was working?