dangxuanhong opened 1 month ago
The problem is not with long documents; I tried splitting the long documents into chunks.
Removing the SamplingDataSet that is used in multi-dataset handling allows us to bypass the problem.
The SamplingDataSet gives more heterogeneity than iterating through one entire file after another, and we do want document mixing between datasets. The SamplingDataSet shouldn't cause every file to be opened, only one from each dataset, yet it seems to be opening all parquet files, causing the node to go out of memory.
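For comparison, here's a minimal sketch of the behaviour we'd expect: at most one open parquet file per dataset, with random mixing between datasets. It assumes pyarrow, a `text` column, and a hypothetical `mixed_samples` helper; it is not the actual SamplingDataSet implementation.

```python
import random
from typing import Iterator

import pyarrow.parquet as pq


def mixed_samples(datasets: dict[str, list[str]], text_column: str = "text",
                  seed: int = 0) -> Iterator[str]:
    """Yield documents mixed across datasets while keeping at most one
    parquet file open per dataset at any time.

    `datasets` maps a dataset name to its parquet file paths;
    `text_column` names the column holding document text (an assumption).
    """
    rng = random.Random(seed)
    # One lazy iterator per dataset; each walks its files sequentially,
    # so a file is only opened when the iterator reaches it.
    iters = {name: _dataset_iter(paths, text_column)
             for name, paths in datasets.items()}
    while iters:
        name = rng.choice(list(iters))  # mix between datasets
        try:
            yield next(iters[name])
        except StopIteration:
            del iters[name]  # dataset exhausted; stop sampling from it


def _dataset_iter(paths: list[str], text_column: str) -> Iterator[str]:
    for path in paths:
        pf = pq.ParquetFile(path)  # only the current file is open
        # Stream row groups so the whole file never sits in memory at once.
        for batch in pf.iter_batches(columns=[text_column]):
            yield from batch.column(0).to_pylist()
```

If the real SamplingDataSet instead constructs a reader per file eagerly up front, that alone could explain the out-of-memory behaviour on nodes with many parquet files.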
Checking on the status of this: IIRC, the memory consumption ended up being related to how the legal-file detection was working?
As we may have to deal with very long documents, up to millions of characters/tokens, the dataloader may need to be tested and revised as needed when tokenizing these long documents on the fly. Splitting a long document into chunks is one approach worth considering; a sketch follows.
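For illustration, a minimal character-level chunking sketch; the `chunk_chars` and `overlap` values are assumed defaults for demonstration, not tuned ones.

```python
def chunk_document(text: str, chunk_chars: int = 100_000,
                   overlap: int = 512) -> list[str]:
    """Split a very long document into character chunks so the tokenizer
    never sees millions of characters at once.

    `overlap` carries a little context across chunk boundaries so tokens
    that straddle a boundary are still seen in context at least once.
    """
    if chunk_chars <= overlap:
        raise ValueError("chunk_chars must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        start += chunk_chars - overlap
    return chunks
```

Each chunk can then be tokenized independently, so peak memory scales with the chunk size rather than with document length.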