JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.
MIT License

Preprocessing for final recipe #2

Closed florianmai closed 1 year ago

florianmai commented 1 year ago

Hello!

I am wondering what the correct data preprocessing command is for the final recipe. Could you add this information to the README?

Also, is there a straightforward way to restrict memory requirements during preprocessing? It seems to use 60GB+ of RAM when reading data via gzip (using one of the preprocessing commands from scripts/preprocessing.sh). error-log.txt

JonasGeiping commented 1 year ago

Hi, I've added additional documentation for data preprocessing. In terms of data, do the sanity check datasets work for you, especially sanity-check-2?

Preprocessing is heavily parallelized. You can try reducing the number of threads via impl.threads=1 to see whether this would (slowly) preprocess the data, but some of the C4 options, such as deduplication, filtering, and streaming, have memory requirements that might be large.
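For example, something along the lines of `python pretrain.py data=c4-subset impl.threads=1` (purely illustrative; substitute whichever data config you are actually running) should force single-threaded preprocessing.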

Let me know if this continues to be a problem for you, and maybe some of the code can be rewritten to save more RAM. For C4, I think you are hitting the data-streaming bottleneck. Data is streamed and then dumped onto disk here: https://github.com/JonasGeiping/cramming/blob/8c6b1236cd9eda4f55d9ad2cf0b53d18cf079b28/cramming/data/pretraining_preparation.py#L159, which temporarily keeps the raw data in memory.
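For anyone who wants to prototype a lower-memory version of that step, the general pattern is to write the raw stream to disk in chunks instead of accumulating everything first. A minimal sketch using the HuggingFace datasets streaming API, not the repo's actual code (the chunk size and output paths are placeholders):

```python
from datasets import Dataset, load_dataset

# Stream C4 instead of materializing the raw corpus in RAM.
stream = load_dataset("c4", "en", split="train", streaming=True)

chunk, chunk_size, shard_idx = [], 1_000_000, 0  # placeholder chunk size
for example in stream:
    chunk.append({"text": example["text"]})
    if len(chunk) >= chunk_size:
        # Dump the current chunk to disk and drop it from memory.
        Dataset.from_list(chunk).save_to_disk(f"raw_c4_shard_{shard_idx}")
        chunk, shard_idx = [], shard_idx + 1

if chunk:  # flush the final partial chunk
    Dataset.from_list(chunk).save_to_disk(f"raw_c4_shard_{shard_idx}")
```

The shards can later be reloaded with `load_from_disk` and combined via `concatenate_datasets`, so only one raw chunk is ever held in memory during the writing step.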

Overall, I used to preprocess the datasets on a setup with more RAM and then loaded the cached dataset onto the 32GB RAM GPU machine. It would be nice though, if RAM usage could be kept small for everything.

ghost commented 1 year ago

@JonasGeiping Could you point us to the code in question? Any chance you are using a NumPy array that can be memory-mapped?

JonasGeiping commented 1 year ago

@centerofexcellence I think your RAM is filled at the line linked above.

I've just now updated the code to include a new option that chunks the stream-writing step. Set impl.max_raw_chunk_size to a smaller number to reduce the memory impact of this step; impl.max_raw_chunk_size=1e6 should be a decent trade-off for C4.
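As a purely illustrative combination of the two knobs mentioned in this thread, something like `python pretrain.py data=c4-subset impl.threads=4 impl.max_raw_chunk_size=1e6` trades preprocessing speed for a smaller peak-RAM footprint; the thread count here is just an example.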

Let me know if this resolves this RAM issue.

For everyone: to pin down where more memory savings are required, please try out the following datasets in order before attempting c4-subset-processed:

1. `data=sanity-check-2`
2. `data=bookcorpus-wikipedia`
3. `data=c4-subset`
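If you want to report numbers, wrapping the command in GNU time, e.g. `/usr/bin/time -v python pretrain.py data=sanity-check-2`, prints the maximum resident set size at the end, which is a convenient way to record peak RAM for each dataset.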

tensoralex commented 1 year ago

I ended up swapping heavily even for bookcorpus-wikipedia: 128GB of RAM plus 46GB of swap on a 16-core CPU, running `python pretrain.py data=bookcorpus-wikipedia dryrun=True`.

> Overall, I used to preprocess the datasets on a setup with more RAM and then loaded the cached dataset onto the 32GB RAM GPU machine. It would be nice though, if RAM usage could be kept small for everything.

How much RAM did you have?

JonasGeiping commented 1 year ago

@tensoralex Thanks for the data! Looking back at my runs, I preprocessed with even more RAM (256GB for bookcorpus-wikipedia), but did not track how much RAM was actually used. I parallelized across 24 cores, though; reducing impl.threads should strictly reduce RAM usage in exchange for speed.

JonasGeiping commented 1 year ago

In case people are not interested in changing the data and want to use the preprocessed datasets as-is, I've now added additional documentation on where to download the preprocessed data for the final recipe: https://github.com/JonasGeiping/cramming#preprocessed-datasets
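If you just want to sanity-check what you downloaded, and assuming the archive unpacks into a directory saved with datasets' `save_to_disk` (the path below is a placeholder), a quick inspection might look like:

```python
from datasets import load_from_disk

# Placeholder path: wherever the downloaded archive was unpacked.
dataset = load_from_disk("outputs/data/c4-subset-processed")
print(dataset)  # shows splits, columns, and row counts if the layout assumption holds
```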

brianchmiel commented 1 year ago

@JonasGeiping Dropbox doesn't allow downloading such big files. Is it possible to upload the `data=c4-subset-processed` dataset to Google Drive or another place that lets us download it?

Thanks!


w32zhong commented 1 year ago

@brianchmiel You can actually click into the folders and download file by file.

JonasGeiping commented 1 year ago

@brianchmiel I've now zipped them on my side to generate single files.

JonasGeiping commented 1 year ago

I hope this resolves the preprocessing questions so far; don't hesitate to re-open this issue if new problems come up!