Closed okpatil4u closed 1 year ago
Also, how long should it take to process all 1024 files?
https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-train.00000-of-01024.json.gz
73G 62GB of storage for the processed dataset. You will also need some temporary storage during preprocessing; how much depends on the amount of parallelism.
The script will not download all 1024 C4 pieces, only as many as are needed to fill up data.max_entries_in_raw_dataset. This is a download from the allenai server, so timings may vary.
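To make the "only as many pieces as needed" point concrete, here is a rough sketch of how one might estimate the number of shards fetched for a given entry cap. It is not the repository's actual download logic. The helper names `shards_needed` and `shard_urls` and the `entries_per_shard` value are assumptions for illustration; only the URL pattern is taken from the link in this thread.

```python
import math

# URL prefix copied from the shard link posted in this thread.
BASE = (
    "https://huggingface.co/datasets/allenai/c4/resolve/"
    "1ddc917116b730e1859edef32896ec5c16be51d0/en"
)
TOTAL_SHARDS = 1024


def shards_needed(max_entries: int, entries_per_shard: int) -> int:
    """Smallest number of shards that holds at least max_entries documents.

    entries_per_shard is a hypothetical average; measure it on one
    downloaded shard rather than trusting a guessed value.
    """
    if max_entries <= 0:
        return 0
    return min(TOTAL_SHARDS, math.ceil(max_entries / entries_per_shard))


def shard_urls(n: int) -> list[str]:
    """URLs of the first n training shards, following the pattern above."""
    return [
        f"{BASE}/c4-train.{i:05d}-of-{TOTAL_SHARDS:05d}.json.gz"
        for i in range(n)
    ]


if __name__ == "__main__":
    # Placeholder numbers, not values from the cramming configs.
    n = shards_needed(max_entries=1_000_000, entries_per_shard=350_000)
    for url in shard_urls(n):
        print(url)
```

Multiplying the shard count by the compressed size of one downloaded shard gives a rough lower bound on the download volume, separate from the processed-dataset size quoted above.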
For preprocessing in general, please first make sure that easier datasets work before prepping for c4-subset-processed, as mentioned here: https://github.com/JonasGeiping/cramming/issues/2#issuecomment-1369145477
Please re-open this issue if any questions remain :)
Hello,
How much storage space should I reserve to run the following recipe?
python pretrain.py name=amp_b4096_c5_o3_final arch=bert-c4 train=bert-o3 train.batch_size=4096 data=c4-subset-processed