JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.
MIT License

Storage space requirement #6

Closed (okpatil4u closed this issue 1 year ago)

okpatil4u commented 1 year ago

Hello,

How much storage space should I reserve to run the following recipe?

python pretrain.py name=amp_b4096_c5_o3_final arch=bert-c4 train=bert-o3 train.batch_size=4096 data=c4-subset-processed

okpatil4u commented 1 year ago

Also, how much time should it take to process all 1024 files?

https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-train.00000-of-01024.json.gz

JonasGeiping commented 1 year ago

Roughly 62-73GB of storage for the processed dataset, plus some amount of temporary storage during preprocessing, depending on the degree of parallelism.
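If you want to sanity-check beforehand, something like the sketch below works (not part of this repo; run it from the directory where the data will live, and note that the 150GB headroom figure is just an assumption covering the processed dataset plus temporary files):

```python
# Minimal sketch (not part of the cramming codebase): check free space on the
# drive where the processed dataset will live. Run it from your data directory.
import shutil

REQUIRED_GB = 150  # assumed headroom: ~62-73GB processed dataset + temporary shards

free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free space here: {free_gb:.0f} GB")
if free_gb < REQUIRED_GB:
    print("Probably not enough room for preprocessing plus the processed dataset.")
```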

The script will not download all 1024 C4 shards, only as many as needed to fill up data.max_entries_in_raw_dataset. The download comes from the allenai server, so timings may vary.
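If you want a rough feel for the download speed on your connection before committing, you can stream a handful of documents first. The sketch below uses the Hugging Face `datasets` streaming API with an arbitrary sample size; it is separate from the cramming preprocessing code:

```python
# Rough sketch: time how long it takes to stream a small sample of C4 from the
# Hugging Face hub, to estimate download throughput. Sample size is arbitrary.
import time
from datasets import load_dataset

stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

n_docs, n_bytes = 5_000, 0
start = time.time()
for i, doc in enumerate(stream):
    n_bytes += len(doc["text"].encode("utf-8"))
    if i + 1 >= n_docs:
        break
elapsed = time.time() - start
print(f"{n_docs} docs, {n_bytes / 1e6:.1f} MB of text in {elapsed:.1f}s "
      f"(~{n_bytes / 1e6 / elapsed:.1f} MB/s)")
```

And if you want fewer shards pulled in overall, lowering data.max_entries_in_raw_dataset via a command-line override (same syntax as the pretrain.py call above) should be the relevant knob.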

For preprocessing in general, please first make sure that easier datasets work before prepping c4-subset-processed, as mentioned here: https://github.com/JonasGeiping/cramming/issues/2#issuecomment-1369145477

JonasGeiping commented 1 year ago

See also https://github.com/JonasGeiping/cramming/issues/2#issuecomment-1375756119

JonasGeiping commented 1 year ago

Please re-open this issue if any questions remain :)