Pretraining example is not working

iloshchilov commented 2 months ago

The pretraining example with

litgpt pretrain \
   --model_name pythia-14m \
   --config https://raw.githubusercontent.com/Lightning-AI/litgpt/main/config_hub/pretrain/debug.yaml

is doing some data preprocessing (I guess), which slows down from >100 it/sec to about 14 it/sec. Overall, it takes about 1 hour, which seems >10x longer than it should for this small dataset. In the end, it does not complete: some workers still have a few items left to process but never finish them:

Worker 18 is terminating.
Worker 18 is done.
| 99996/100000 [57:22<00:00, 41.02it/s]
100%| 99986/100000 [57:23<00:00, 40.94it/s]
Progress: 92%| 45/49 [57:38<05:07, 76.85s/it]
100%| 99996/100000 [57:23<00:00, 41.58it/s]
99%| 99024/100000 [57:12<00:24, 40.35it/s]
99%| 99429/100000 [57:22<00:13, 42.13it/s]
99%| 99434/100000 [57:22<00:13, 42.12it/s]
99%| 99449/100000 [57:22<00:13, 42.32it/s]
99%| 99459/100000 [57:22<00:12, 42.21it/s]
99%| 99479/100000 [57:23<00:12, 40.69it/s]
100%| 99999/100000 [57:35<00:00, 42.37it/s]

When I relaunch that code, it restarts the whole data preprocessing from scratch. Could you please have a look at it? If I understand correctly, this pretraining example is relatively new.

rasbt commented 2 months ago

Sorry to hear about the issues here. I remember having a similar problem on a particular machine. Lowering the number of workers in the TinyStories data code (https://github.com/Lightning-AI/litgpt/blob/main/litgpt/data/tinystories.py) fixed it for me.

Maybe we should consider a lower default, what do you think @awaelchli? We could choose a low default and then display a warning, similar to PyTorch Lightning, that things could be sped up with more workers if the machine supports it.
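
As a rough illustration of the idea above (a low default plus a hint that more workers may help), here is a minimal sketch. The helper name `resolve_num_workers` and the default of 4 are assumptions made for this example; this is not litgpt's actual code.

```python
import os
import warnings


def resolve_num_workers(requested=None, conservative_default=4):
    """Pick a data-worker count. Hypothetical helper, not part of litgpt."""
    available = os.cpu_count() or 1
    if requested is not None:
        # Respect an explicit request, clamped to the range [1, available].
        return max(1, min(requested, available))
    # No explicit request: fall back to a deliberately low default.
    num_workers = min(conservative_default, available)
    if available > num_workers:
        # Tell the user that a higher count might be faster on this machine.
        warnings.warn(
            f"Using {num_workers} data workers, but this machine reports "
            f"{available} CPU cores. Raising the worker count may speed up "
            "data preparation.",
            UserWarning,
        )
    return num_workers
```

With this shape, a machine that hangs with many workers would still run with the low default, and users on larger machines would see a hint that they can opt in to more parallelism.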

ebektas commented 2 weeks ago

I'm having a similar problem as well, on an A100 cluster where 100 workers are created and, after a while, they simply hang and stop consuming CPU/RAM resources.

I am using the TextFiles data type. For starters, I would like to be able to set the number of data workers myself, to a lower count of 8-15; right now I game the code by giving it 15 files. Secondly, would it be possible to have a separate command for preparing the data altogether, so that I could pass the prebatched binaries to litgpt pretrain?

Note: I have over 1k files, each 500 MB in size.
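
To make the request above concrete, here is a small sketch of how a fixed, user-chosen worker count could cover many input files, instead of the worker count following the number of files. The function `partition_files` and the file names are hypothetical and only illustrate the idea; this is not how litgpt's TextFiles is implemented.

```python
def partition_files(files, num_workers):
    """Split a file list into round-robin shards, one per worker.

    Hypothetical helper for illustration; not part of litgpt.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Never create more shards than there are files (but at least one).
    num_workers = min(num_workers, len(files)) or 1
    return [files[i::num_workers] for i in range(num_workers)]


# Example: 1000 placeholder file names spread over 8 workers.
files = [f"shard_{i:04d}.txt" for i in range(1000)]
shards = partition_files(files, num_workers=8)
```

Each worker would then process only its own shard, so the worker count stays at 8 regardless of how many files are in the input directory.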

Pretraining example is not working #1318