Lightning-AI / litdata

Transform datasets at scale. Optimize datasets for fast AI model training.
Apache License 2.0
354 stars 40 forks

Data shard deletion with multi GPU does not work #140

Open rakro101 opened 5 months ago

rakro101 commented 5 months ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

Create a litdata dataset and stream its shards (224x224x3 images plus some text) while training Bert + Resnet on multiple GPUs, with max_cache_size="6GB".

Added a studio to reproduce the issue.

Code sample

Added a studio to reproduce the error.

Additional context

tchaton commented 5 months ago

From the logs, it seems 4 processes are downloading the chunks, but one of them deletes a chunk before the others are finished with it.

DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
Sanity Checking DataLoader 0:   0%| | 0/2 [00:00<?, ?it/s]
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-7.bin
Epoch 0:   1%|▍ | 280/20000 [05:13<6:08:00,  0.89it/s, v_num=10, train/loss=2.240, train/acc=0.124, train/f1=0.0799, train/recall=0.124, train/precision=0.0627]
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-5-8.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-5-8.bin
Epoch 0:   1%|▌ | 281/20000 [05:14<6:07:58,  0.89it/s, v_num=10, train/loss=2.210, train/acc=0.209, train/f1=0.124, train/recall=0.209, train/precision=0.116]
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-0-15.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-0-15.bin
Epoch 0:   1%|▍ | 282/20000 [05:15<6:07:54,  0.89it/s, v_num=10, train/loss=2.190, train/acc=0.130, train/f1=0.107, train/recall=0.130, train/precision=0.0993]
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-5-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-0-15.bin
Epoch 0:   1%|▌ | 283/20000 [05:16<6:07:52,  0.89it/s, v_num=10, train/loss=2.220, train/acc=0.105, train/f1=0.103, train/recall=0.105, train/precision=0.206]
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
Epoch 0:   1%|▍ | 284/20000 [05:17<6:07:48,  0.89it/s, v_num=10, train/loss=2.250, train/acc=0.0921, train/f1=0.0709, train/recall=0.0921, train/precision=0.0658]
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
Epoch 0:   1%|▍ | 285/20000 [05:18<6:07:45,  0.89it/s, v_num=10, train/loss=2.170, train/acc=0.120, train/f1=0.099, train/recall=0.120, train/precision=0.0995]
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
Epoch 0:   1%|▍ | 286/20000 [05:20<6:07:42,  0.89it/s, v_num=10, train/loss=2.190, train/acc=0.102, train/f1=0.0932, train/recall=0.102, train/precision=0.0884]
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-11-10.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-11-10.bin
Epoch 0:   1%|▍ | 287/20000 [05:21<6:07:39,  0.89it/s, v_num=10, train/loss=2.210, train/acc=0.102, train/f1=0.0824, train/recall=0.102, train/precision=0.0897]
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-17-21.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-17-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-11-10.bin
Epoch 0:   1%|▌ | 289/20000 [05:23<6:07:33,  0.89it/s, v_num=10, train/loss=2.190, train/acc=0.140, train/f1=0.107, train/recall=0.140, train/precision=0.148]
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-17-21.bin
Epoch 0:   2%|▌ | 347/20000 [06:26<6:05:01,  0.90it/s, v_num=10, train/loss=2.240, train/acc=0.105, train/f1=0.0605, t
Epoch 0:   2%| | 348/20000 [06:27<6:04:58,  0.90it/s, v_num=10, train/loss=2.240, train/acc=0.105, train/f1=0.0605, train/recall=0.105, train/precisio
Epoch 0:   2%| | 360/20000 [06:40<6:04:32,  0.90it/s, v_num=10, train/loss=2.190, train/acc=0.138, train/f1=0.0883, train/recall=0.138, train/precisio
Traceback (most recent call last):
  File "/teamspace/studios/this_studio/train.py", line 107, in <module>
...
RuntimeError: Waiting too long for the /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin to be ready
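
The pattern in these logs (a chunk is deleted while other processes still expect it) is commonly avoided by counting the readers of each chunk and deleting the file only when the last reader releases it. The following is a minimal sketch of that idea, not litdata's actual implementation. `ChunkRefCounter` and its methods are hypothetical names, and the thread lock inside a single process is only a stand-in for the cross-process coordination a real cache would need.

```python
import os
import tempfile
import threading


class ChunkRefCounter:
    """Tracks how many readers still need each cached chunk file.

    Hypothetical helper for illustration; litdata does not expose this class.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._readers = {}  # chunk path -> number of active readers

    def acquire(self, path):
        """Register one more reader for the chunk at `path`."""
        with self._lock:
            self._readers[path] = self._readers.get(path, 0) + 1

    def release(self, path):
        """Drop one reader; delete the file only if no reader remains.

        Returns True when this call removed the chunk from the cache.
        """
        with self._lock:
            remaining = self._readers.get(path, 0) - 1
            if remaining > 0:
                self._readers[path] = remaining
                return False
            self._readers.pop(path, None)
            if os.path.exists(path):
                os.remove(path)
            return True


# Simulate 4 readers sharing one chunk, as in the logs above.
cache_dir = tempfile.mkdtemp()
chunk = os.path.join(cache_dir, "chunk-24-0.bin")
with open(chunk, "wb") as f:
    f.write(b"\x00" * 16)

counter = ChunkRefCounter()
for _ in range(4):
    counter.acquire(chunk)

# Only the fourth release is allowed to delete the file.
deleted = [counter.release(chunk) for _ in range(4)]
```

With this scheme the first three releases leave the file in place, so no reader hits a "Waiting too long" timeout on a chunk that another reader has already removed.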

rakro101 commented 5 months ago

Comment: When you are using multiple GPUs, avoid creating your datasets in the __init__ method of the DataModule. (Support will be added in the future.)
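
One way to follow this advice is to keep only configuration in the constructor and build the dataset later, for example in `setup()`, which Lightning calls on every process. The thread does not say where the datasets should be created instead, so treat this as an assumption. The sketch below uses a plain class and a fake factory so it runs without any dependencies. In a real project the class would presumably subclass `LightningDataModule` and the factory would be litdata's `StreamingDataset`. The names `StreamingDataModuleSketch` and `fake_dataset` and the path `/data/optimized` are hypothetical.

```python
class StreamingDataModuleSketch:
    """Stand-in for a LightningDataModule subclass (hypothetical name)."""

    def __init__(self, input_dir, dataset_factory, max_cache_size="6GB"):
        # Store configuration only; no dataset object is built here.
        self.input_dir = input_dir
        self.dataset_factory = dataset_factory
        self.max_cache_size = max_cache_size
        self.train_dataset = None

    def setup(self, stage=None):
        # Build the dataset lazily, at most once per instance.
        if self.train_dataset is None:
            self.train_dataset = self.dataset_factory(
                self.input_dir, max_cache_size=self.max_cache_size
            )

    def train_dataloader(self):
        # A real module would wrap the dataset in a data loader here.
        if self.train_dataset is None:
            raise RuntimeError("call setup() before train_dataloader()")
        return self.train_dataset


calls = []  # records every time the factory is invoked


def fake_dataset(input_dir, max_cache_size):
    """Fake factory standing in for a streaming dataset constructor."""
    calls.append((input_dir, max_cache_size))
    return ["sample-0", "sample-1"]


dm = StreamingDataModuleSketch("/data/optimized", fake_dataset)
built_in_init = len(calls)      # nothing is created in __init__
dm.setup("fit")
built_after_setup = len(calls)  # the dataset is created in setup()
```

Deferring construction this way means each training process builds its own dataset object after it has started, rather than inheriting one created before the processes were launched.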

tchaton commented 5 months ago

Hey @rakro101, do you think you could contribute an example with PyTorch Lightning to the repo?

deeptimhe commented 5 months ago

> Hey @rakro101 do you think you could contribute an example with PyTorch Lightning to the repo ?

Looking forward to the examples!