miguelalba96 opened this issue 1 month ago
Hey @miguelalba96. Thanks for reporting this issue.
Would you mind printing the length of each dataset and dataloader on each rank? It usually hangs when one rank has more data than the others. That shouldn't happen, but I want to rule out this eventuality.
Do you think you could share a tiny reproducible example with dummy data for me to debug?
Best, T.C
When printing `len(dataset)` and `len(dataloader)` for the ranks on each node, I get a homogeneous number of samples on each:
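For reference, this is roughly the check I ran (a minimal sketch; `fabric`, `dataset`, and `dataloader` are the objects from my training loop):

```python
# Plain print() so every rank reports, not only rank 0
# (fabric.print would only log from the global-zero rank).
print(
    f"node_rank={fabric.node_rank} "
    f"global_rank={fabric.global_rank} "
    f"local_rank={fabric.local_rank} "
    f"len(dataset)={len(dataset)} "
    f"len(dataloader)={len(dataloader)}"
)
```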
Not sure how to reproduce this problem; I will check. I also noticed that when I load the state to resume training using the `get_state` function I wrote above, the dataloader doesn't seem to resume properly and iterates all over again through the data until it hangs 🤔:
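For context, this is roughly how I save and restore progress (a minimal sketch, assuming litdata's `StreamingDataLoader.state_dict()` / `load_state_dict()`; my actual `get_state` helper is not shown):

```python
# Saving: Fabric calls .state_dict() on the module/optimizer objects;
# the dataloader position is captured explicitly as a plain dict.
state = {
    "model": model,
    "optimizer": optimizer,
    "dataloader": train_dataloader.state_dict(),
}
fabric.save("checkpoint.ckpt", state)

# Resuming: fabric.load restores model/optimizer in place and returns
# the keys it did not consume; the dataloader state is re-applied
# manually so iteration continues where it stopped.
remainder = fabric.load("checkpoint.ckpt", {"model": model, "optimizer": optimizer})
train_dataloader.load_state_dict(remainder["dataloader"])
```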
Hey @miguelalba96, any chance you could create a reproducible Studio on https://lightning.ai/ that I can duplicate to investigate what's happening? Otherwise, it is hard for me to help you.
🐛 Bug
I am training CLIP using PyTorch Lightning Fabric + litdata on a distributed setup (4 nodes, 4 GPUs each). I noticed that when finishing the 1st epoch, the training dataloaders hang on some nodes.
The image below shows the `fabric.print()` logging on the 4 nodes before finishing an epoch (I print every 25 steps). Only one rank finishes successfully; the rest hang. Otherwise, the message `++++ Epoch: 0 completed ++++` would appear 4 times, once on each node. I share the relevant parts of the code below; any help would be appreciated.

Additional context + Parts of Training code
I tried `mosaicml-streaming` with the same Fabric training code and it works without issues.

StreamingDataset setup
I am using the following script to load the data with litdata:
Am I setting up the dataloader properly here? I checked, and litGPT uses the torch `DataLoader` instead of `StreamingDataLoader`.
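For reference, a minimal version of the setup in question (a sketch with hypothetical paths and hyperparameters, not my exact script):

```python
from litdata import StreamingDataset, StreamingDataLoader

# Hypothetical input_dir pointing at the optimized shards.
train_dataset = StreamingDataset(
    input_dir="s3://my-bucket/clip/train",
    shuffle=True,
    # drop_last=True should give every rank the same number of
    # batches, which is exactly what the hang theory above is about.
    drop_last=True,
)

# litdata recommends StreamingDataLoader over torch's DataLoader so
# that the loader position can be captured/restored via state_dict().
train_dataloader = StreamingDataLoader(
    train_dataset,
    batch_size=64,
    num_workers=4,
    pin_memory=True,
)
```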
Here is what I managed to monitor of how CPU and RAM usage look over an entire epoch:
You can see how, instead of jumping back in to load the samples of the test set, it hangs...
Training using Fabric
I put here some parts of my training script, which basically follows the open-clip implementation but uses Lightning Fabric. Maybe I am doing something wrong when the epoch is finishing? I noticed `state` is not saved at the end of the epoch on the last iteration (only in the middle of training, on `checkpoint_step`):
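For illustration, a minimal sketch of also saving state once the epoch finishes (not my actual script; `num_epochs`, `checkpoint_step`, and `save_state` are placeholder names, and I assume Fabric's `fabric.save` plus the dataloader's `state_dict()`):

```python
for epoch in range(num_epochs):
    for step, batch in enumerate(train_dataloader):
        ...  # forward / backward / optimizer step
        if step % checkpoint_step == 0:
            save_state()  # existing mid-epoch checkpoint (placeholder helper)

    # Also persist everything once the epoch finishes, so a resume
    # does not replay the epoch from the beginning.
    fabric.save(
        f"epoch-{epoch}.ckpt",
        {
            "model": model,
            "optimizer": optimizer,
            "dataloader": train_dataloader.state_dict(),
            "epoch": epoch,
        },
    )
    fabric.print(f"++++ Epoch: {epoch} completed ++++")
```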
Environment

europe-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13.py310:latest
Expected behavior
The training dataloader finishes the epoch, and the rest of the code continues its execution.