hubenjm opened this issue 3 months ago
Hi! Thanks for your contribution, great first issue!
Hey @hubenjm, did you provide drop_last=True to the StreamingDataLoader for the training dataset? Could you share a reproducible script or the code of your training dataset?
Hey @hubenjm. Any updates?
@tchaton Thanks for the suggestions. I am currently trying to run my code again while explicitly setting drop_last=True when instantiating any StreamingDataset objects.
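For reference, a minimal sketch of what that configuration looks like; the bucket path and batch size below are placeholders, not values from the original code:

```python
from litdata import StreamingDataset, StreamingDataLoader

# Hypothetical S3 prefix; substitute your own optimized-data location.
train_dataset = StreamingDataset(
    input_dir="s3://<your-bucket>/toy-combined-dataset-example/optimized-data/train/0/",
    shuffle=True,
    drop_last=True,  # drop the last partial batch so every rank sees the same count
)

train_loader = StreamingDataLoader(
    train_dataset,
    batch_size=32,
    num_workers=4,
    drop_last=True,  # mirror the setting on the loader as well
)
```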
OK, that did not work either, so I am going to have to work on creating a simpler example that replicates the problem.
To replicate the problem, first run

python generate_optimized_data.py --data-s3-prefix s3://<whatever-bucket-you-own-is>/toy-combined-dataset-example/optimized-data/

Then, after that data is generated in S3, follow the submit_training_job.ipynb notebook to submit a training job in SageMaker with e.g. 2 nodes (replace the s3_data_input_prefix variable with the same value used above, and change the num_nodes variable from 2 to 1, or whatever you want to test).
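For anyone without the notebook handy, here is a hedged sketch of what such a submission might look like with the SageMaker Python SDK; the role, instance type, and hyperparameter names are illustrative assumptions, not taken from submit_training_job.ipynb:

```python
from sagemaker.pytorch import PyTorch

# Illustrative values only; the actual notebook may differ.
estimator = PyTorch(
    entry_point="train.py",
    source_dir=".",
    role="<your-sagemaker-execution-role-arn>",
    instance_count=2,                       # num_nodes
    instance_type="ml.g5.12xlarge",         # assumption: any multi-GPU instance type
    framework_version="2.3.0",
    py_version="py311",
    distribution={"torch_distributed": {"enabled": True}},  # launch training via torchrun
    hyperparameters={
        "train-inputs-s3-prefix": "s3://<your-s3-bucket-name>/toy-combined-dataset-example/optimized-data/train/",
        "train-inputs": "0,1,2",
        "train-weight-factors": "0.2,0.6,0.2",
        "batch-size": 32,
        "num-workers": 4,
    },
)

estimator.fit()
```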
If you want to run the training code on your own cluster via torchrun directly, you can do

torchrun --nnodes=1 --nproc_per_node=8 train.py --train-inputs-s3-prefix s3://<your-s3-bucket-name>/toy-combined-dataset-example/optimized-data/train/ --train-inputs 0,1,2 --val-input s3://<your-s3-bucket-name>/toy-combined-dataset-example/optimized-data/val/ --train-weight-factors 0.2,0.6,0.2 --precision bf16-mixed --accumulate_grad_batches 1 --batch-size 32 --gradient_clip_val 5.0 --gradient_clip_algorithm norm --num-workers 4 --num_nodes 1 --enable_progress_bar False --sync_batchnorm True --accelerator auto --devices auto --log_every_n_steps 10 --output-dir /home/ec2-user/SageMaker/toy-combined-dataset-model/ --strategy ddp --max_epochs 20 --check_val_every_n_epoch 1

or replace it with --nnodes=2, and so on.
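For context, here is a rough sketch of how train.py presumably wires those arguments into litdata; the variable names and values are my own mapping of the CLI flags above, not the actual script code:

```python
from litdata import CombinedStreamingDataset, StreamingDataset, StreamingDataLoader

# Assumed mapping of the CLI flags above; not the real train.py.
train_prefix = "s3://<your-s3-bucket-name>/toy-combined-dataset-example/optimized-data/train/"
train_inputs = ["0", "1", "2"]    # --train-inputs 0,1,2
weights = [0.2, 0.6, 0.2]         # --train-weight-factors 0.2,0.6,0.2

datasets = [
    StreamingDataset(input_dir=f"{train_prefix}{name}/", shuffle=True, drop_last=True)
    for name in train_inputs
]

combined = CombinedStreamingDataset(
    datasets=datasets,
    weights=weights,
    iterate_over_all=False,  # sample according to weights instead of exhausting every dataset
)

train_loader = StreamingDataLoader(combined, batch_size=32, num_workers=4, drop_last=True)
```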
NOTE that I only replicated the error using the SageMaker training job approach above, but I don't think there's any significant difference between running it there versus on a self-managed cluster, since under the hood SageMaker executes a very similar torchrun command to the one above.

With the code example and arguments above, I got a softlock at around epoch 5 with 2 nodes. With 1 node it runs fine.
OK, another update. I tried running the same code but without the CombinedStreamingDataset class; instead I used a single StreamingDataset for the training set. With 2 nodes, this still leads to an NCCL timeout error, so I am wondering if there might be two different process groups being used somehow?

My next step will be to try getting a multi-node SageMaker training job working with the same code but with the dataset/dataloader replaced by the standard torch Dataset and DataLoader classes. If that doesn't work, then I suppose this issue is moot and the problem is something else. But it would be very useful to a lot of folks in general to be able to use LitData and Lightning effectively with multi-node SageMaker training jobs.
Hey @hubenjm, could you check the dataset length or the number of batches read on each rank? This can happen if somehow the length wasn't inferred properly and one rank gets more data. We thought we had fixed all of those cases, but it seems there might still be some issues. You can try the Lightning Platform if you want to try multi-node with a lot more ease.
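One way to do that per-rank check is a small diagnostic like the sketch below; this is my own helper, not code from litdata or the repro:

```python
import os
import torch.distributed as dist

def log_batches_per_rank(dataloader):
    """Count the batches actually yielded on this rank to spot per-rank mismatches."""
    rank = dist.get_rank() if dist.is_initialized() else int(os.environ.get("RANK", 0))
    try:
        reported = len(dataloader)          # what the loader claims, if it knows
    except TypeError:
        reported = "unknown"
    actual = sum(1 for _ in dataloader)     # what it actually yields (consumes one pass)
    print(f"[rank {rank}] reported={reported} actual={actual}", flush=True)

# Usage (before handing the loader to the Trainer):
# log_batches_per_rank(train_loader)
```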
Hey @hubenjm. If you are available next week, let's try to reproduce this issue on Lightning.ai together. If I can reproduce it, I can fix it.
@tchaton Sure, I will try to help out. As an update, I ran some more tests a couple of weeks ago and found behavior specific to SageMaker related to use_distributed_sampler being set to True. I can work on streamlining my code example to make it easier to work with. My current guess is that the problem lies somehow in how the distributed process group is being set up with StreamingDataLoader versus with the standard torch DataLoader, and maybe it has to do with some behind-the-scenes setup that SageMaker does with environment variables and with renaming the hosts to 'algo-1', 'algo-2', etc.
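For concreteness, this is the Trainer flag in question. A hedged sketch follows; whether True or False is the right value here is exactly what is under investigation, so treat the setting below as illustrative only:

```python
import lightning as L

# Illustrative Trainer configuration; only use_distributed_sampler is the point here.
trainer = L.Trainer(
    accelerator="auto",
    devices="auto",
    num_nodes=2,
    strategy="ddp",
    # When True (the default), Lightning injects a DistributedSampler into plain
    # DataLoaders. A StreamingDataLoader already shards data per rank itself, so this
    # interaction is a plausible place for a per-rank batch-count mismatch.
    use_distributed_sampler=True,
)
```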
Hey @hubenjm. This happens if the number of batches isn't the same on all ranks. For the training streaming dataset, do you provide drop_last=True?

Yes, a reproducible example would be super helpful.
Yes, I do set drop_last=True in both the StreamingDataset and StreamingDataLoader classes.
I have a new code example that I have verified fails in SageMaker with 2 nodes when using StreamingDataset, and works fine when using a simple random-image generator class with the standard torch DataLoader. I will attach it below; it includes a README file with instructions on how to reproduce the issue in SageMaker. I suppose the next step would be to adapt the same example to work with Lightning Studio and see if it works there.
litdata_multinode_example_code.tar.gz
From the README.md in the attached .tar.gz:
This code is intended to test the ability to run distributed (DDP) training jobs in SageMaker with multiple nodes using PyTorch Lightning, with or without LitData StreamingDataset as the data source.

- In the constants.py file, change MY_S3_PREFIX to your own S3 bucket prefix that you want to use for storing data and artifacts.
- Run the generate_optimized_data.py script. Specifically:
cd <directory_of_this_code_folder>
source activate pytorch_p310
pip install litdata
python generate_optimized_data.py
- Follow the submit_training_job.ipynb notebook.
- Set the num_nodes parameter to 2 to initiate a multi-node training job. You can also specify instance_type.
- Set the use-litdata parameter to True or False to run the training code either with a litdata StreamingDataset or with a native PyTorch local random-image dataset and the standard DataLoader class.
- Training works with use-litdata = False and num_nodes = 2, but fails when use-litdata = True.
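For readers who don't want to open the archive, the data-generation step presumably boils down to something like the litdata.optimize call below; the sample format, sample count, and chunk size are my guesses, not the script's actual values:

```python
import numpy as np
from litdata import optimize

def make_random_sample(index):
    # Hypothetical sample format: a random image plus a label, keyed by index.
    image = np.random.randint(0, 256, size=(3, 224, 224), dtype=np.uint8)
    label = index % 10
    return {"index": index, "image": image, "label": label}

if __name__ == "__main__":
    optimize(
        fn=make_random_sample,
        inputs=list(range(10_000)),   # assumed number of toy samples
        output_dir="s3://<your-bucket>/toy-combined-dataset-example/optimized-data/train/0/",
        chunk_bytes="64MB",           # assumed chunk size
        num_workers=4,
    )
```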
Thanks @hubenjm. Multi-node on Lightning.AI is much simpler and cheaper than SageMaker. You should give it a try. It also supports fault tolerance with automatic restarts.
Here are the docs: https://lightning.ai/docs/overview/train-models/multi-node-training.
I will try to find some time to look into this. Thanks.
🐛 Bug
I'm running a training job with 2 nodes in SageMaker, using torchrun to launch. I'm using a CombinedStreamingDataset for the training dataset with train_weight_factors = [0.8, 0.07, 0.07, 0.07]. The training stops printing log messages after some fixed number of batches (depending on the random seed, I guess); where the training stops is deterministic if the seed is fixed, based on my experiments. Then the NCCL timeout triggers an exception after 30 minutes. The training code works fine on a single node, though.

To Reproduce
Use a CombinedStreamingDataset for the training dataset with train_weight_factors not None and iterate_over_all = False. Launch training with torchrun with num_nodes > 1.

Code sample
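The original report left this section empty; below is a minimal hedged sketch of the setup described above, where the paths, batch size, and LightningModule are placeholders rather than the reporter's actual code:

```python
import lightning as L
from litdata import CombinedStreamingDataset, StreamingDataset, StreamingDataLoader

# Placeholder S3 prefixes standing in for the four training shards.
prefixes = [f"s3://<bucket>/optimized-data/train/{i}/" for i in range(4)]
datasets = [StreamingDataset(input_dir=p, shuffle=True, drop_last=True) for p in prefixes]

train_dataset = CombinedStreamingDataset(
    datasets=datasets,
    weights=[0.8, 0.07, 0.07, 0.07],  # train_weight_factors from the report
    iterate_over_all=False,
)
train_loader = StreamingDataLoader(train_dataset, batch_size=32, num_workers=4, drop_last=True)

# `MyModule` stands in for the real LightningModule, which is not part of this report.
trainer = L.Trainer(strategy="ddp", accelerator="auto", devices="auto", num_nodes=2)
# trainer.fit(MyModule(), train_dataloaders=train_loader)
```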
Expected behavior
Training should not softlock in the middle of an epoch.
Environment
How you installed PyTorch (conda, pip, source): SageMaker prebuilt deep learning container (763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker, see https://github.com/aws/deep-learning-containers/blob/master/available_images.md)

Additional context
If you have any other suggestions about why multi-node training with CombinedStreamingDataset would fail like this, any help is appreciated.