Open WangWenhao0716 opened 7 months ago
Thanks for this excellent work.
When I try to run it on multiple nodes with 16 GPUs, there is an error:
Traceback (most recent call last):
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/train.py", line 271, in <module>
    main(args)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/train.py", line 208, in main
    loss_dict = diffusion.training_losses(model, x, t, model_kwargs)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/diffusion/respace.py", line 97, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/diffusion/gaussian_diffusion.py", line 747, in training_losses
    model_output = model(x_t, t, **model_kwargs)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/diffusion/respace.py", line 130, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1103, in _run_ddp_forward
    inputs, kwargs = _to_kwargs(
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 100, in _to_kwargs
    _recursive_to(inputs, device_id, use_side_stream_for_tensor_copies)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 92, in _recursive_to
    res = to_map(inputs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 83, in to_map
    return list(zip(*map(to_map, obj)))
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 65, in to_map
    stream = _get_stream(target_gpu)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/parallel/_functions.py", line 122, in _get_stream
    if _streams[device] is None:
IndexError: list index out of range
After investigating, I find it comes from line 149 of train.py:
model = DDP(model.to(device), device_ids=[rank])
It should be:
model = DDP(model.to(device), device_ids=[device])
Then everything works normally.
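For context, the reason this only breaks with multiple nodes: rank is the global rank (0 to 15 with 16 GPUs), while device_ids expects a CUDA device index that is local to the node, so on the second node device_ids=[rank] points past the GPUs that actually exist there. A minimal sketch of the intended setup (the model below is a placeholder, and the local-device computation assumes every node has the same number of GPUs):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                # launched via torchrun / torch.distributed.launch
rank = dist.get_rank()                         # global rank: 0..15 across 2 nodes x 8 GPUs
device = rank % torch.cuda.device_count()      # local GPU index on this node: 0..7
torch.cuda.set_device(device)

model = torch.nn.Linear(4, 4)                  # placeholder for the real DiT model
# device_ids must be the local CUDA device, not the global rank:
model = DDP(model.to(device), device_ids=[device])

On a single node the global rank and the local device coincide, which is why the original line works there.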
Hi, I have some problems when I train with multiple nodes.
There is too little information to tell what happened.
There is too little information to tell what happened.

Hi, when I train with multiple nodes, the code gets stuck at dist.init_process_group('nccl') and never reaches the training step; the exact message is in the image above. I don't have this problem when I use one node. My multi-node command is: srun -N 4 --gres=gpu:v100:2 -p gpu python -m torch.distributed.launch --nnodes=4 --nproc_per_node=2 train.py --model DiT-XL/2 --data-path ./ImageNet/train. I'm not sure what the problem is. Is the command missing some information, and is that what causes the multi-node training to fail?
I'm not sure whether it is because of srun or PyTorch. Please try training without Slurm first.
I'm not sure whether it is because of srun or PyTorch. Please try training without Slurm first.

May I ask what command you use for multi-node training?
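For reference, one common reason torch.distributed.launch hangs at dist.init_process_group when launched through srun is that --node_rank, --master_addr, and --master_port keep their single-node defaults, so every node waits on its own localhost rendezvous and the group never fills up. A sketch of launching the same job without Slurm, run once per node (the address 10.0.0.1 and port 29500 are placeholders, not values from this thread):

# On node 0 (assuming 10.0.0.1 is node 0's address):
python -m torch.distributed.launch --nnodes=4 --nproc_per_node=2 --node_rank=0 --master_addr=10.0.0.1 --master_port=29500 train.py --model DiT-XL/2 --data-path ./ImageNet/train

# On node 1 (and node_rank=2, 3 on the remaining nodes):
python -m torch.distributed.launch --nnodes=4 --nproc_per_node=2 --node_rank=1 --master_addr=10.0.0.1 --master_port=29500 train.py --model DiT-XL/2 --data-path ./ImageNet/train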