Open WangWenhao0716 opened 7 months ago
Thanks for this excellent work.
When I try to run it on multiple nodes with 16 GPUs, there is an error:
Traceback (most recent call last):
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/train.py", line 271, in <module>
    main(args)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/train.py", line 208, in main
    loss_dict = diffusion.training_losses(model, x, t, model_kwargs)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/diffusion/respace.py", line 97, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/diffusion/gaussian_diffusion.py", line 747, in training_losses
    model_output = model(x_t, t, **model_kwargs)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/diffusion/respace.py", line 130, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1103, in _run_ddp_forward
    inputs, kwargs = _to_kwargs(
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 100, in _to_kwargs
    _recursive_to(inputs, device_id, use_side_stream_for_tensor_copies)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 92, in _recursive_to
    res = to_map(inputs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 83, in to_map
    return list(zip(*map(to_map, obj)))
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 65, in to_map
    stream = _get_stream(target_gpu)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/parallel/_functions.py", line 122, in _get_stream
    if _streams[device] is None:
IndexError: list index out of range
After investigating, I find it comes from line 149 of train.py:
model = DDP(model.to(device), device_ids=[rank])
It should be:
model = DDP(model.to(device), device_ids=[device])
Then everything works normally.
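For context, the reason this only breaks with multiple nodes: rank is the global rank (0 to 15 with 16 GPUs), while device_ids expects a CUDA device index that is local to the node, so on the second node device_ids=[rank] points past the GPUs that actually exist there. A minimal sketch of the intended setup (the model below is a placeholder, and the local-device computation assumes every node has the same number of GPUs):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                # launched via torchrun / torch.distributed.launch
rank = dist.get_rank()                         # global rank: 0..15 across 2 nodes x 8 GPUs
device = rank % torch.cuda.device_count()      # local GPU index on this node: 0..7
torch.cuda.set_device(device)

model = torch.nn.Linear(4, 4)                  # placeholder for the real DiT model
# device_ids must be the local CUDA device, not the global rank:
model = DDP(model.to(device), device_ids=[device])

On a single node the global rank and the local device coincide, which is why the original line works there.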
Hi, I have some problems when I train with multiple nodes.
There is too little information to tell what happened.
There is too little information to tell what happened.

Hi, when I train with multiple nodes, the code gets stuck at dist.init_process_group('nccl') and never reaches the training step; the exact message is in the image above. I don't have this problem when I use one node. My multi-node command is: srun -N 4 --gres=gpu:v100:2 -p gpu python -m torch.distributed.launch --nnodes=4 --nproc_per_node=2 train.py --model DiT-XL/2 --data-path ./ImageNet/train. I'm not sure what the problem is. Is the command missing some information, and is that what causes the multi-node training to fail?
I'm not sure whether it is because of srun or PyTorch. Please try training without Slurm first.
I'm not sure whether it is because of srun or PyTorch. Please try training without Slurm first.

May I ask what command you use for multi-node training?
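For reference, one common reason torch.distributed.launch hangs at dist.init_process_group when launched through srun is that --node_rank, --master_addr, and --master_port keep their single-node defaults, so every node waits on its own localhost rendezvous and the group never fills up. A sketch of launching the same job without Slurm, run once per node (the address 10.0.0.1 and port 29500 are placeholders, not values from this thread):

# On node 0 (assuming 10.0.0.1 is node 0's address):
python -m torch.distributed.launch --nnodes=4 --nproc_per_node=2 --node_rank=0 --master_addr=10.0.0.1 --master_port=29500 train.py --model DiT-XL/2 --data-path ./ImageNet/train

# On node 1 (and node_rank=2, 3 on the remaining nodes):
python -m torch.distributed.launch --nnodes=4 --nproc_per_node=2 --node_rank=1 --master_addr=10.0.0.1 --master_port=29500 train.py --model DiT-XL/2 --data-path ./ImageNet/train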