facebookresearch / DiT

Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"

Bug Fixes and Multi-node Support #79

Open WangWenhao0716 opened 7 months ago

WangWenhao0716 commented 7 months ago

Thanks for this excellent work.

When I try to run it on multiple nodes with 16 GPUs, I get an error:

Traceback (most recent call last):
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/train.py", line 271, in <module>
    main(args)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/train.py", line 208, in main
    loss_dict = diffusion.training_losses(model, x, t, model_kwargs)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/diffusion/respace.py", line 97, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/diffusion/gaussian_diffusion.py", line 747, in training_losses
    model_output = model(x_t, t, **model_kwargs)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/diffusion/respace.py", line 130, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1103, in _run_ddp_forward
    inputs, kwargs = _to_kwargs(
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 100, in _to_kwargs
    _recursive_to(inputs, device_id, use_side_stream_for_tensor_copies)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 92, in _recursive_to
    res = to_map(inputs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 83, in to_map
    return list(zip(*map(to_map, obj)))
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 65, in to_map
    stream = _get_stream(target_gpu)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/parallel/_functions.py", line 122, in_get_stream
    if _streams[device] is None:
IndexError: list index out of range

After investigating, I found that it comes from line 149 of train.py:

model = DDP(model.to(device), device_ids=[rank])

It should be:

model = DDP(model.to(device), device_ids=[device])

With this change, everything works normally.
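For context, here is a minimal sketch of why the global rank cannot double as the CUDA device index once more than one node is involved. This is simplified and not the exact train.py code; the toy nn.Linear stands in for the DiT model.

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the usual launcher-provided env vars (RANK, WORLD_SIZE, MASTER_ADDR, ...) are set.
dist.init_process_group("nccl")
rank = dist.get_rank()                     # global rank: 0..15 with 2 nodes x 8 GPUs
device = rank % torch.cuda.device_count()  # local GPU index: 0..7 on each node
torch.cuda.set_device(device)

model = nn.Linear(16, 16).to(device)       # toy stand-in for the DiT model
# device_ids must hold the *local* CUDA index. On the second node the global
# rank is 8..15, so device_ids=[rank] points at a GPU that does not exist
# there, which is what raises the IndexError in the traceback above.
model = DDP(model, device_ids=[device])

On a single node, rank and device happen to coincide, which is why the bug only shows up in the multi-node setting.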

hustzyj commented 5 months ago

Hi, I have some problems when I train on multiple nodes:

[Screenshot 2024-06-09 131303]
WangWenhao0716 commented 5 months ago

There is too little information here to tell what happened.

hustzyj commented 5 months ago

Hi, when I train on multiple nodes, the code gets stuck at dist.init_process_group('nccl') and never proceeds to the training loop; the exact message is in the screenshot above. I don't have this problem on a single node. My multi-node command is:

srun -N 4 --gres=gpu:v100:2 -p gpu python -m torch.distributed.launch --nnodes=4 --nproc_per_node=2 train.py --model DiT-XL/2 --data-path ./ImageNet/train

I'm not sure what the problem is. Could it be that the command is missing some information, and that is what causes multi-node training to fail?
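One way to narrow this down (a small diagnostic sketch, not part of the repo) is to print the rendezvous environment variables each process sees and to give init_process_group a finite timeout, so a silent hang turns into an explicit error:

import datetime
import os

import torch.distributed as dist

# If MASTER_ADDR, MASTER_PORT, RANK, or WORLD_SIZE is missing or inconsistent
# across the nodes, init_process_group("nccl") blocks exactly as described.
for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK", "WORLD_SIZE"):
    print(f"{key}={os.environ.get(key)}", flush=True)

dist.init_process_group("nccl", timeout=datetime.timedelta(minutes=5))
print(f"rank {dist.get_rank()} of {dist.get_world_size()} is up", flush=True)

Setting NCCL_DEBUG=INFO in the environment also makes NCCL log what it is waiting on.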

WangWenhao0716 commented 5 months ago

I'm not sure whether the problem comes from srun or from PyTorch. Please try training without Slurm first, to rule it out.
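If Slurm does turn out to be the variable, one pattern people sometimes use instead of nesting torch.distributed.launch inside srun is to let srun start one task per GPU and translate Slurm's variables into the ones torch.distributed expects. The sketch below assumes exactly that launch mode and that MASTER_ADDR/MASTER_PORT are exported for the job; it is not something the repo provides.

import os

import torch
import torch.distributed as dist

# SLURM_PROCID / SLURM_NTASKS / SLURM_LOCALID are set by srun for each task.
os.environ.setdefault("RANK", os.environ.get("SLURM_PROCID", "0"))
os.environ.setdefault("WORLD_SIZE", os.environ.get("SLURM_NTASKS", "1"))
os.environ.setdefault("LOCAL_RANK", os.environ.get("SLURM_LOCALID", "0"))

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
print(f"rank {dist.get_rank()} running on local GPU {local_rank}", flush=True)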

hustzyj commented 5 months ago

May I ask what commands you use for multi-node training?