OFA-Sys / OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Apache License 2.0

How to enable multi-GPU training? #352

Closed · JulioZhao97 closed this issue 1 year ago

JulioZhao97 commented 1 year ago

When I try to pretrain the base model, I enable multi-GPU training in pretrain_base.sh like this:

export CUDA_VISIBLE_DEVICES=0,1,2,3
export GPUS_PER_NODE=4
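
For context, these variables feed the launch line at the bottom of pretrain_base.sh, which (roughly, with the task-specific flags omitted) looks like this:

# sketch of the launch line in pretrain_base.sh; GPUS_PER_NODE controls how many
# worker processes torch.distributed.launch spawns on this node
python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} --master_port=${MASTER_PORT} ../../train.py \
    ...   # data path and the remaining task/optimization flags as in the original script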

However, the run then fails with the following error:

2023-02-10 11:17:26 - utils.py[line:759] - INFO: ***********************CUDA enviroments for all 4 workers***********************
2023-02-10 11:17:26 - utils.py[line:765] - INFO: rank   0: capabilities =  8.0  ; total memory = 79.347 GB ; name = NVIDIA A100-SXM4-80GB                   
2023-02-10 11:17:26 - utils.py[line:765] - INFO: rank   1: capabilities =  8.0  ; total memory = 79.347 GB ; name = NVIDIA A100-SXM4-80GB                   
2023-02-10 11:17:26 - utils.py[line:765] - INFO: rank   2: capabilities =  8.0  ; total memory = 79.347 GB ; name = NVIDIA A100-SXM4-80GB                   
2023-02-10 11:17:26 - utils.py[line:765] - INFO: rank   3: capabilities =  8.0  ; total memory = 79.347 GB ; name = NVIDIA A100-SXM4-80GB                   
2023-02-10 11:17:26 - utils.py[line:767] - INFO: ***********************CUDA enviroments for all 4 workers***********************
2023-02-10 11:17:26 - train.py[line:154] - INFO: training on 4 devices (GPUs/TPUs)
2023-02-10 11:17:26 - train.py[line:160] - INFO: max tokens per device = None and max sentences per device = 4
local datafile ../../dataset/pretrain_data/vision_language_examples.tsv slice_id 2 begin to initialize row_count and line_idx-to-offset mapping
local datafile ../../dataset/pretrain_data/vision_language_examples.tsv slice_id 1 begin to initialize row_count and line_idx-to-offset mapping
local datafile ../../dataset/pretrain_data/vision_language_examples.tsv slice_id 3 begin to initialize row_count and line_idx-to-offset mapping
2023-02-10 11:17:26 - trainer.py[line:458] - INFO: Preparing to load checkpoint ../../checkpoints/ofa_base.pt
2023-02-10 11:17:26 - trainer.py[line:624] - INFO: No existing checkpoint found ../../checkpoints/ofa_base.pt
2023-02-10 11:17:26 - trainer.py[line:639] - INFO: loading train data for epoch 1
local datafile ../../dataset/pretrain_data/vision_language_examples.tsv slice_id 0 begin to initialize row_count and line_idx-to-offset mapping
local datafile ../../dataset/pretrain_data/vision_language_examples.tsv slice_id 2 finished initializing row_count and line_idx-to-offset mapping
file ../../dataset/pretrain_data/vision_language_examples.tsv slice_id 2 row count 75 total row count 300
local datafile ../../dataset/pretrain_data/vision_language_examples.tsv slice_id 0 finished initializing row_count and line_idx-to-offset mapping
local datafile ../../dataset/pretrain_data/vision_language_examples.tsv slice_id 3 finished initializing row_count and line_idx-to-offset mapping
file ../../dataset/pretrain_data/vision_language_examples.tsv slice_id 0 row count 75 total row count 300
file ../../dataset/pretrain_data/vision_language_examples.tsv slice_id 3 row count 75 total row count 300
local datafile ../../dataset/pretrain_data/vision_language_examples.tsv slice_id 1 finished initializing row_count and line_idx-to-offset mapping
file ../../dataset/pretrain_data/vision_language_examples.tsv slice_id 1 row count 75 total row count 300
/nvme/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.7/site-packages/torchvision/transforms/functional.py:405: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
  "Argument interpolation should be of type InterpolationMode instead of int. "
[the warning above is emitted once by each of the 4 workers]
slice_id 2 seek offset 150
slice_id 3 seek offset 225
slice_id 1 seek offset 75
Total steps 950, warmup steps 9, warmup_factor 0.1111111111111111
Total steps 950, warmup steps 9, warmup_factor 0.1111111111111111
slice_id 0 seek offset 0
Total steps 950, warmup steps 9, warmup_factor 0.1111111111111111
Total steps 950, warmup steps 9, warmup_factor 0.1111111111111111
2023-02-10 11:17:27 - trainer.py[line:703] - INFO: begin training epoch 1
2023-02-10 11:17:27 - train.py[line:305] - INFO: Start iterating over samples
Traceback (most recent call last):
  File "/nvme/zhaozhiyuan/hthl/OFA/trainer.py", line 879, in train_step
    self._check_grad_norms(grad_norm)
  File "/nvme/zhaozhiyuan/hthl/OFA/trainer.py", line 1427, in _check_grad_norms
    + "-" * 80
FloatingPointError: Fatal error: gradients are inconsistent between workers. Try --ddp-backend=legacy_ddp. Or are you mixing up different generation of GPUs in training?
--------------------------------------------------------------------------------
grad_norm across the workers:
rank   0 = 43.87894113
rank   1 = 36.76721862
rank   2 = 32.03398652
rank   3 = 32.47642299

--------------------------------------------------------------------------------

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../../train.py", line 538, in <module>
    cli_main()
  File "../../train.py", line 531, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/nvme/zhaozhiyuan/hthl/OFA/fairseq/fairseq/distributed/utils.py", line 374, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/nvme/zhaozhiyuan/hthl/OFA/fairseq/fairseq/distributed/utils.py", line 348, in distributed_main
    main(cfg, **kwargs)
  File "../../train.py", line 199, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/nvme/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "../../train.py", line 310, in train
    log_output = trainer.train_step(samples)
  File "/nvme/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/nvme/zhaozhiyuan/hthl/OFA/trainer.py", line 916, in train_step
    **extra_kwargs,
  File "/nvme/zhaozhiyuan/hthl/OFA/tasks/ofa_task.py", line 334, in train_step
    loss, sample_size, logging_output = criterion(model, sample, update_num=update_num)
  File "/nvme/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nvme/zhaozhiyuan/hthl/OFA/criterions/label_smoothed_cross_entropy.py", line 199, in forward
    net_output = model(**sample["net_input"])
  File "/nvme/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nvme/zhaozhiyuan/hthl/OFA/fairseq/fairseq/distributed/module_proxy_wrapper.py", line 55, in forward
    return self.module(*args, **kwargs)
  File "/nvme/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/nvme/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 443 446 447 456 457 670 671 672 673 674 675
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
[The same pair of tracebacks (the FloatingPointError "gradients are inconsistent between workers" followed by the DDP RuntimeError about unused parameters) is then printed, interleaved, by the remaining workers. The reported grad_norm values are identical to those above, and the parameter indices that did not receive grad are the same on ranks 1, 2, and 3: 443 446 447 456 457 670 671 672 673 674 675.]
terminate called without an active exception
terminate called without an active exception
terminate called without an active exception
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76079 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 76077) of binary: /nvme/zhaozhiyuan/anaconda3/envs/ofa/bin/python3
Traceback (most recent call last):
  File "/nvme/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/nvme/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/nvme/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/nvme/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/nvme/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/nvme/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/nvme/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/nvme/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../../train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-02-10_11:17:32
  host      : SH-IDCA1404-10-140-54-21
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 76078)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 76078
[2]:
  time      : 2023-02-10_11:17:32
  host      : SH-IDCA1404-10-140-54-21
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 76080)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-10_11:17:32
  host      : SH-IDCA1404-10-140-54-21
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 76077)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 76077
============================================================

Could someone please tell me how to fix this, or how to enable multi-GPU training? Thanks!

JulioZhao97 commented 1 year ago

Solved by https://github.com/facebookresearch/fairseq/issues/3920
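
For anyone hitting this later: the "gradients are inconsistent between workers" failure is usually worked around by switching the DDP backend, as the FloatingPointError message itself suggests. A rough sketch of the change in pretrain_base.sh, assuming the torch.distributed.launch invocation shown in the traceback above (everything except the added flag stays as in the original script):

# add a --ddp-backend flag to the train.py arguments; the error message suggests legacy_ddp,
# and no_c10d (asked about below) is effectively the same backend in recent fairseq versions
python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} --master_port=${MASTER_PORT} ../../train.py \
    --ddp-backend=no_c10d \
    ...   # data path and the remaining flags unchanged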

JJJYmmm commented 6 months ago

Hi! I ran into the same problem. Did you just add `--ddp-backend=no_c10d`?

sanyog96 commented 4 months ago

Same issue for me as well. I still hit the error with --ddp-backend=no_c10d, and even changing the optimizer from adam to nag gives the same error.
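
If switching the backend does not help, the underlying RuntimeError in the log points at parameters that received no gradient on some ranks (indices 443, 446, ... in the traceback). Two diagnostic knobs are worth trying, sketched below as a starting point rather than a guaranteed fix: the TORCH_DISTRIBUTED_DEBUG variable named in the error message, and fairseq's --find-unused-parameters option, assuming the fairseq bundled with OFA exposes it:

# report exactly which parameters missed gradients on each rank (named in the error message)
export TORCH_DISTRIBUTED_DEBUG=DETAIL

# and/or let DDP tolerate parameters that are unused in a given forward pass,
# assuming the bundled fairseq exposes this distributed-training option
python3 -m torch.distributed.launch --nproc_per_node=${GPUS_PER_NODE} --master_port=${MASTER_PORT} ../../train.py \
    --find-unused-parameters \
    ...   # data path and the remaining flags unchanged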

EmreOzkose commented 1 week ago

I am facing the same issue, except that one rank reports 0 for grad_norm:

grad_norm across the workers:
rank   0 = 18.27746773
rank   1 = 0.00000000
rank   2 = 18.27746773
rank   3 = 18.27746773