Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model, a low-resource Chinese llama+lora recipe whose structure follows alpaca
https://github.com/Facico/Chinese-Vicuna
Apache License 2.0

Question about multi-GPU training #96

Closed Tian14267 closed 1 year ago

Tian14267 commented 1 year ago

Hi, I run into a lot of strange problems with single-GPU versus multi-GPU training. For example, for single-GPU training I have to change the code to the following before training works: [image: screenshot of the modified code] But if I use that same code for multi-GPU training, I get this error:


/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
Traceback (most recent call last):
  File "/data1/fffan/5_NLP/4_ChineseVicuna/Chinese_Vicuna_0420/finetune_fffan.py", line 279, in <module>
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/transformers/trainer.py", line 1749, in _inner_training_loop
    model = self._wrap_model(self.model_wrapped)
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/transformers/trainer.py", line 1569, in _wrap_model
    model = nn.parallel.DistributedDataParallel(
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 565, in __init__
    self._log_and_throw(
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 686, in _log_and_throw
    raise err_type(err_msg)
ValueError: DistributedDataParallel's input module must be on the same type of devices, but input module parameters locate in {'cuda', 'cpu'}.
(the same warning and traceback are printed by both worker processes)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 98320) of binary: /root/anaconda3/envs/chinesevicuna/bin/python3.10
Traceback (most recent call last):
  File "/root/anaconda3/envs/chinesevicuna/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune_fffan.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-04-20_17:41:48
  host      : gpu19
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 98321)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-20_17:41:48
  host      : gpu19
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 98320)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

It says my parameters are not all on the same kind of device? What is going on here? Is this a device_map problem?
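For context: this ValueError is usually raised when `device_map="auto"` has offloaded part of the model to CPU or split it across GPUs, while DistributedDataParallel needs every parameter of a rank's replica on that rank's single GPU. Below is a minimal, hedged sketch of the usual workaround in alpaca-lora-style scripts; the model path and loading flags are illustrative, not taken from this repo's exact script:

```python
import os

import torch
from transformers import LlamaForCausalLM

# torchrun sets WORLD_SIZE and LOCAL_RANK for every worker process.
world_size = int(os.environ.get("WORLD_SIZE", 1))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
use_ddp = world_size > 1

# Single GPU: "auto" lets accelerate place (and offload) layers freely.
# Multi-GPU DDP: pin the whole model to this rank's GPU so that DDP sees
# parameters on exactly one device type.
device_map = {"": local_rank} if use_ddp else "auto"

model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",  # illustrative base-model path
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
```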

Also, if I instead use the commented-out code from the screenshot above for multi-GPU training, I get CUDA out of memory. I have already dropped my batch_size to 16 and it still runs out of memory. What is going on? (On a single GPU even a batch size of 128 works fine.)
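On the out-of-memory side: with the HF Trainer, `per_device_train_batch_size` is the batch placed on each GPU, so a large effective batch is normally reached through gradient accumulation rather than a large per-device batch. A hedged sketch of that arithmetic (the values and output_dir are illustrative only):

```python
import os

from transformers import TrainingArguments

BATCH_SIZE = 128        # desired effective (global) batch size
MICRO_BATCH_SIZE = 4    # what actually fits on one GPU per forward/backward

# Under torchrun, WORLD_SIZE is the number of participating GPUs.
world_size = int(os.environ.get("WORLD_SIZE", 1))

training_args = TrainingArguments(
    output_dir="./lora-vicuna-out",  # illustrative
    per_device_train_batch_size=MICRO_BATCH_SIZE,
    # effective batch = micro_batch * grad_accum_steps * world_size
    gradient_accumulation_steps=BATCH_SIZE // MICRO_BATCH_SIZE // world_size,
    fp16=True,
)
```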

Data2Me commented 1 year ago

I'm hitting the same error: torch.distributed.elastic.multiprocessing.errors.ChildFailedError. Have you solved it yet?

jasonSunhu commented 1 year ago

@Tian14267 How did you solve this?

JupyterChu commented 1 year ago

Same problem on three 3090s. Any idea how to fix it?

Facico commented 1 year ago

torch.distributed.elastic.multiprocessing.errors.ChildFailedError shows up whenever the program terminates abnormally, for whatever reason: a problem in some library, the process getting killed, and so on. Too many different situations produce this error, so it is generally not the place to look when diagnosing a failure; treat it only as a signal that a worker process exited. A new issue has been opened for the original question, you can refer to that.
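As the summary's own link (https://pytorch.org/docs/stable/elastic/errors.html) suggests, the real traceback from the failing worker can be surfaced by wrapping the training entry point with torch.distributed.elastic's `record` decorator. A minimal sketch, assuming the script exposes a `main()` function:

```python
from torch.distributed.elastic.multiprocessing.errors import record

@record  # captures the worker's exception and writes it to the elastic error file
def main():
    # ... build the tokenizer, model and Trainer, then call trainer.train() ...
    pass

if __name__ == "__main__":
    main()
```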