/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
Traceback (most recent call last):
  File "/data1/fffan/5_NLP/4_ChineseVicuna/Chinese_Vicuna_0420/finetune_fffan.py", line 279, in <module>
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/transformers/trainer.py", line 1749, in _inner_training_loop
    model = self._wrap_model(self.model_wrapped)
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/transformers/trainer.py", line 1569, in _wrap_model
    model = nn.parallel.DistributedDataParallel(
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 565, in __init__
    self._log_and_throw(
  File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 686, in _log_and_throw
    raise err_type(err_msg)
ValueError: DistributedDataParallel's input module must be on the same type of devices, but input module parameters locate in {'cuda', 'cpu'}.
(the same FutureWarning and traceback are printed by both ranks)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 98320) of binary: /root/anaconda3/envs/chinesevicuna/bin/python3.10
Traceback (most recent call last):
File "/root/anaconda3/envs/chinesevicuna/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/chinesevicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune_fffan.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-04-20_17:41:48
host : gpu19
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 98321)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-20_17:41:48
host : gpu19
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 98320)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
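The ValueError in the traceback means that, at the moment the Trainer wraps the model in DistributedDataParallel, some parameters are already on a GPU while others are still on the CPU, which is typically what happens when `device_map="auto"` offloads part of the model. A minimal diagnostic sketch (assuming `model` is whatever finetune_fffan.py hands to the Trainer; the helper name here is made up):

```python
import torch

def report_param_devices(model: torch.nn.Module) -> None:
    """Print the device types the model's parameters live on.
    DistributedDataParallel requires this set to be exactly {'cuda'}."""
    devices = {p.device.type for p in model.parameters()}
    print(f"parameter device types: {devices}")
    if len(devices) > 1:
        # List the modules that were left on the CPU (e.g. offloaded by device_map="auto")
        for name, p in model.named_parameters():
            if p.device.type == "cpu":
                print(f"  still on CPU: {name}")
```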
Hi, I've run into a number of strange problems with single-GPU and multi-GPU training. For example, when I train on a single GPU I have to change the code to the version shown in the screenshot below for training to work, but if I run multi-GPU training with that same code, it fails with the error shown above.

It's telling me the module isn't all on the same device? What is going on here? Is it a device_map problem?

Also, if I instead run multi-GPU training with the code that is commented out in the screenshot above, I get CUDA out of memory. I've already reduced my batch_size to 16 and it still runs out of memory. What is happening? (On a single GPU, a batch size of 128 works fine.)
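For reference, the usual workaround under torchrun is to stop letting `device_map="auto"` spread the model across the CPU and several GPUs, and instead pin each worker's full copy of the model to its own GPU via LOCAL_RANK, so that every parameter is on 'cuda' before DDP wraps the model. A hedged sketch of that pattern (the model id and loading flags are placeholders, not necessarily what finetune_fffan.py uses):

```python
import os
from transformers import LlamaForCausalLM

# torchrun sets LOCAL_RANK for each worker; load the entire model onto that
# worker's GPU so no parameter stays on the CPU before the DDP wrap.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
device_map = {"": local_rank}

model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",  # placeholder model id
    load_in_8bit=True,                # as in an 8-bit LoRA setup
    device_map=device_map,
)
```

On the out-of-memory question: with DistributedDataParallel every GPU holds a full replica of the model, so the per-GPU memory headroom is smaller than in the single-GPU `device_map="auto"` setup, where layers can spill to the CPU. Lowering per_device_train_batch_size and raising gradient_accumulation_steps keeps the same effective batch size with a lower peak memory.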