Open DC-Lin opened 1 year ago
Please provide the complete error message, and make sure it is formatted properly.
Try using the transformers trainer in trainer_seq2seq.py: from transformers.trainer import Trainer
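Thank you. After making that change, I also had to change the base class of Seq2SeqTrainer to Trainer and comment out the save_changed parameter. I am not sure whether the save_changed parameter affects how the model is saved.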
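The change described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the repository's code: `Trainer` here is a simplified stand-in for `transformers.trainer.Trainer` (the real class takes many more arguments), and `build_trainer` and its argument values are hypothetical. It shows why `save_changed` has to be commented out once `Seq2SeqTrainer` inherits from the stock `Trainer`: that class's `__init__` does not define such a parameter, so passing it raises a `TypeError`.

```python
class Trainer:
    """Simplified stand-in for transformers.trainer.Trainer (assumption)."""

    def __init__(self, model=None, args=None):
        self.model = model
        self.args = args


class Seq2SeqTrainer(Trainer):
    """Subclass whose base class is now the stock Trainer."""


def build_trainer(**kwargs):
    """Hypothetical helper: build the trainer and report unsupported arguments."""
    try:
        return Seq2SeqTrainer(**kwargs)
    except TypeError as exc:
        print(f"rejected: {exc}")
        return None


# Passing save_changed fails because the base __init__ does not define it.
with_flag = build_trainer(model="model", args="args", save_changed=True)

# With the argument commented out, construction succeeds.
without_flag = build_trainer(
    model="model",
    args="args",
    # save_changed=True,
)
```

Whether dropping `save_changed` changes what is written to disk depends on how the original custom trainer used that flag, which this sketch does not cover.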
Is there an existing issue for this?
Current Behavior
ValueError: None is not in list
[2023-07-06 06:21:42,568] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 27275) of binary: /data/miniconda3/envs/nlp_tf2x/bin/python
Traceback (most recent call last):
File "/data/miniconda3/envs/nlp_tf2x/bin/torchrun", line 8, in
sys.exit(main())
File "/data/miniconda3/envs/nlp_tf2x/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/data/miniconda3/envs/nlp_tf2x/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in main
run(args)
File "/data/miniconda3/envs/nlp_tf2x/lib/python3.10/site-packages/torch/distributed/run.py", line 788, in run
elastic_launch(
File "/data/miniconda3/envs/nlp_tf2x/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/miniconda3/envs/nlp_tf2x/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
main.py FAILED
Failures: