Closed tfal-yan closed 11 months ago
Suggestion: roll the transformers version back to 4.31.0, which is the version Aquila2 supports. https://github.com/FlagAI-Open/FlagAI/issues/556
Newer transformers releases are incompatible, so as a workaround you can adjust the code by removing the calls to the missing functions.
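Until the modeling code is patched, a small stdlib-only guard can fail fast when the installed transformers version is not the one Aquila2 targets. This is a hedged sketch, not part of the repo; the function names (`parse_version`, `warn_if_unsupported`) are illustrative.

```python
def parse_version(v):
    """Turn a version string like '4.31.0' into (4, 31, 0) for tuple comparison."""
    return tuple(int(part) for part in v.split("."))

def warn_if_unsupported(installed, supported="4.31.0"):
    """Return a warning string if the installed version differs, else None."""
    if parse_version(installed) != parse_version(supported):
        return (f"transformers {installed} detected; Aquila2 targets "
                f"{supported}. Downgrade, or patch the modeling code.")
    return None

print(warn_if_unsupported("4.35.0"))
print(warn_if_unsupported("4.31.0"))
```

In a real setup you would feed `warn_if_unsupported` the output of `importlib.metadata.version("transformers")` at startup.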
Failures:
Should --nproc_per_node=1 be changed to the number of GPUs?
With --nproc_per_node=2 and two GPUs (4 and 3) configured, it still fails. WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Failures: [1]: time : 2023-11-30_11:39:43
export CUDA_VISIBLE_DEVICES="4,3,2,1,0"; bash finetune/7B/finetune_qlora_single_node.sh
[2023-11-30 14:14:03,335] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/data0/testCase/Aquila2/finetune/finetune.py", line 481, in <module>
Does the following file also need to be updated to match: Aquila2/finetune/finetune.py ?
[2023-11-30 14:14:03,335] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/data0/testCase/Aquila2/finetune/finetune.py", line 481, in <module>
    train()
  File "/data0/testCase/Aquila2/finetune/finetune.py", line 350, in train
    ) = parser.parse_args_into_dataclasses()
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/testCase/lib/python3.11/site-packages/transformers/hf_argparser.py", line 347, in parse_args_into_dataclasses
    raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--use_single_node', 'True']
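The ValueError above happens because HfArgumentParser only accepts flags that are declared as fields on the dataclasses it is given, so an updated script passing --use_single_node fails against an older finetune.py that lacks that field. A minimal stdlib reproduction of that behavior (the dataclass and helper names here are illustrative, not the actual transformers implementation):

```python
import argparse
from dataclasses import dataclass, fields

@dataclass
class TrainingArguments:
    """Stand-in for the dataclasses finetune.py parses into (hypothetical)."""
    learning_rate: float = 2e-5
    use_single_node: bool = False  # declaring the field makes the flag accepted

def parse_into_dataclass(cls, argv):
    """Toy version of HfArgumentParser.parse_args_into_dataclasses."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        # NB: type=bool treats any non-empty string as True; the real
        # HfArgumentParser handles booleans more carefully.
        parser.add_argument(f"--{f.name}", type=type(f.default), default=f.default)
    args, remaining = parser.parse_known_args(argv)
    if remaining:
        # This mirrors the check that raises in hf_argparser.py line 347
        raise ValueError(f"Some specified arguments are not used: {remaining}")
    return cls(**vars(args))

cfg = parse_into_dataclass(TrainingArguments, ["--use_single_node", "True"])
print(cfg)
```

If `TrainingArguments` dropped the `use_single_node` field, the same call would raise the ValueError seen in the log, which is why finetune.py must be updated together with the launch script.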
Yes, it now supports the new flag.
It works now: the 34B model with QLoRA via the single-node script. Thanks!
Loading checkpoint shards: 100%|██████████| 3/3 [00:35<00:00, 11.85s/it]
Traceback (most recent call last):
  File "/data0/testCase/Aquila2/finetune/finetune.py", line 481, in <module>
train()
File "/data0/testCase/Aquila2/finetune/finetune.py", line 399, in train
model = prepare_model_for_kbit_training(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/testCase/lib/python3.11/site-packages/peft/utils/other.py", line 130, in prepare_model_for_kbit_training
model.gradient_checkpointing_enable(*gc_enable_kwargs)
File "/root/anaconda3/envs/testCase/lib/python3.11/site-packages/transformers-4.35.0-py3.11.egg/transformers/modeling_utils.py", line 1872, in gradient_checkpointing_enable
self._set_gradient_checkpointing(enable=True, gradient_checkpointing_func=gradient_checkpointing_func)
TypeError: AquilaPreTrainedModel._set_gradient_checkpointing() got an unexpected keyword argument 'enable'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2585097) of binary: /root/anaconda3/envs/testCase/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/testCase/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/root/anaconda3/envs/testCase/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f( args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/testCase/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/root/anaconda3/envs/testCase/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/root/anaconda3/envs/testCase/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/testCase/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
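The underlying TypeError is a signature mismatch: transformers 4.35 calls the model's `_set_gradient_checkpointing` hook with keyword arguments (`enable=`, `gradient_checkpointing_func=`), while Aquila2's custom modeling code still defines the older `(module, value=False)` form, so either pinning transformers to 4.31.0 or updating the hook resolves it. A hedged, self-contained sketch of the mismatch (both classes and the call-site helper are simplified stand-ins, not the real transformers/Aquila code):

```python
class OldStyleModel:
    """Mimics the pre-4.35 hook signature Aquila2's modeling code defines."""
    def _set_gradient_checkpointing(self, module, value=False):
        module.gradient_checkpointing = value

class NewStyleModel:
    """Mimics the keyword-based hook transformers >= 4.35 expects."""
    def _set_gradient_checkpointing(self, enable=True, gradient_checkpointing_func=None):
        self.gradient_checkpointing = enable
        self._gradient_checkpointing_func = gradient_checkpointing_func

def enable_like_4_35(model):
    """Roughly the call site in modeling_utils.py that raises in the log."""
    model._set_gradient_checkpointing(
        enable=True, gradient_checkpointing_func=lambda fn, *a, **kw: fn(*a, **kw)
    )

old, new = OldStyleModel(), NewStyleModel()
old_failed = False
try:
    enable_like_4_35(old)  # reproduces: unexpected keyword argument 'enable'
except TypeError:
    old_failed = True
enable_like_4_35(new)      # the updated signature accepts the keyword call
print("old signature failed:", old_failed)
print("new signature enabled:", new.gradient_checkpointing)
```

Updating the hook in Aquila2's modeling file to the keyword form (or simply pinning transformers==4.31.0, as suggested above) avoids the crash.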