AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning (ICLR 2023).
MIT License
When I use multi-GPU training on a single machine, the following error is reported: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 470401) of binary: /home/sqh/miniconda3/envs/NLU/bin/python #17
When I use the command:
python -m torch.distributed.launch --master_port=8679 --nproc_per_node=2 examples/text-classification/run_glue.py --model_name_or_path /home/sqh/code/models/deberta-v3-base --task_name mnli --apply_adalora --apply_lora --lora_type svd --target_rank 1 --lora_r 3 --reg_orth_coef 0.1 --init_warmup 8000 --final_warmup 50000 --mask_interval 100 --beta1 0.85 --beta2 0.85 --lora_module query,key,value,intermediate,layer.output,attention.output --lora_alpha 16 --do_train --do_eval --max_seq_length 256 --per_device_train_batch_size 32 --learning_rate 5e-4 --num_train_epochs 7 --warmup_steps 1000 --cls_dropout 0.15 --weight_decay 0 --evaluation_strategy steps --eval_steps 3000 --save_strategy steps --save_steps 30000 --logging_steps 500 --seed 6 --root_output_dir ./output/lora/glue/mnli --overwrite_output_dir
I encounter the following error:
Traceback (most recent call last):
File "examples/text-classification/run_glue.py", line 754, in <module>
main()
File "examples/text-classification/run_glue.py", line 674, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/sqh/code/AdaLoRA/NLU/src/transformers/trainer.py", line 937, in train
self.rankallocator.set_total_step(max_steps)
File "/home/sqh/code/AdaLoRA/loralib/loralib/adalora.py", line 162, in set_total_step
assert self.total_step>self.initial_warmup+self.final_warmup
AssertionError
Traceback (most recent call last):
File "examples/text-classification/run_glue.py", line 754, in <module>
main()
File "examples/text-classification/run_glue.py", line 674, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/sqh/code/AdaLoRA/NLU/src/transformers/trainer.py", line 937, in train
self.rankallocator.set_total_step(max_steps)
File "/home/sqh/code/AdaLoRA/loralib/loralib/adalora.py", line 162, in set_total_step
assert self.total_step>self.initial_warmup+self.final_warmup
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 470401) of binary: /home/sqh/miniconda3/envs/NLU/bin/python
Traceback (most recent call last):
File "/home/sqh/miniconda3/envs/NLU/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/sqh/miniconda3/envs/NLU/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/sqh/miniconda3/envs/NLU/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/sqh/miniconda3/envs/NLU/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/sqh/miniconda3/envs/NLU/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/sqh/miniconda3/envs/NLU/lib/python3.7/site-packages/torch/distributed/run.py", line 692, in run
)(*cmd_args)
File "/home/sqh/miniconda3/envs/NLU/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/sqh/miniconda3/envs/NLU/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
examples/text-classification/run_glue.py FAILED
Root Cause:
[0]:
time: 2023-11-16_09:59:54
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 470401)
error_file: <N/A>
msg: "Process failed with exitcode 1"
Other Failures:
[1]:
time: 2023-11-16_09:59:54
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 470402)
error_file: <N/A>
msg: "Process failed with exitcode 1"

However, when I use a single GPU, training runs normally.
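A possible explanation for why only the two-GPU run hits the assertion, offered as a sketch and not confirmed by the AdaLoRA authors: the failing check is `self.total_step > self.initial_warmup + self.final_warmup`, and the total number of optimizer steps shrinks as more GPUs share each global batch. The arithmetic below assumes the standard GLUE MNLI training split of 392,702 examples, no gradient accumulation, and that the repository's modified `trainer.py` derives `max_steps` roughly the way a stock Hugging Face Trainer does. The helper names `total_optimizer_steps` and `schedule_fits` are hypothetical and exist only for this illustration.

```python
import math

# Assumption: size of the standard GLUE MNLI training split.
MNLI_TRAIN_EXAMPLES = 392_702

# Values taken from the command in this issue.
PER_DEVICE_BATCH_SIZE = 32
NUM_EPOCHS = 7
INIT_WARMUP = 8000
FINAL_WARMUP = 50000


def total_optimizer_steps(num_examples, per_device_batch_size, num_gpus, num_epochs):
    """Approximate max_steps, assuming no gradient accumulation.

    Hypothetical helper: each process sees num_examples / num_gpus samples
    per epoch, so the number of batches per epoch drops as GPUs are added.
    """
    global_batch = per_device_batch_size * num_gpus
    steps_per_epoch = math.ceil(num_examples / global_batch)
    return steps_per_epoch * num_epochs


def schedule_fits(num_gpus):
    """Return (total_step, whether the AdaLoRA assertion would pass)."""
    total = total_optimizer_steps(
        MNLI_TRAIN_EXAMPLES, PER_DEVICE_BATCH_SIZE, num_gpus, NUM_EPOCHS
    )
    return total, total > INIT_WARMUP + FINAL_WARMUP


if __name__ == "__main__":
    for n in (1, 2):
        total, ok = schedule_fits(n)
        print(f"{n} GPU(s): total_step={total}, "
              f"warmup budget={INIT_WARMUP + FINAL_WARMUP}, assertion passes: {ok}")
```

Under these assumptions a single GPU yields 85,904 steps, which exceeds the 58,000 steps reserved by `--init_warmup 8000 --final_warmup 50000`, while two GPUs yield only 42,952 steps, which does not. If that is indeed the cause, either lowering `--init_warmup` and `--final_warmup` or reducing `--per_device_train_batch_size` to 16 (keeping the global batch at 32) should let the assertion pass. Neither change has been tested here.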