Hi authors, thanks for releasing the source code of AdaLoRA; I find this fine-tuning method very important and interesting! I am currently trying to run `./scripts/run_bart_xsum.sh` in particular, and I am confused by the assertion error quoted in the title of this issue:
```text
AssertionError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.
```
I have made some minor changes to the script, changing the model path and adding `CUDA_VISIBLE_DEVICES`:
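For reference, this is the launch command after my changes, as reconstructed from the failing subprocess command shown in the traceback below (the exact `CUDA_VISIBLE_DEVICES` value here is just an example; only the model path and the device list were changed from the original script):

```bash
# Reconstructed from the failing torch.distributed.run command in the traceback
# below. The device list is illustrative (eight GPUs, matching --nproc_per_node 8).
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

accelerate launch --multi_gpu --num_processes 8 --main_process_port 8679 \
    examples/summarization/run_summarization_no_trainer.py \
    --model_name_or_path ./bart-large \
    --dataset_name xsum \
    --apply_lora --apply_adalora \
    --lora_type svd --target_rank 8 --lora_r 12 --lora_alpha 32 \
    --reg_orth_coef 0.1 \
    --init_warmup 6000 --final_warmup 25000 --mask_interval 100 \
    --beta1 0.85 --beta2 0.85 \
    --lora_module q_proj,k_proj,v_proj,out_proj,fc1,fc2 \
    --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
    --learning_rate 5e-4 --num_train_epochs 25 --num_warmup_steps 3000 \
    --max_source_length 768 --max_target_length 64 --max_length 768 \
    --pad_to_max_length --num_beams 8 \
    --seed 9 \
    --with_tracking --tb_writter_loginterval 500 \
    --output_dir ./output/bart-large/xsum
```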
The code did not run as I expected. The model and dataset seem to be loaded correctly, but the training process errors out:
```text
Traceback (most recent call last):
  File "examples/summarization/run_summarization_no_trainer.py", line 954, in <module>
    main()
  File "examples/summarization/run_summarization_no_trainer.py", line 703, in main
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/accelerate/accelerator.py", line 482, in prepare
    result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/accelerate/accelerator.py", line 482, in <genexpr>
    result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/accelerate/accelerator.py", line 378, in _prepare_one
    return self.prepare_model(obj)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/accelerate/accelerator.py", line 506, in prepare_model
    model, device_ids=[self.local_process_index], output_device=self.local_process_index, **kwargs
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 394, in __init__
    "DistributedDataParallel is not needed when a module "
AssertionError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 20831) of binary: /home/qiuyunzhong/anaconda3/envs/NLG/bin/python
/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 20831 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
# do train
**********************************************************************
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/torch/distributed/run.py", line 702, in <module>
    main()
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
    return f(*args, **kwargs)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/torch/distributed/run.py", line 698, in main
    run(args)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/torch/distributed/run.py", line 692, in run
    )(*cmd_args)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
*****************************************************************
examples/summarization/run_summarization_no_trainer.py FAILED
=================================================================
Root Cause:
[0]:
  time: 2023-11-20_17:07:26
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 20831)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=================================================================
Other Failures:
[1]:
  time: 2023-11-20_17:07:26
  rank: 3 (local_rank: 3)
  exitcode: 1 (pid: 20834)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[2]:
  time: 2023-11-20_17:07:26
  rank: 5 (local_rank: 5)
  exitcode: 1 (pid: 20837)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[3]:
  time: 2023-11-20_17:07:26
  rank: 6 (local_rank: 6)
  exitcode: 1 (pid: 20838)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[4]:
  time: 2023-11-20_17:07:26
  rank: 7 (local_rank: 7)
  exitcode: 1 (pid: 20843)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
*****************************************************************
Traceback (most recent call last):
  File "/home/qiuyunzhong/anaconda3/envs/NLG/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/accelerate/commands/launch.py", line 562, in launch_command
    multi_gpu_launcher(args)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/accelerate/commands/launch.py", line 306, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/qiuyunzhong/anaconda3/envs/NLG/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node', '8', '--master_port', '8679', 'examples/summarization/run_summarization_no_trainer.py', '--model_name_or_path', './bart-large', '--dataset_name', 'xsum', '--apply_lora', '--apply_adalora', '--lora_type', 'svd', '--target_rank', '8', '--lora_r', '12', '--lora_alpha', '32', '--reg_orth_coef', '0.1', '--init_warmup', '6000', '--final_warmup', '25000', '--mask_interval', '100', '--beta1', '0.85', '--beta2', '0.85', '--lora_module', 'q_proj,k_proj,v_proj,out_proj,fc1,fc2', '--per_device_train_batch_size', '8', '--learning_rate', '5e-4', '--num_train_epochs', '25', '--num_warmup_steps', '3000', '--max_source_length', '768', '--max_target_length', '64', '--max_length', '768', '--pad_to_max_length', '--num_beams', '8', '--per_device_eval_batch_size', '8', '--seed', '9', '--with_tracking', '--tb_writter_loginterval', '500', '--output_dir', './output/bart-large/xsum']' returned non-zero exit status 1.
```
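If I read the assertion correctly, PyTorch's DDP wrapper raises it when no parameter of the wrapped module has `requires_grad=True`, i.e. the whole model is frozen. A minimal sketch (single process, gloo backend on CPU; not AdaLoRA-specific, and the module/variable names are my own) reproduces the same assertion on the PyTorch version from my traceback:

```python
import os
import torch.distributed as dist
import torch.nn as nn

# Single-process process group just so DDP can be constructed (gloo, CPU-only).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = nn.Linear(4, 4)
for p in model.parameters():
    p.requires_grad = False  # every parameter frozen -> nothing for DDP to sync

# On the PyTorch version in my traceback this raises:
# AssertionError: DistributedDataParallel is not needed when a module
# doesn't have any parameter that requires a gradient.
ddp_model = nn.parallel.DistributedDataParallel(model)
```

So my guess is that, at the point where `accelerator.prepare(...)` wraps the model for DDP, none of the parameters (including the injected AdaLoRA ones) are marked as trainable.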
This seems to be an error related to Accelerate or DistributedDataParallel. I have no idea what caused it, as I strictly followed the instructions in README.md. My virtual environment settings are:
Am I doing something wrong? Do I need to apply additional settings?