QingruZhang / AdaLoRA

AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning (ICLR 2023).
MIT License

AssertionError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient. #18

Open · DigitalLifeYZQiu opened this issue 10 months ago

DigitalLifeYZQiu commented 10 months ago

Hi authors, thanks for releasing the source code of AdaLoRA! I find this fine-tuning method very important and interesting. I am currently trying to run the code, ./scripts/run_bart_xsum.sh in particular, but I am stuck on the assertion error quoted in the title of this issue:

AssertionError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.
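For context, torch.nn.parallel.DistributedDataParallel raises this assertion when the module it wraps has no parameter with requires_grad=True. As far as I can tell, the check is roughly the following (a simplified sketch where model stands for the wrapped module, not the exact PyTorch source):

# Simplified sketch of the condition DDP asserts on construction
# (paraphrased, not the actual PyTorch implementation):
has_trainable = any(p.requires_grad for p in model.parameters())
assert has_trainable, (
    "DistributedDataParallel is not needed when a module "
    "doesn't have any parameter that requires a gradient."
)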

I made some minor changes to the script, shown below, by changing the model path and adding CUDA_VISIBLE_DEVICES:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
accelerate launch --multi_gpu --num_machines=1 --num_processes=8 \
--main_process_port=8679 --mixed_precision="no" \
examples/summarization/run_summarization_no_trainer.py \
--model_name_or_path ./bart-large \
--dataset_name xsum \
--apply_lora --apply_adalora \
--lora_type svd --target_rank 8 --lora_r 12 \
--lora_alpha 32 \
--reg_orth_coef 0.1 \
--init_warmup 6000 --final_warmup 25000 --mask_interval 100 \
--beta1 0.85 --beta2 0.85 \
--lora_module q_proj,k_proj,v_proj,out_proj,fc1,fc2 \
--per_device_train_batch_size 8 --learning_rate 5e-4 \
--num_train_epochs 25 --num_warmup_steps 3000 \
--max_source_length 768 --max_target_length 64 --max_length 768 \
--pad_to_max_length --num_beams 8 \
--per_device_eval_batch_size 8 \
--seed 9 \
--with_tracking \
--tb_writter_loginterval 500 \
--output_dir ./output/bart-large/xsum 

The code did not run as I expected: the model and dataset appear to load correctly, but the training setup fails with the following error:

Traceback (most recent call last):
  File "examples/summarization/run_summarization_no_trainer.py", line 954, in <module>
    main()
  File "examples/summarization/run_summarization_no_trainer.py", line 703, in main
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/accelerate/accelerator.py", line 482, in prepare
    result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/accelerate/accelerator.py", line 482, in <genexpr>
    result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/accelerate/accelerator.py", line 378, in _prepare_one
    return self.prepare_model(obj)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/accelerate/accelerator.py", line 506, in prepare_model
    model, device_ids=[self.local_process_index], output_device=self.local_process_index, **kwargs
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 394, in __init__
    "DistributedDataParallel is not needed when a module "
AssertionError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 20831) of binary: /home/qiuyunzhong/anaconda3/envs/NLG/bin/python
/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning: 

**********************************************************************
               CHILD PROCESS FAILED WITH NO ERROR_FILE                
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 20831 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train
**********************************************************************
  warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/torch/distributed/run.py", line 702, in <module>
    main()
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
    return f(*args, **kwargs)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/torch/distributed/run.py", line 698, in main
    run(args)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/torch/distributed/run.py", line 692, in run
    )(*cmd_args)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
*****************************************************************
  examples/summarization/run_summarization_no_trainer.py FAILED  
=================================================================
Root Cause:
[0]:
  time: 2023-11-20_17:07:26
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 20831)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=================================================================
Other Failures:
[1]:
  time: 2023-11-20_17:07:26
  rank: 3 (local_rank: 3)
  exitcode: 1 (pid: 20834)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[2]:
  time: 2023-11-20_17:07:26
  rank: 5 (local_rank: 5)
  exitcode: 1 (pid: 20837)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[3]:
  time: 2023-11-20_17:07:26
  rank: 6 (local_rank: 6)
  exitcode: 1 (pid: 20838)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[4]:
  time: 2023-11-20_17:07:26
  rank: 7 (local_rank: 7)
  exitcode: 1 (pid: 20843)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
*****************************************************************

Traceback (most recent call last):
  File "/home/qiuyunzhong/anaconda3/envs/NLG/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/accelerate/commands/launch.py", line 562, in launch_command
    multi_gpu_launcher(args)
  File "/home/qiuyunzhong/anaconda3/envs/NLG/lib/python3.7/site-packages/accelerate/commands/launch.py", line 306, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/qiuyunzhong/anaconda3/envs/NLG/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node', '8', '--master_port', '8679', 'examples/summarization/run_summarization_no_trainer.py', '--model_name_or_path', './bart-large', '--dataset_name', 'xsum', '--apply_lora', '--apply_adalora', '--lora_type', 'svd', '--target_rank', '8', '--lora_r', '12', '--lora_alpha', '32', '--reg_orth_coef', '0.1', '--init_warmup', '6000', '--final_warmup', '25000', '--mask_interval', '100', '--beta1', '0.85', '--beta2', '0.85', '--lora_module', 'q_proj,k_proj,v_proj,out_proj,fc1,fc2', '--per_device_train_batch_size', '8', '--learning_rate', '5e-4', '--num_train_epochs', '25', '--num_warmup_steps', '3000', '--max_source_length', '768', '--max_target_length', '64', '--max_length', '768', '--pad_to_max_length', '--num_beams', '8', '--per_device_eval_batch_size', '8', '--seed', '9', '--with_tracking', '--tb_writter_loginterval', '500', '--output_dir', './output/bart-large/xsum']' returned non-zero exit status 1.
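
If it helps with diagnosis, a quick way to check would be to list the parameters that still require gradients right before accelerator.prepare(...) is called; an empty list would explain the assertion above. This is only a local debugging sketch, not part of the repository code, and I am assuming the injected SVD-LoRA matrices show up under names like lora_A / lora_E / lora_B as in loralib:

# Debugging sketch: placed in run_summarization_no_trainer.py just before
# accelerator.prepare(model, optimizer, ...) to verify that the LoRA
# parameters were marked as trainable.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"Number of trainable parameters: {len(trainable)}")
print("\n".join(trainable[:20]))  # expect entries containing lora_A / lora_E / lora_B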

This seems to be an error related to Accelerate or DistributedDataParallel. I have no idea what caused it, since I strictly followed the instructions in README.md. My virtual environment is set up as follows:

absl-py==1.1.0
accelerate==0.10.0
aiohttp==3.8.6
aiosignal==1.3.1
async-timeout==4.0.3
asynctest==0.13.0
attrs==23.1.0
Brotli==1.1.0
cachetools==5.3.2
certifi @ file:///croot/certifi_1671487769961/work/certifi
charset-normalizer==3.3.2
click==8.1.7
datasets==2.3.2
dill==0.3.5.1
filelock==3.12.2
frozenlist==1.3.3
fsspec==2023.1.0
google-auth==2.23.4
google-auth-oauthlib==0.4.6
grpcio==1.59.2
huggingface-hub==0.16.4
idna==3.4
importlib-metadata==6.7.0
inflate64==0.3.1
joblib==1.3.2
-e git+https://github.com/QingruZhang/AdaLoRA.git@d10f5ebee16c478fa2f41a44a237b38e8c9b0338#egg=loralib&subdirectory=loralib
Markdown==3.4.4
MarkupSafe==2.1.3
multidict==6.0.4
multiprocess==0.70.13
multivolumefile==0.2.3
nltk==3.8.1
numpy==1.21.6
oauthlib==3.2.2
packaging==23.2
pandas==1.3.5
Pillow==9.5.0
protobuf==3.17.3
psutil==5.9.6
py7zr==0.20.6
pyarrow==8.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pybcj==1.0.1
pycryptodomex==3.19.0
pyppmd==1.0.0
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
pyzstd==0.15.9
regex==2023.10.3
requests==2.31.0
requests-oauthlib==1.3.1
responses==0.18.0
rouge-score==0.1.2
rsa==4.9
scipy==1.7.3
sentencepiece==0.1.96
six==1.16.0
tensorboard==2.11.2
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorboardX==2.6
texttable==1.7.0
tokenizers==0.12.1
torch==1.9.1+cu111
torchaudio==0.9.1
torchvision==0.10.1+cu111
tqdm==4.66.1
transformers==4.21.0
typing_extensions==4.7.1
urllib3==2.0.7
Werkzeug==2.2.3
wrapt==1.16.0
xxhash==3.4.1
yarl==1.9.2
zipp==3.15.0

Am I doing something wrong? Do I need to apply additional settings?