Maybe PR https://github.com/OpenBMB/BMTrain/pull/77 is related; you can try the version of BMTrain from that PR.
I currently have two tasks, task1 and task2, where task2 needs to be trained on top of the checkpoint produced by task1. The best.pt generated after training task1 can be loaded during task1's inference, but it cannot be loaded during task2's tuning. The keys in the model's state_dict and in best.pt's state_dict are exactly the same, yet I get the error Unexpected key(s) in state_dict: "generator.encoder.layers.0.self_att.self_attention.project_q.lora.lora_A". What could be causing this?
How do you load the checkpoint produced by task1? Directly like this: model.load_state_dict(torch.load(args.LoRA_path), strict=False), or in some other way? Loading this way does not raise an error for me, but the test results are very poor, so it feels like the weights were never actually loaded.
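One quick way to check whether the weights were actually applied is to inspect the return value of load_state_dict and compare a parameter before and after loading. A minimal sketch in plain PyTorch, reusing model and args.LoRA_path from the snippet above (the key name is only an example and may need adjusting to your checkpoint; BMTrain's checkpoint blocks can change how parameters are exposed, so treat this as a rough sanity check):

```python
import torch

# Snapshot one LoRA parameter before loading (pick any key that exists in your checkpoint).
key = "encoder.layers.30.self_att.self_attention.project_q.lora.lora_A"
before = dict(model.named_parameters())[key].detach().clone()

# With strict=False, mismatched keys are silently ignored, so print them explicitly.
result = model.load_state_dict(torch.load(args.LoRA_path), strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)

# If the parameter is unchanged, the checkpoint was never actually applied to it.
after = dict(model.named_parameters())[key].detach()
print("parameter changed:", not torch.equal(before, after))
```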
Have you tried the https://github.com/OpenBMB/BMTrain/pull/77 version? I have tested it.
Hi, I tried the method you provided and it still doesn't work. I have a question: the keys of the LoRA weights I saved are encoder.layers.30.self_att.self_attention.project_q.lora.lora_A and encoder.layers.30.self_att.self_attention.project_q.lora.lora_B,
but the weight keys the model expects look like encoder.layers.0.self_att.self_attention.project_q.weight.
What causes this inconsistency?
encoder.layers.0.self_att.self_attention.project_q.weight is in the LLM, while encoder.layers.30.self_att.self_attention.project_q.lora.lora_A is in LoRA.
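For reference, a quick way to see which keys in a checkpoint belong to the base LLM and which were injected by OpenDelta as LoRA weights (a minimal sketch; best.pt is the checkpoint file mentioned earlier):

```python
import torch

state_dict = torch.load("best.pt", map_location="cpu")

# Keys containing ".lora." are the injected LoRA parameters; the rest are base-model weights.
lora_keys = [k for k in state_dict if ".lora." in k]
base_keys = [k for k in state_dict if ".lora." not in k]
print(len(lora_keys), "LoRA keys, e.g.", lora_keys[:2])
print(len(base_keys), "base-model keys, e.g.", base_keys[:2])
```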
Natively, BMTrain puts the model parameters into something called a "checkpoint block", and only the parameters inside a checkpoint block are loaded when bmt.load is used. So the parameters that OpenDelta injects later cannot be loaded.
In https://github.com/OpenBMB/BMTrain/pull/77, we add a special if branch to handle these later-inserted parameters.
If it still does not work, could you please go into the code where your BMTrain is installed, add a print after the line https://github.com/OpenBMB/BMTrain/pull/77/files#diff-3d2ebf2a27c806bc06c8d5506d93ca127d2ae907832a523fd1b53e62d8a59e51R532, and check whether the special if branch is entered when the parameter name includes "lora".
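For illustration only, the debug print suggested above could look roughly like this; the variable name key is an assumption about the loop in BMTrain's load routine, not the actual code in the PR:

```python
# Hypothetical illustration: inside the special branch added in PR #77,
# confirm that later-injected (LoRA) parameters actually reach it.
if "lora" in key:
    print(f"[debug] special branch entered for injected parameter: {key}")
```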
Thank you, the previous problem was solved, but a new problem appeared:

[INFO] Tuning begins...
Traceback (most recent call last):
  File "tune_cpm_ant_load_checkpoint.py", line 70, in <module>
    tune = config_dict["tune"](
  File "/home/liweiqing/CPM-Live_plus/CPM-Live/cpm-live/examples/tune.py", line 225, in __init__
    super().__init__(**kwargs)
  File "/home/liweiqing/CPM-Live_plus/CPM-Live/cpm-live/examples/tune.py", line 57, in __init__
    self.optimizer = bmt.optim.AdamOffloadOptimizer(
TypeError: __init__() got an unexpected keyword argument 'scale'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 58593) of binary: /home/liweiqing/anaconda3/envs/cpm/bin/python
Traceback (most recent call last):
  File "/home/liweiqing/anaconda3/envs/cpm/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/liweiqing/anaconda3/envs/cpm/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/liweiqing/anaconda3/envs/cpm/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/liweiqing/anaconda3/envs/cpm/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/liweiqing/anaconda3/envs/cpm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/liweiqing/anaconda3/envs/cpm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
We have adapted this project to the latest version of BMTrain; this issue is solved.
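For readers hitting the same TypeError: recent BMTrain releases removed the scale argument from the optimizer constructors and moved loss scaling into an optimizer manager. A minimal sketch of the newer pattern (the exact arguments are assumptions to verify against your BMTrain version and the updated example code):

```python
import bmtrain as bmt

# Newer BMTrain: no `scale` kwarg on the optimizer itself.
optimizer = bmt.optim.AdamOffloadOptimizer(model.parameters(), weight_decay=0.01)

# Loss scaling is handled by an OptimManager instead.
optim_manager = bmt.optim.OptimManager(loss_scale=1024)
optim_manager.add_optimizer(optimizer)

# A training step then goes through the manager:
#   optim_manager.zero_grad()
#   optim_manager.backward(loss)
#   optim_manager.step()
```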