Maybe PR https://github.com/OpenBMB/BMTrain/pull/77 is related; you can try the version of BMTrain from that PR.
I currently have two tasks, task1 and task2, where task2 needs to be trained on top of the checkpoint produced by task1. The best.pt generated after training task1 can be loaded during task1's inference, but it cannot be loaded during task2's tuning. The keys in the model's state_dict and in best.pt's state_dict are exactly the same, yet I get the error Unexpected key(s) in state_dict: "generator.encoder.layers.0.self_att.self_attention.project_q.lora.lora_A". What could be causing this?
How do you load the checkpoint produced by task1? Directly like this: model.load_state_dict(torch.load(args.LoRA_path), strict=False), or in some other way? Loading this way does not raise an error for me, but the test results are very poor, so it feels like the weights were never actually loaded.
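One quick way to check whether the weights were actually applied is to inspect the return value of load_state_dict and compare a parameter before and after loading. A minimal sketch in plain PyTorch, reusing model and args.LoRA_path from the snippet above (the key name is only an example and may need adjusting to your checkpoint; BMTrain's checkpoint blocks can change how parameters are exposed, so treat this as a rough sanity check):

```python
import torch

# Snapshot one LoRA parameter before loading (pick any key that exists in your checkpoint).
key = "encoder.layers.30.self_att.self_attention.project_q.lora.lora_A"
before = dict(model.named_parameters())[key].detach().clone()

# With strict=False, mismatched keys are silently ignored, so print them explicitly.
result = model.load_state_dict(torch.load(args.LoRA_path), strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)

# If the parameter is unchanged, the checkpoint was never actually applied to it.
after = dict(model.named_parameters())[key].detach()
print("parameter changed:", not torch.equal(before, after))
```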
Have you tried the https://github.com/OpenBMB/BMTrain/pull/77 version? I have tested it.
Hi, I tried the method you provided and it still doesn't work. I have a question: the keys of the LoRA weights I saved are encoder.layers.30.self_att.self_attention.project_q.lora.lora_A and encoder.layers.30.self_att.self_attention.project_q.lora.lora_B,
but the weight keys the model expects look like encoder.layers.0.self_att.self_attention.project_q.weight.
What causes this inconsistency?
encoder.layers.0.self_att.self_attention.project_q.weight is in the LLM, while encoder.layers.30.self_att.self_attention.project_q.lora.lora_A is in LoRA.
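For reference, a quick way to see which keys in a checkpoint belong to the base LLM and which were injected by OpenDelta as LoRA weights (a minimal sketch; best.pt is the checkpoint file mentioned earlier):

```python
import torch

state_dict = torch.load("best.pt", map_location="cpu")

# Keys containing ".lora." are the injected LoRA parameters; the rest are base-model weights.
lora_keys = [k for k in state_dict if ".lora." in k]
base_keys = [k for k in state_dict if ".lora." not in k]
print(len(lora_keys), "LoRA keys, e.g.", lora_keys[:2])
print(len(base_keys), "base-model keys, e.g.", base_keys[:2])
```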
Natively, BMTrain puts the model parameters into something called a "checkpoint block", and only the parameters inside a checkpoint block are loaded when bmt.load is used. So the parameters that OpenDelta injects later cannot be loaded.
In https://github.com/OpenBMB/BMTrain/pull/77, we add a special if branch to handle these later-inserted parameters.
If it still does not work, could you please go into the code where your BMTrain is installed, add a print after the line https://github.com/OpenBMB/BMTrain/pull/77/files#diff-3d2ebf2a27c806bc06c8d5506d93ca127d2ae907832a523fd1b53e62d8a59e51R532, and check whether the special if branch is entered when the parameter name includes "lora".
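For illustration only, the debug print suggested above could look roughly like this; the variable name key is an assumption about the loop in BMTrain's load routine, not the actual code in the PR:

```python
# Hypothetical illustration: inside the special branch added in PR #77,
# confirm that later-injected (LoRA) parameters actually reach it.
if "lora" in key:
    print(f"[debug] special branch entered for injected parameter: {key}")
```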
Thank you, the previous problem was solved, but a new problem appeared:

[INFO] Tuning begins...
Traceback (most recent call last):
  File "tune_cpm_ant_load_checkpoint.py", line 70, in <module>
    tune = config_dict["tune"](
  File "/home/liweiqing/CPM-Live_plus/CPM-Live/cpm-live/examples/tune.py", line 225, in __init__
    super().__init__(**kwargs)
  File "/home/liweiqing/CPM-Live_plus/CPM-Live/cpm-live/examples/tune.py", line 57, in __init__
    self.optimizer = bmt.optim.AdamOffloadOptimizer(
TypeError: __init__() got an unexpected keyword argument 'scale'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 58593) of binary: /home/liweiqing/anaconda3/envs/cpm/bin/python
Traceback (most recent call last):
  File "/home/liweiqing/anaconda3/envs/cpm/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/liweiqing/anaconda3/envs/cpm/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/liweiqing/anaconda3/envs/cpm/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/liweiqing/anaconda3/envs/cpm/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/liweiqing/anaconda3/envs/cpm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/liweiqing/anaconda3/envs/cpm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
We have adapted this project to the latest version of BMTrain; this issue is solved.
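For readers hitting the same TypeError: recent BMTrain releases removed the scale argument from the optimizer constructors and moved loss scaling into an optimizer manager. A minimal sketch of the newer pattern (the exact arguments are assumptions to verify against your BMTrain version and the updated example code):

```python
import bmtrain as bmt

# Newer BMTrain: no `scale` kwarg on the optimizer itself.
optimizer = bmt.optim.AdamOffloadOptimizer(model.parameters(), weight_decay=0.01)

# Loss scaling is handled by an OptimManager instead.
optim_manager = bmt.optim.OptimManager(loss_scale=1024)
optim_manager.add_optimizer(optimizer)

# A training step then goes through the manager:
#   optim_manager.zero_grad()
#   optim_manager.backward(loss)
#   optim_manager.step()
```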