THUDM / CogVLM

a state-of-the-art-level open visual language model | multimodal pretrained model
Apache License 2.0

Error when merging LoRA model #451

Open HarrytheOrange opened 7 months ago

HarrytheOrange commented 7 months ago

System Info

4x A800

Who can help?

@1049451037

Information

Reproduction

torchrun --standalone --nnodes=1 --nproc-per-node=4 utils/merge_model.py --version base --from_pretrained /mnt/cache/huangzhiyuan/thudm/CogVLM-photograph/checkpoints/finetune-cogvlm-base-490-04-10-12-50

Traceback (most recent call last):
  File "/mnt/cache/huangzhiyuan/thudm/CogVLM-photograph/utils/merge_model.py", line 42, in <module>
    main()
  File "/mnt/cache/huangzhiyuan/thudm/CogVLM-photograph/utils/merge_model.py", line 23, in main
    model, model_args = FineTuneTestCogVLMModel.from_pretrained(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/sat/model/base_model.py", line 257, in from_pretrained
    mp_merge_model_rank0(model, model_full)
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/sat/mpu/operation.py", line 112, in mp_merge_model_rank0
    iter_merge(model, model_full)
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/sat/mpu/operation.py", line 111, in iter_merge
    iter_merge(sub_new_model, sub_module)
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/sat/mpu/operation.py", line 111, in iter_merge
    iter_merge(sub_new_model, sub_module)
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/sat/mpu/operation.py", line 111, in iter_merge
    iter_merge(sub_new_model, sub_module)
  [Previous line repeated 5 more times]
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/sat/mpu/operation.py", line 110, in iter_merge
    p.data.copy_(torch.clone(np.data.cpu()).detach())
RuntimeError: The size of tensor a (1792) must match the size of tensor b (448) at non-singleton dimension 0
[2024-04-10 22:28:49,072] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1042137 closing signal SIGTERM
[2024-04-10 22:28:49,072] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1042138 closing signal SIGTERM
[2024-04-10 22:28:49,073] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1042139 closing signal SIGTERM
[2024-04-10 22:28:49,964] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1042136) of binary: /mnt/cache/huangzhiyuan/env/thudm/bin/python
Traceback (most recent call last):
  File "/mnt/cache/huangzhiyuan/env/thudm/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.0', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

utils/merge_model.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-04-10_22:28:49
  host : pt-ryutnbhj-worker-0.pt-ryutnbhj.ns-operations-a5acdc67.svc.cluster.local
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 1042136)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Expected behavior

The model should merge without errors.

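A note on the numbers in the RuntimeError: the two mismatched sizes (1792 and 448) differ by a factor of 4, which equals the `--nproc-per-node=4` value in the command. That is consistent with one tensor being a full parameter and the other a 4-way shard of it, but this is only one reading and is not confirmed anywhere in the thread. The check can be sketched as follows; `shard_ratio` is a hypothetical helper written for this illustration and is not part of `sat` or CogVLM.

```python
def shard_ratio(full_size: int, shard_size: int):
    """Return full_size // shard_size if it divides evenly, else None.

    Hypothetical helper for illustration only; not part of sat or CogVLM.
    """
    if shard_size <= 0 or full_size % shard_size != 0:
        return None
    return full_size // shard_size


# Values copied from the RuntimeError and the torchrun command in this issue.
size_a = 1792        # "size of tensor a"
size_b = 448         # "size of tensor b"
nproc_per_node = 4   # from --nproc-per-node=4

ratio = shard_ratio(size_a, size_b)
matches_world_size = (ratio == nproc_per_node)
print(f"ratio = {ratio}, matches nproc-per-node: {matches_world_size}")
```

If the ratio matches the process count, comparing the model-parallel size used for fine-tuning with the one used for merging may be a reasonable next thing to look at.
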
elesun2018 commented 6 months ago

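Can other models be merged? For the merge to work, the original model's parameters need to contain both the original attention QKV weight and the attention QKV LoRA matrices A and B; after merging, they are combined into a single attention QKV weight.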
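The merge described in the comment above can be sketched with the standard LoRA formulation, in which the merged weight is W + scaling * (B @ A). This is a minimal pure-Python illustration under that assumption. How `sat` actually stores and merges the QKV matrices is not shown in this thread, and the names `merge_lora`, `lora_a`, `lora_b`, and `scaling` are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def merge_lora(weight, lora_a, lora_b, scaling=1.0):
    """Return W + scaling * (B @ A) as a new nested list.

    weight: (out x in), lora_b: (out x r), lora_a: (r x in).
    Hypothetical helper for illustration; not the sat implementation.
    """
    delta = matmul(lora_b, lora_a)
    # The low-rank update must have exactly the same shape as the weight.
    if len(delta) != len(weight) or len(delta[0]) != len(weight[0]):
        raise ValueError("LoRA update shape does not match weight shape")
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy example: a 2x3 weight with a rank-1 update.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]   # (r x in)  = 1 x 3
lora_b = [[0.5], [1.0]]      # (out x r) = 2 x 1

merged = merge_lora(weight, lora_a, lora_b, scaling=2.0)
print(merged)  # [[2.0, 2.0, 3.0], [2.0, 5.0, 6.0]]
```

The explicit shape check mirrors the kind of failure reported in this issue: if the stored weight and the update computed from A and B disagree in any dimension, the merge cannot proceed.
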