THUDM / VisualGLM-6B

Chinese and English multimodal conversational language model | 多模态中英双语对话语言模型
Apache License 2.0

Please help: dimension mismatch when running cli_demo.py after fine-tuning; RuntimeError: The size of tensor a (12288) must match the size of tensor b (25165824) at non-singleton dimension 0 #339

Open New-start-man opened 5 months ago

New-start-man commented 5 months ago

Fine-tuning finished without errors, so why does loading the model still fail?

```
python cli_demo.py --from_pretrained "checkpoints/finetune-visualglm-6b-01-24-20-02/" --quant 4
[2024-01-25 10:40:45,884] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to:
https://github.com/TimDettmers/bitsandbytes/issues

bin /home/anaconda3/envs/pytorch/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /home/anaconda3/envs/pytorch did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/anaconda3/envs/pytorch/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
[2024-01-25 10:40:53,953] [INFO] building FineTuneVisualGLMModel model ...
[2024-01-25 10:40:53,955] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-01-25 10:40:53,955] [INFO] [RANK 0] You didn't pass in LOCAL_WORLD_SIZE environment variable. We use the guessed LOCAL_WORLD_SIZE=1. If this is wrong, please pass the LOCAL_WORLD_SIZE manually.
[2024-01-25 10:40:53,956] [INFO] [RANK 0] You are using model-only mode. For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
[2024-01-25 10:41:01,778] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-01-25 10:41:02,129] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-01-25 10:41:02,481] [INFO] [RANK 0] replacing chatglm linear layer with 4bit
[2024-01-25 10:41:30,816] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7802848768
[2024-01-25 10:41:33,995] [INFO] [RANK 0] global rank 0 is loading checkpoint checkpoints/finetune-visualglm-6b-01-24-20-02/300/mp_rank_00_model_states.pt
Traceback (most recent call last):
  File "/home/PycharmProjects/VisualGLM-6B/cli_demo.py", line 103, in <module>
    main()
  File "/home/PycharmProjects/VisualGLM-6B/cli_demo.py", line 30, in main
    model, model_args = AutoModel.from_pretrained(
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/sat/model/base_model.py", line 338, in from_pretrained
    return cls.from_pretrained_base(name, args=args, home_path=home_path, url=url, prefix=prefix, build_only=build_only, overwrite_args=overwrite_args, **kwargs)
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/sat/model/base_model.py", line 332, in from_pretrained_base
    load_checkpoint(model, args, load_path=model_path, prefix=prefix)
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/sat/training/model_io.py", line 273, in load_checkpoint
    missing_keys, unexpected_keys = module.load_state_dict(sd['module'], strict=False)
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2027, in load_state_dict
    load(self, state_dict)
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2015, in load
    load(child, child_state_dict, child_prefix)
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2015, in load
    load(child, child_state_dict, child_prefix)
  [Previous line repeated 3 more times]
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2009, in load
    module._load_from_state_dict(
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/sat/model/finetune/lora2.py", line 49, in _load_from_state_dict
    self.weight.data.copy_(state_dict[prefix+'weight'])
RuntimeError: The size of tensor a (12288) must match the size of tensor b (25165824) at non-singleton dimension 0
```
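(Not a fix, just a diagnostic sketch.) This class of error means a tensor in the checkpoint has a different shape than the corresponding parameter in the freshly built model, which typically happens when the model is constructed with a different quantization or LoRA configuration than the one the checkpoint was saved with. Before `load_state_dict` attempts the copy, you can list every disagreeing parameter yourself. `find_shape_mismatches` below is a hypothetical helper, not part of SwissArmyTransformer; the toy model and checkpoint only mimic the situation:

```python
import torch
import torch.nn as nn

def find_shape_mismatches(model, state_dict):
    """Return (key, model_shape, checkpoint_shape) for every parameter whose
    shapes disagree -- the condition behind the 'size of tensor a must match
    the size of tensor b' RuntimeError raised during load_state_dict."""
    mismatches = []
    model_sd = model.state_dict()
    for key, ckpt_tensor in state_dict.items():
        if key in model_sd and model_sd[key].shape != ckpt_tensor.shape:
            mismatches.append(
                (key, tuple(model_sd[key].shape), tuple(ckpt_tensor.shape))
            )
    return mismatches

# Toy demo: the checkpoint holds a flattened weight while the model expects
# a 2-D one, so the helper reports exactly one mismatched key.
model = nn.Linear(4, 3)
ckpt = {"weight": torch.zeros(12), "bias": torch.zeros(3)}
print(find_shape_mismatches(model, ckpt))  # → [('weight', (3, 4), (12,))]
```

For a real checkpoint, `state_dict` would come from something like `torch.load(path)["module"]`; comparing the mismatched keys against the training-time flags (e.g. whether `--quant 4` was used during fine-tuning) usually narrows down the configuration difference.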

xiongxiaochu commented 5 months ago

Could you share the model files you used for training? I'm on a Mac and can't install triton, so I can't use the sat code to download the model, and the open-source weights released by Tsinghua seem to have problems...

New-start-man commented 5 months ago

> Could you share the model files you used for training? I'm on a Mac and can't install triton, so I can't use the sat code to download the model, and the open-source weights released by Tsinghua seem to have problems...

Hi, how would I send you the trained model? There doesn't seem to be a good way to share the fine-tuned model files.