Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model, a low-resource Chinese LLaMA + LoRA approach with a structure modeled on Alpaca
https://github.com/Facico/Chinese-Vicuna
Apache License 2.0

Multi-GPU finetune_chat fails with "mat1 and mat2 shapes cannot be multiplied (1024x2 and 1x11008)" #240

Open 18065013 opened 1 year ago

18065013 commented 1 year ago

Training runs fine on a single GPU, but with no code changes the multi-GPU run fails with: mat1 and mat2 shapes cannot be multiplied (1024x2 and 1x11008). Any help appreciated.

18065013 commented 1 year ago

(vicuna_training_310) root@ubuntu-3090x2:/home/huwei/training/Chinese-Vicuna# python finetune_chat.py --data_path sample/merge_split_s1.json --model_path decapoda-research/llama-7b-hf --test_size=10
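A note on the launch command (my reading, not confirmed against finetune_chat.py): running with plain python leaves WORLD_SIZE unset, so if the script follows the usual alpaca-lora convention of only enabling DDP when WORLD_SIZE > 1, the HF Trainer will see both 3090s and fall back to nn.DataParallel. A minimal sketch to check which branch you are on, assuming that convention:

import os
import torch

# DDP is only active when the script is launched through torchrun / torch.distributed,
# which export WORLD_SIZE; a bare `python finetune_chat.py ...` leaves it unset.
world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1

print(f"visible GPUs      : {torch.cuda.device_count()}")
print(f"WORLD_SIZE        : {world_size}")
print(f"DDP branch active : {ddp}")

if not ddp and torch.cuda.device_count() > 1:
    # Without DDP, transformers' Trainer sees n_gpu > 1 and wraps the model in
    # nn.DataParallel, which is the code path visible in the traceback below.
    print(f"Trainer would fall back to nn.DataParallel across {torch.cuda.device_count()} GPUs")

On a 2x3090 box launched this way, the check should report the DDP branch as inactive, which matches the DataParallel frames in the traceback further down.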

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. The class this function is called from is 'LlamaTokenizer'.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. The class this function is called from is 'LlamaTokenizer'.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Using custom data configuration default-9e71a4e6f9c8d1a3
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-9e71a4e6f9c8d1a3/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%| 1/1 [00:00<00:00, 759.84it/s]
100%| 5/5 [00:00<00:00, 191.43ex/s]
Loading checkpoint shards: 100%| 33/33 [00:10<00:00, 3.13it/s]
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.

normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.

1: 100%| 625/625 [00:03<00:00, 201.15ex/s]
0: 100%| 625/625 [00:03<00:00, 197.08ex/s]
3: 100%| 625/625 [00:03<00:00, 198.10ex/s]
2: 100%| 625/625 [00:03<00:00, 193.27ex/s]

The following columns in the training set don't have a corresponding argument in PeftModelForCausalLM.forward and have been ignored: input, output. If input, output are not expected by PeftModelForCausalLM.forward, you can safely ignore this message.
/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning

***** Running training *****
  Num examples = 2500
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 32
  Total optimization steps = 19
  Number of trainable parameters = 19988480

0%| | 0/19 [00:00<?, ?it/s]
/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.

Traceback (most recent call last):
  /home/huwei/training/Chinese-Vicuna/finetune_chat.py:273 in <module>
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
  /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/trainer.py:1636 in train
    return inner_training_loop(
  /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/trainer.py:1903 in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/trainer.py:2649 in training_step
    loss = self.compute_loss(model, inputs)
  /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/trainer.py:2681 in compute_loss
    outputs = model(**inputs)
  /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/modules/module.py:1194 in _call_impl
    return forward_call(*input, **kwargs)
  /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py:171 in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py:181 in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py:89 in parallel_apply
    output.reraise()
  /root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/_utils.py:543 in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.

Original Traceback (most recent call last):
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/peft/peft_model.py", line 529, in forward
    return self.base_model(
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 607, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 603, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 321, in forward
    hidden_states = self.mlp(hidden_states)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 151, in forward
    return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/peft/tuners/lora.py", line 522, in forward
    result = super().forward(x)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/root/anaconda3/envs/vicuna_training_310/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
    output += torch.matmul(subA, state.subB)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1024x2 and 1x11008)
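For anyone hitting the same trace: the error is raised in replica 0, i.e. inside the copy of the 8-bit LoRA model that nn.DataParallel made for GPU 0, and it surfaces in bitsandbytes' MatMul8bitLt when it multiplies the outlier slice subA (1024x2) against state.subB (1x11008) in a LLaMA MLP projection, which suggests the int8 quantization state does not survive the DataParallel replication. A workaround used in alpaca-lora-style scripts is to tell the Trainer the model is already parallelized so it never builds a DataParallel wrapper; whether finetune_chat.py already contains such a guard is not verified here, and the helper name below is made up for illustration:

import os
import torch

def skip_data_parallel_if_needed(model):
    # Sketch of the alpaca-lora-style guard (an assumption, not quoted from finetune_chat.py):
    # when more than one GPU is visible but DDP (torchrun) is not in use, mark the model as
    # already model-parallel so transformers' Trainer sets its n_gpu to 1 and does not wrap
    # the model in nn.DataParallel, which mishandles the bitsandbytes int8 weight state.
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    ddp = world_size != 1
    if not ddp and torch.cuda.device_count() > 1:
        model.is_parallelizable = True
        model.model_parallel = True
    return model

Calling something like this on the PeftModel before constructing the Trainer keeps the Trainer from building the DataParallel wrapper (training then runs on a single GPU unless the model itself is sharded). The other routes are launching through torchrun --nproc_per_node=2 finetune_chat.py ... so the DDP branch is taken, or hiding one card with CUDA_VISIBLE_DEVICES=0.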

18065013 commented 1 year ago
1