dvlab-research / LongLoRA

Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)
http://arxiv.org/abs/2309.12307
Apache License 2.0

Error about finetuning lora #19

Closed. zhangtianshu closed this issue 11 months ago

zhangtianshu commented 11 months ago

Thanks for the brilliant work!

An error was raised when I tried to fine-tune the Llama-2 13B 8k LoRA weights. Could you let me know how to solve it?

File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 848, in forward shift_logits = shift_logits.view(-1, self.config.vocab_size) File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl result = forward_call(*args, kwargs) File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 94, in forward return self.model.forward(*args, *kwargs) File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 848, in forward shift_logits = shift_logits.view(-1, self.config.vocab_size) File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/peft/peft_model.py", line 918, in forward return self.base_model( RuntimeError: shape '[-1, 0]' is invalid for input of size 85058658 File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 848, in forward shift_logits = shift_logits.view(-1, self.config.vocab_size) RuntimeError: shape '[-1, 0]' is invalid for input of size 101923185 File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl result = forward_call(args, kwargs) RuntimeError: shape '[-1, 0]' is invalid for input of size 130148067 File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 94, in forward return self.model.forward(*args, **kwargs) File "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 848, in forward shift_logits = shift_logits.view(-1, self.config.vocab_size) RuntimeError: shape '[-1, 0]' is invalid for input of size 85154661

yukang2017 commented 11 months ago

Hi,

Thanks for your interest in our work.

Do you mean further fine-tuning upon Llama-2-7b-longlora-8k or Llama-2-7b-longlora-8k-ft? Would you please show me your entire script?

Thanks!

Regards, Yukang Chen

zhangtianshu commented 11 months ago

Thanks for your response! I mean Llama-2-7b-longlora-8k.

The script is here:

```bash
torchrun --nproc_per_node=4 supervised_fine_tune_rel_extraction.py \
    --model_name_or_path ../../../../fs/scratch/PAA0201/tables/Llama-2-13b_longlora_8k_merged \
    --bf16 True \
    --output_dir ../../../../fs/scratch/PAA0201/tables/Llama-2-13b-longlora-8k-ft_rel_extraction_full_train_3epochs_lora \
    --model_max_length 8192 \
    --use_flash_attn True \
    --data_path /users/PAA0201/shubaobao/LongLoRA/train_data/train_temp.json \
    --low_rank_training True \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0.0 \
    --warmup_steps 20 \
    --lr_scheduler_type "constant_with_warmup" \
    --logging_steps 1 \
    --deepspeed "ds_configs/stage3.json" \
    --tf32 True
```

yukang2017 commented 11 months ago

Hi,

I think the problem is the "vocab_size". Would you please check whether there is a config.json file in your "dir_to/Llama-2-13b_longlora_8k_merged" directory, and whether the vocab_size in that config.json is 32001 or 32000? It should be 32001.
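As a quick check, something like the following should print the value (a minimal sketch; the path is a placeholder for your merged checkpoint directory):

```python
import json

# Placeholder path: point this at the merged checkpoint directory.
config_path = "dir_to/Llama-2-13b_longlora_8k_merged/config.json"

with open(config_path) as f:
    config = json.load(f)

# LongLoRA's fine-tuning adds an extra pad token, so the merged
# checkpoint is expected to report 32001 rather than the base 32000.
print(config.get("vocab_size"))
```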

If those are all right, would you please try modifying "/users/PAA0201/shubaobao/anaconda3/envs/longlora_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py" at line 848, changing self.config.vocab_size to self.config.vocab_size + 1?
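For reference, the edited line would sit roughly here inside LlamaForCausalLM.forward (a sketch only; the surrounding lines may differ slightly between transformers versions):

```python
# transformers/models/llama/modeling_llama.py, LlamaForCausalLM.forward
# (around line 848 in the installed version; nearby lines may differ).
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
loss_fct = CrossEntropyLoss()
# Workaround: reshape with the extra token added during LongLoRA
# fine-tuning instead of the base vocab size reported by the config.
shift_logits = shift_logits.view(-1, self.config.vocab_size + 1)
shift_labels = shift_labels.view(-1)
loss = loss_fct(shift_logits, shift_labels)
```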

Please let me know whether this works.

Regards, Yukang Chen

yukang2017 commented 11 months ago

I will close this issue as it has been inactive for several days. Please feel free to re-open it if there are any other things to discuss.

Klein73 commented 11 months ago

GPT-NeoX also runs into this problem. The error is raised here: query = shift(query, self.num_heads, self.head_dim).contiguous()

I'll paste a shortened version of the error:

```
File "/root/.cache/huggingface/modules/transformers_modules/antlaw_v110/modeling_qwen.py", line 401, in forward
    query = shift(query, self.num_heads, self.head_dim).contiguous()
File "/root/.cache/huggingface/modules/transformers_modules/antlaw_v110/modeling_qwen.py", line 397, in shift
    qkv = qkv.reshape(bsz * num_group, group_size, num_heads, head_dim).transpose(1, 2)
RuntimeError: shape '[0, 16384, 40, 128]' is invalid for input of size 332800
```

Could you please take a look at this as well? I'm not sure whether the problem is in my group_size code; I modified it following another issue:

```python
group_size_ratio = 1/8
sft_group_size = 16384
if q_len % 2048 == 0:
    group_size = int(q_len * group_size_ratio)
else:
    group_size = sft_group_size
num_group = q_len // group_size
```

Klein73 commented 11 months ago

@yukang2017 I think I have roughly found the problem. When q_len is not divisible by 2048, setting group_size to 8192 or 16384 (i.e. max_seq_length) makes the reshape mismatch the dimensions of query, key, and value. Should it be written like this instead:

```python
if q_len % 2048 == 0:
    group_size = int(q_len * group_size_ratio)
else:
    group_size = q_len  # unlike gptneox: let group_size equal q_len
```
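For illustration, here is how that change plays out in the grouped reshape that fails above (a sketch only; q_len, bsz, num_heads, head_dim and qkv are assumed to come from the surrounding attention forward):

```python
# Sketch of the group-size logic with the proposed fallback.
group_size_ratio = 1 / 8              # ratio used in the snippet above
if q_len % 2048 == 0:
    group_size = int(q_len * group_size_ratio)
else:
    group_size = q_len                # one group covering the whole sequence
num_group = q_len // group_size       # no longer 0, so bsz * num_group >= bsz

# The reshape that previously produced target shape '[0, 16384, 40, 128]':
# with group_size dividing q_len, the target shape now matches the input size.
qkv = qkv.reshape(bsz * num_group, group_size, num_heads, head_dim).transpose(1, 2)
```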