PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0
12k stars, 2.93k forks

[Bug]: llm merge_lora_params does not save the merged weights after merging #8575

Closed sanbuphy closed 1 month ago

sanbuphy commented 3 months ago

Software environment

- paddlepaddle: 
- paddlepaddle-gpu:  develop 
- paddlenlp: latest 162d8d31c84f60b804a0abeee8f4f1e4b32308ef

Duplicate issues

Bug description

I used llm merge_lora_params.py to merge a model trained with QLoRA, but no merged model was produced: nothing appeared in the output folder.

Steps to reproduce & code

python merge_lora_params.py \
    --model_name_or_path FlagAlpha/Llama2-Chinese-7b-Chat \
    --lora_path /home/aistudio/data/checkpoints/llama_lora_ckpts/checkpoint-286 \
    --merge_lora_model_path /home/aistudio/data/llama_lora_merge \
    --device "gpu" \
    --low_gpu_mem True

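It seems to hang at the loading stage, and after a while the process simply exits. (I suspect it ran out of memory, but that seems unlikely on an AI Studio 32 GB V100 dev machine.)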
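
For a rough sense of scale, the memory needed for the weights alone can be estimated from the parameter count. This is only a back-of-the-envelope sketch: the 7e9 parameter count is an assumption based on the "7b" in the model name, and the estimate ignores any temporary copies or LoRA tensors the merge script may hold, so the real peak could be higher. The thread also does not state how much host RAM the machine has.

```python
# Rough estimate of the memory needed just to hold the model weights.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params, dtype):
    """Return the size of the weights in GiB for a given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    num_params = 7e9  # assumed nominal size of a "7b" model
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(num_params, dtype):.1f} GiB")
```

By this estimate a single float16 copy (about 13 GiB) fits comfortably in 32 GB, while a float32 copy (about 26 GiB) leaves little headroom for anything else.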

However, this is not a problem with the LoRA weights themselves, because they load fine for dynamic-graph inference:

python predictor.py --model_name_or_path FlagAlpha/Llama2-Chinese-7b-Chat \
                    --data_file /home/aistudio/data/dummy/dev.json --dtype float16 \
                    --lora_path /home/aistudio/data/checkpoints/llama_lora_ckpts/checkpoint-286

DesmonDay commented 3 months ago

Could you pull the code out and run it in isolation? If you only load the llama parameters with from_pretrained, does that load normally?
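
One way to isolate the loading step is a standalone script that does nothing but call from_pretrained. The following is a minimal sketch, assuming PaddleNLP is installed; the helper name load_base_model is hypothetical, and the float16 default is borrowed from the predictor.py command in the report rather than from the merge script.

```python
import time


def load_base_model(model_name_or_path, dtype="float16"):
    """Load only the base model (no LoRA) and report how long it took."""
    # Imported lazily so this file can be imported without PaddleNLP present.
    from paddlenlp.transformers import AutoModelForCausalLM

    start = time.perf_counter()
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, dtype=dtype)
    print(f"loaded {model_name_or_path} in {time.perf_counter() - start:.1f}s")
    return model


# Example call, left commented out because it loads the full checkpoint:
# model = load_base_model("FlagAlpha/Llama2-Chinese-7b-Chat")
```

If this step alone also ends with the process being killed, the problem is in loading the base model rather than in the merge logic.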

sanbuphy commented 3 months ago

I'll give it a try, but I suspect the process was simply killed. Still, surely 32 GB should be enough? Very strange.

DesmonDay commented 3 months ago

Right. If it was Killed, an environment problem can't be ruled out; someone else may be using the machine at the same time.

sanbuphy commented 3 months ago

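I found the problem. When the model_name_or_path field is passed, the merged model is not saved properly; only a JSON file is written.

After removing it, the merge works normally. QLoRA shouldn't have any effect on this, should it? It looks like the field itself is causing the problem.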
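
To confirm whether a merge run actually wrote weights or only a JSON file, the output folder can be inspected with a few lines of standard-library Python. This is a sketch, not part of PaddleNLP: the helper name summarize_output_dir is hypothetical, and the weight-file suffixes listed are an assumption that may need adjusting for your PaddleNLP version.

```python
from pathlib import Path

# Assumed suffixes for saved weight files; adjust if your version differs.
WEIGHT_SUFFIXES = (".pdparams", ".safetensors")


def summarize_output_dir(path):
    """List the files under `path` and flag whether any look like weights."""
    root = Path(path)
    if not root.is_dir():
        return {"exists": False, "files": [], "weights": [], "has_weights": False}
    files = sorted(p for p in root.rglob("*") if p.is_file())
    weights = [p for p in files if p.suffix in WEIGHT_SUFFIXES]
    return {
        "exists": True,
        "files": [str(p.relative_to(root)) for p in files],
        "weights": [str(p.relative_to(root)) for p in weights],
        "has_weights": bool(weights),
    }


if __name__ == "__main__":
    # Output path taken from the merge command in the report.
    print(summarize_output_dir("/home/aistudio/data/llama_lora_merge"))
```

Running it once with and once without --model_name_or_path makes the difference between the two runs easy to compare.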

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 14 days since being marked as stale.