PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.

https://paddlenlp.readthedocs.io

Apache License 2.0


[Bug]: Merging a fine-tuned model on AI Studio gets killed #8694

Open Yang-Changhui opened 1 month ago

Yang-Changhui commented 1 month ago

Software environment

- paddlepaddle-gpu: 0.0.0.post118
- paddlenlp: 2.8.0.post0

Duplicate issues

[X] I have searched the existing issues

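Error description

Hardware: AI Studio V100 32G
Fine-tuned model: THUDM/chatglm2-6b

During model merging, CPU memory usage keeps rising until memory is exhausted and the process is killed, while GPU memory utilization stays very low. What causes this?
How can I reduce CPU memory usage during merging? Thanks.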
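
To confirm where host memory grows, one option is to log the process's peak resident set size at a few points in the merge script. The following is a minimal sketch using only the Python standard library on Linux or macOS; `peak_rss_mib` and `log_peak` are hypothetical helper names and are not part of PaddleNLP.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_peak(stage: str) -> str:
    """Print and return a one-line report; `stage` is a free-form label."""
    line = f"[{stage}] peak host RSS: {peak_rss_mib():.1f} MiB"
    print(line)
    return line


if __name__ == "__main__":
    # Hypothetical call sites: before loading the base model, after loading,
    # and after merging the LoRA weights.
    log_peak("start")
```

Calling `log_peak` before and after each stage shows which step pushes host memory past the limit of the instance.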

Steps to reproduce & code

Model fine-tuning

python finetune_generation.py chatglm2/lora_argument.json

Configuration file changes

{
  "dataset_name_or_path": "/home/aistudio/dataset",
  "per_device_train_batch_size": 1,
  "zero_padding": true,
  "use_flash_attention": true,
  "weight_quantize_algo": "nf4"
}

Model merging

python merge_lora_params.py \
  --lora_path ./checkpoints/chatglm2_lora_ckpts/checkpoint-204 \
  --merge_lora_model_path ./checkpoints/chatglm2_lora_merge \
  --device "gpu" \
  --low_gpu_mem True

Merge error: [screenshot titled "无标题" (Untitled)]

wawltor commented 1 month ago

During parameter merging, the parameters have to be merged in host (CPU) memory. You can enable unified checkpoint to avoid the parameter merging step.
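
As a rough illustration of why host memory can run out, the sketch below estimates how much RAM the model weights alone occupy in different dtypes. It assumes about 6.2e9 parameters for chatglm2-6b and ignores quantization constants, optimizer state, and any temporary copies made while merging, so real usage is likely higher. `weights_gib` is a hypothetical helper, not a PaddleNLP function.

```python
GIB = 2 ** 30

# Bytes per parameter for each storage format (nf4 packs two weights per byte;
# its per-block quantization constants are ignored here).
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "bfloat16": 2.0, "nf4": 0.5}


def weights_gib(num_params: float, dtype: str) -> float:
    """Estimate the memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


if __name__ == "__main__":
    num_params = 6.2e9  # assumption: approximate size of chatglm2-6b
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weights_gib(num_params, dtype):6.2f} GiB")
```

If the merge holds both the dequantized base weights and the merged result in memory at once, the peak can approach twice the float16 figure, which may exceed the RAM available on the instance.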

Yang-Changhui commented 1 month ago

OK. One more question: with the latest paddlenlp 3.0 and the same configuration, fine-tuning fails with the following error after training for a while. With paddlenlp 2.8 the error does not occur:

OSError: (External) OSError: (External) CUBLAS error(14). [Hint: 'CUBLAS_STATUS_INTERNAL_ERROR'. An internal cuBLAS operation failed. This error is usually caused by a cudaMemcpyAsync() failure. To correct: check that the hardware, an appropriate version of the driver, and the cuBLAS library are correctly installed. Also, check that the memory passed as a parameter to the routine is not being deallocated prior to the routine’s completion. ] (at /paddle/paddle/phi/kernels/funcs/blas/blas_impl.cu.h:1753)

wawltor commented 1 month ago

Did training run out of GPU memory (OOM)?

Yang-Changhui commented 1 month ago

No. According to AI Studio's built-in monitor, GPU memory usage did not even reach half.

Yang-Changhui commented 1 month ago

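> During parameter merging, the parameters have to be merged in host (CPU) memory. You can enable unified checkpoint to avoid the parameter merging step.

That seems to be usable only during training. If unified checkpoint is enabled, is merging no longer needed? Also, the configuration file in paddlenlp 2.8.0 does not have this parameter.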

wawltor commented 1 month ago

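> > During parameter merging, the parameters have to be merged in host (CPU) memory. You can enable unified checkpoint to avoid the parameter merging step.
>
> That seems to be usable only during training. If unified checkpoint is enabled, is merging no longer needed? Also, the configuration file in paddlenlp 2.8.0 does not have this parameter.

Version 2.8 does support unified checkpoint.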

wawltor commented 1 month ago

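> No. According to AI Studio's built-in monitor, GPU memory usage did not even reach half.

Does the CUDA version of the installed paddle meet the requirements?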

Yang-Changhui commented 1 month ago

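> > No. According to AI Studio's built-in monitor, GPU memory usage did not even reach half.
>
> Does the CUDA version of the installed paddle meet the requirements?

Yes, I am using paddlepaddle-gpu==0.0.0.post118.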

Yang-Changhui commented 1 month ago

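> > > During parameter merging, the parameters have to be merged in host (CPU) memory. You can enable unified checkpoint to avoid the parameter merging step.
> >
> > That seems to be usable only during training. If unified checkpoint is enabled, is merging no longer needed? Also, the configuration file in paddlenlp 2.8.0 does not have this parameter.
>
> Version 2.8 does support unified checkpoint.

In 2.8, the LoRA fine-tuning command is: python finetune_generation.py ./chatglm2/lora_argument.json, whereas using unified checkpoint requires: python run_pretrain.py ./chatglm2/lora_argument.json --unified_checkpoint 1. These two look like different training paths, don't they?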