Inference hangs after LoRA fine-tuning the 128k model. What is going on?

dazzlingCn commented 5 months ago

System Info

CUDA 11.7, T4 GPU, PyTorch version "1.11.0+cu113"

Who can help?

No response

Information

[x] The official example scripts
[ ] My own modified scripts

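Reproduction

I used the official fine-tuning script for LoRA fine-tuning:

1. Fine-tuning: python finetune_hf.py data/xdd/ THUDM/chatglm3-6b-128k configs/lora.yaml ran without errors and completed successfully.

2. Merging the model: python merge_model.py output/checkpoint-10000 THUDM/chatglm3-6b-128k-n2

3. Inference: at first, inference raised the error "AttributeError: can't set attribute 'eos_token'". Removing eos_token, pad_token, and unk_token from tokenizer_config.json fixed that, and the model then loaded and ran inference normally. Short prompts (under 20 Chinese characters) generally work. With longer prompts, inference appears to hang: there is still no result after half an hour, GPU memory usage is essentially maxed out (24G), and the usage sometimes fluctuates.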
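The manual workaround in step 3 (removing eos_token, pad_token, and unk_token from tokenizer_config.json) can be scripted. The following is a minimal sketch, not part of the official scripts: the function names and the .bak backup are illustrative assumptions, and the directory passed in is wherever the merged model was saved.

```python
import json
import shutil
from pathlib import Path

# Keys reported in this issue as triggering "can't set attribute 'eos_token'".
KEYS_TO_DROP = ("eos_token", "pad_token", "unk_token")


def strip_special_token_keys(config: dict) -> dict:
    """Return a copy of the tokenizer config without the problematic keys."""
    return {k: v for k, v in config.items() if k not in KEYS_TO_DROP}


def patch_tokenizer_config(model_dir: str) -> list:
    """Rewrite <model_dir>/tokenizer_config.json and return the removed keys.

    A .bak copy is written before the file is changed. The backup is an
    illustrative choice, not something the original workaround requires.
    """
    path = Path(model_dir) / "tokenizer_config.json"
    config = json.loads(path.read_text(encoding="utf-8"))
    removed = [k for k in KEYS_TO_DROP if k in config]
    if removed:
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        path.write_text(
            json.dumps(strip_special_token_keys(config), ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
    return removed
```

With the merge output from step 2, the call would be patch_tokenizer_config("THUDM/chatglm3-6b-128k-n2"). Running it a second time changes nothing and returns an empty list.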

Expected behavior

I would like inference to return a result within a dozen or so seconds.

zRzRzRzRzRzRzR commented 5 months ago

How long were the sequences you fine-tuned on? I have not run into this situation before. My initial suspicion is an overflow.

dazzlingCn commented 5 months ago

> How long were the sequences you fine-tuned on? I have not run into this situation before. My initial suspicion is an overflow.

All of my fine-tuning parameters are here:

    num_proc: 16
    max_input_length: 128
    max_output_length: 256
    training_args:
      # see transformers.Seq2SeqTrainingArguments
      output_dir: ./output
      max_steps: 10000
      # settings for data loading
      per_device_train_batch_size: 1
      dataloader_num_workers: 16
      remove_unused_columns: false
      # settings for saving checkpoints
      save_strategy: steps
      save_steps: 500
      # settings for logging
      log_level: info
      logging_strategy: steps
      logging_steps: 10
      # settings for evaluation
      per_device_eval_batch_size: 4
      evaluation_strategy: steps
      eval_steps: 500
      # settings for optimizer
      adam_epsilon: 1e-6
      # uncomment the following line to detect nan or inf values
      debug: underflow_overflow
      predict_with_generate: true
      # see transformers.GenerationConfig
      generation_config:
        max_new_tokens: 256
      # set your absolute deepspeed path here
      deepspeed: ds_zero_2.json
      # set to true if train with cpu.
      use_cpu: false
    peft_config:
      peft_type: LORA
      task_type: CAUSAL_LM
      r: 1
      lora_alpha: 2
      lora_dropout: 0.1

dazzlingCn commented 5 months ago

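> How long were the sequences you fine-tuned on? I have not run into this situation before. My initial suspicion is an overflow.

What kind of overflow do you mean? Is it a GPU memory overflow? What should I change?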
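If the suspected overflow is GPU memory, a back-of-the-envelope estimate of the KV cache can help judge whether prompt length alone could explain it. This is only a rough sketch: the default values (28 layers, 2 key/value head groups, head dimension 128, 2 bytes per value for fp16) are assumptions about the chatglm3-6b architecture and should be checked against the model's config.json. The estimate also ignores model weights, activations, and attention score matrices, which can dominate in practice.

```python
def kv_cache_bytes(seq_len, num_layers=28, num_kv_heads=2, head_dim=128,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV-cache size in bytes.

    Each layer stores one key tensor and one value tensor (hence the factor 2),
    each of shape [batch_size, num_kv_heads, seq_len, head_dim].
    The defaults are assumed values, not read from the model.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


def to_mib(n_bytes):
    """Convert bytes to mebibytes."""
    return n_bytes / (1024 ** 2)


if __name__ == "__main__":
    # Training length from the config (128 + 256), a ~2k prompt, and the full 128k window.
    for tokens in (384, 2048, 131072):
        print(f"{tokens:>7} tokens -> {to_mib(kv_cache_bytes(tokens)):.1f} MiB")
```

Under these assumptions the cache grows linearly with sequence length, so comparing the figure for your prompt length against the free memory left after loading the weights gives a first sanity check, nothing more.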

zRzRzRzRzRzRzR commented 5 months ago

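Yes, GPU memory overflow. Which GPU are you using? Try a shorter inference first and see whether it works normally. Does it start hanging at roughly 2k characters?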
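One way to answer the "at what length does it start hanging" question is to probe increasing prompt lengths with a time limit. The sketch below is an illustration under assumptions: generate_fn is a hypothetical callable that takes a prompt string and runs one inference, and the repeated filler character only approximates a real prompt. A timed-out call keeps running in its background thread and may still hold GPU memory, so the probe stops at the first stall.

```python
import threading
import time


def run_with_timeout(fn, arg, timeout_s):
    """Run fn(arg) in a daemon thread and return (status, elapsed_seconds).

    status is "ok", "error" (fn raised), or "timeout" (fn was still running
    when the time limit expired; the thread is NOT killed).
    """
    outcome = {"status": "ok"}
    done = threading.Event()

    def target():
        try:
            fn(arg)
        except Exception:
            outcome["status"] = "error"
        finally:
            done.set()

    start = time.perf_counter()
    threading.Thread(target=target, daemon=True).start()
    finished = done.wait(timeout_s)
    elapsed = time.perf_counter() - start
    return (outcome["status"] if finished else "timeout"), elapsed


def first_stalling_length(generate_fn, lengths, timeout_s, filler="测"):
    """Return the first prompt length that times out, or None if none do."""
    for n in sorted(lengths):
        status, elapsed = run_with_timeout(generate_fn, filler * n, timeout_s)
        print(f"prompt length {n}: {status} after {elapsed:.2f}s")
        if status == "timeout":
            return n
    return None
```

With a loaded model, generate_fn could be something like lambda p: model.chat(tokenizer, p, history=[]), following the chat call shown in the ChatGLM3 README. Treat the exact call as an assumption about your setup. For example, first_stalling_length(generate_fn, [10, 20, 50, 100, 500, 2000], timeout_s=60) would report the shortest tested length that does not finish within a minute.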

THUDM / ChatGLM3