Loss decreases very slowly during P-Tuning fine-tuning. What can I do?

niexufei commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

I am using the code under the ptuning directory with the AdvertiseGen training data. The parameters are set as follows: `PRE_SEQ_LEN=32`, `LR=1e-2`.

    CUDA_VISIBLE_DEVICES=0 nohup python -u main.py \
        --do_train \
        --train_file AdvertiseGen/train.json \
        --validation_file AdvertiseGen/dev.json \
        --prompt_column content \
        --response_column summary \
        --overwrite_cache \
        --model_name_or_path /home/chatGPT/model/chatGLMModel/chatGLMHuggingFace/chatglm-6b \
        --output_dir output/adgen-chatglm-6b-pt-$PRE_SEQ_LEN-$LR \
        --overwrite_output_dir \
        --max_source_length 64 \
        --max_target_length 64 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 1 \
        --gradient_accumulation_steps 16 \
        --predict_with_generate \
        --max_steps 6000 \
        --logging_steps 10 \
        --save_steps 1000 \
        --learning_rate $LR \
        --pre_seq_len $PRE_SEQ_LEN \
        --quantization_bit 4 >train.log 2>&1 &

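After training for 6000 steps, the results are as follows:

    {
      "epoch": 1.68,
      "train_loss": 4.087045756022135,
      "train_runtime": 389852.3508,
      "train_samples": 114599,
      "train_samples_per_second": 0.492,
      "train_steps_per_second": 0.015
    }

The loss is still very high, and when I load the model in web_demo, its answers are no longer normal.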
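As a sanity check, the reported metrics can be cross-checked against the command-line flags. The sketch below is plain arithmetic on the numbers quoted in this issue; it assumes a single GPU (from `CUDA_VISIBLE_DEVICES=0`) and that one optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` samples. It does not diagnose the high loss, it only confirms how much data and time the run covered.

```python
# Values copied from the training command and the reported metrics above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 16
num_gpus = 1  # assumption: only GPU 0 is visible (CUDA_VISIBLE_DEVICES=0)
max_steps = 6000
train_samples = 114599
train_runtime_s = 389852.3508

# Samples consumed per optimizer step, and over the whole run.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
samples_seen = effective_batch * max_steps

# Derived quantities to compare with the reported JSON.
epochs = samples_seen / train_samples
samples_per_second = samples_seen / train_runtime_s
steps_per_second = max_steps / train_runtime_s
seconds_per_step = train_runtime_s / max_steps
runtime_days = train_runtime_s / 86400

print(f"effective batch size : {effective_batch}")
print(f"samples seen         : {samples_seen}")
print(f"epochs               : {epochs:.2f}")
print(f"samples per second   : {samples_per_second:.3f}")
print(f"steps per second     : {steps_per_second:.3f}")
print(f"seconds per step     : {seconds_per_step:.1f}")
print(f"runtime in days      : {runtime_days:.2f}")
```

The derived epoch count and throughput round to the reported 1.68, 0.492, and 0.015, so the log is internally consistent. The run took roughly 65 seconds per optimizer step, about 4.5 days in total.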

Expected Behavior

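After fine-tuning completes, the loss should drop to a reasonable value, and the fine-tuned model should still be able to answer questions.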

Steps To Reproduce

Run `./train.sh`; the specific parameters are described above.

Environment

- OS: CentOS
- Python: 3.8
- Transformers: 4.28.0
- PyTorch: 1.13.0
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): True

Anything else?

No response

terminator123 commented 1 year ago

Have you managed to solve this problem?

zzy347964399 commented 1 year ago

Maybe try a larger batch size?
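For reference, the batch size the optimizer actually sees is the per-device batch multiplied by the gradient accumulation steps, so the run in this issue already uses 2 x 16 = 32 samples per step. The sketch below lists flag combinations that would reach a larger effective batch on one GPU. The target of 64 and the helper names are hypothetical and chosen only for illustration; nothing in this thread confirms that a larger batch fixes the slow loss decrease.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Samples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def accumulation_options(target, per_device_choices=(1, 2, 4, 8)):
    """Return (per_device_batch, grad_accum_steps) pairs that reach `target` on one device."""
    return [(b, target // b) for b in per_device_choices if target % b == 0]


# Settings from the issue: --per_device_train_batch_size 2 --gradient_accumulation_steps 16
current = effective_batch_size(2, 16)
print(f"current effective batch size: {current}")

# Hypothetical target of 64, i.e. double the current effective batch.
for per_device, accum in accumulation_options(64):
    print(f"--per_device_train_batch_size {per_device} "
          f"--gradient_accumulation_steps {accum}")
```

Raising `--per_device_train_batch_size` is bounded by GPU memory, while raising `--gradient_accumulation_steps` keeps memory roughly constant but makes each optimizer step take proportionally longer.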

zzy347964399 commented 1 year ago

I'd suggest looking at the loss curve in TensorBoard to see what is actually happening.
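If TensorBoard is not at hand, the same curve can usually be inspected from the checkpoint files. The sketch below assumes the training script uses the Hugging Face `Trainer`, which stores a `trainer_state.json` with a `log_history` list in each checkpoint directory. The checkpoint path is an assumption built from the `--output_dir` and `--save_steps` values in this issue, and `load_loss_curve`, `moving_average`, and `summarize` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_loss_curve(state_path):
    """Read (step, loss) pairs from the log_history of a Trainer state file."""
    state = json.loads(Path(state_path).read_text(encoding="utf-8"))
    return [(entry["step"], entry["loss"])
            for entry in state.get("log_history", [])
            if "loss" in entry]


def moving_average(values, window):
    """Trailing moving average; the first points average over fewer values."""
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


def summarize(curve, window=50):
    """Compare the smoothed loss at the start and at the end of training."""
    if not curve:
        raise ValueError("no loss entries found in log_history")
    smoothed = moving_average([loss for _, loss in curve], window)
    return {
        "first_step": curve[0][0],
        "last_step": curve[-1][0],
        "start_loss": smoothed[0],
        "end_loss": smoothed[-1],
    }


# Assumed location; adjust to the actual checkpoint directory.
state_file = Path("output/adgen-chatglm-6b-pt-32-1e-2/checkpoint-6000/trainer_state.json")
if state_file.exists():
    print(summarize(load_loss_curve(state_file)))
```

A smoothed curve that stays flat from the first steps points to a different problem than one that drops and then climbs again, so posting this summary (or the plot) makes the issue easier to diagnose.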

THUDM / ChatGLM-6B