Open niexufei opened 1 year ago
I am using the code under ptuning with the AdvertiseGen training data. The parameters are set as follows:
PRE_SEQ_LEN=32
LR=1e-2
CUDA_VISIBLE_DEVICES=0 nohup python -u main.py \
    --do_train \
    --train_file AdvertiseGen/train.json \
    --validation_file AdvertiseGen/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path /home/chatGPT/model/chatGLMModel/chatGLMHuggingFace/chatglm-6b \
    --output_dir output/adgen-chatglm-6b-pt-$PRE_SEQ_LEN-$LR \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 64 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --max_steps 6000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN \
    --quantization_bit 4 >train.log 2>&1 &
After training for 6000 steps, the result is as follows:
{
  "epoch": 1.68,
  "train_loss": 4.087045756022135,
  "train_runtime": 389852.3508,
  "train_samples": 114599,
  "train_samples_per_second": 0.492,
  "train_steps_per_second": 0.015
}
The loss is very high, and when I use web_demo, the model no longer answers questions properly.
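As a sanity check on the numbers above, the reported epoch count and throughput can be re-derived from the command-line flags. This is only a sketch of the arithmetic: the values are copied from the command and the training summary, and `num_devices = 1` is an assumption based on `CUDA_VISIBLE_DEVICES=0`.

```python
# Values copied from the training command and the reported summary.
per_device_train_batch_size = 2
gradient_accumulation_steps = 16
num_devices = 1  # assumption: a single GPU, since CUDA_VISIBLE_DEVICES=0
max_steps = 6000
train_samples = 114599
train_runtime = 389852.3508  # seconds

# Samples consumed per optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Total samples seen and the derived statistics.
samples_seen = effective_batch * max_steps
epochs = samples_seen / train_samples
samples_per_second = samples_seen / train_runtime
steps_per_second = max_steps / train_runtime
hours = train_runtime / 3600

print("effective batch size:", effective_batch)
print("epochs:", round(epochs, 2))
print("samples/s:", round(samples_per_second, 3))
print("steps/s:", round(steps_per_second, 3))
print("runtime (hours):", round(hours, 1))
```

The derived values match the summary (1.68 epochs, 0.492 samples/s, 0.015 steps/s), so the summary itself is internally consistent. The run took roughly 108 hours, and each optimizer step used 32 samples.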
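Expected behavior: after fine-tuning completes, the loss should have dropped to a reasonable value, and the fine-tuned model should be able to answer questions.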
Steps to reproduce: run ./train.sh with the parameters described above.
Environment:
- OS: centos
- Python: 3.8
- Transformers: 4.28.0
- PyTorch: 1.13.0
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): True
Have you solved this problem?
Maybe try a somewhat larger batch?
I suggest looking at the curves in TensorBoard to see what is actually happening.
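If TensorBoard is not at hand, a rough view of the loss curve can also be pulled from the `trainer_state.json` file that the Hugging Face `Trainer` writes, which stores the logged values under `log_history`. The sketch below is only an illustration: the helper names `extract_loss_curve` and `compare_start_end` are hypothetical, and the file path in the usage comment is an assumption about where the state file ends up for this run.

```python
import json
from pathlib import Path


def extract_loss_curve(state):
    """Return (step, loss) pairs from a Trainer state dict's log_history."""
    return [
        (entry["step"], entry["loss"])
        for entry in state.get("log_history", [])
        if "step" in entry and "loss" in entry
    ]


def compare_start_end(curve, window=10):
    """Mean loss over the first and the last `window` logged points."""
    if not curve:
        return None
    losses = [loss for _, loss in curve]
    head = losses[:window]
    tail = losses[-window:]
    return sum(head) / len(head), sum(tail) / len(tail)


def load_state(path):
    """Read a trainer_state.json file into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Usage (the path is an assumption, adjust it to the actual output_dir):
# state = load_state("output/adgen-chatglm-6b-pt-32-1e-2/checkpoint-6000/trainer_state.json")
# print(compare_start_end(extract_loss_curve(state)))
```

If the mean loss at the end is no lower than at the start, the run never really converged, which would point at the optimization settings rather than at web_demo.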