Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model (a low-resource Chinese llama + LoRA recipe, structured after alpaca)
https://github.com/Facico/Chinese-Vicuna
Apache License 2.0

Trained for 3 epochs on merge.json, but the results are much worse than checkpoint-final; no parameters were changed. Looking for guidance. #92

Closed xienan0326 closed 1 year ago

Facico commented 1 year ago

What is your hardware configuration? Does the loss look normal?

xienan0326 commented 1 year ago

> What is your hardware configuration? Does the loss look normal?

An A100 with 40 GB. The loss looks normal, as shown below:

{'loss': 0.9024, 'learning_rate': 3.563218390804597e-05, 'epoch': 0.89}
{'eval_loss': 0.8243093490600586, 'eval_runtime': 0.3851, 'eval_samples_per_second': 2.597, 'eval_steps_per_second': 2.597, 'epoch': 0.89}
{'loss': 0.9073, 'learning_rate': 3.333333333333333e-05, 'epoch': 0.89}
{'loss': 0.9076, 'learning_rate': 3.1034482758620685e-05, 'epoch': 0.9}
{'loss': 0.9148, 'learning_rate': 2.8735632183908045e-05, 'epoch': 0.91}
{'loss': 0.9093, 'learning_rate': 2.6436781609195398e-05, 'epoch': 0.91}
{'loss': 0.9006, 'learning_rate': 2.4137931034482755e-05, 'epoch': 0.92}
{'loss': 0.9077, 'learning_rate': 2.1839080459770115e-05, 'epoch': 0.93}
{'loss': 0.9055, 'learning_rate': 1.9540229885057468e-05, 'epoch': 0.94}
{'loss': 0.9076, 'learning_rate': 1.7241379310344825e-05, 'epoch': 0.94}
{'loss': 0.9092, 'learning_rate': 1.4942528735632182e-05, 'epoch': 0.95}
{'loss': 0.9058, 'learning_rate': 1.264367816091954e-05, 'epoch': 0.96}
{'eval_loss': 0.8262965679168701, 'eval_runtime': 0.3839, 'eval_samples_per_second': 2.605, 'eval_steps_per_second': 2.605, 'epoch': 0.96}
{'loss': 0.9008, 'learning_rate': 1.0344827586206895e-05, 'epoch': 0.97}
{'loss': 0.9061, 'learning_rate': 8.045977011494252e-06, 'epoch': 0.97}
{'loss': 0.9065, 'learning_rate': 5.747126436781608e-06, 'epoch': 0.98}
{'loss': 0.9006, 'learning_rate': 3.448275862068965e-06, 'epoch': 0.99}
{'loss': 0.9073, 'learning_rate': 1.1494252873563217e-06, 'epoch': 1.0}

xienan0326 commented 1 year ago

> What is your hardware configuration? Does the loss look normal?

I loaded the model from checkpoint-2600 and used it for prediction. [image]

xienan0326 commented 1 year ago

> What is your hardware configuration? Does the loss look normal?

finetune.sh: [image]

Facico commented 1 year ago

The parameters are the same as ours. With 3 epochs the final step should be around checkpoint-17200. Could you check whether your merge.json data has any problems? Also, is your batch size 128?
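For reference, the 17200 figure follows from the dataset size, the epoch count, and the effective batch size. A minimal sketch of that arithmetic, assuming merge.json holds roughly 730k samples (an assumption for illustration, not a number stated in this thread) and an effective batch size of 128:

```python
# Rough sanity check of the expected optimizer step count.
# The ~730k sample count is an assumption, not a figure from this thread.
num_samples = 730_000          # approximate size of merge.json
epochs = 3
effective_batch_size = 128     # MICRO_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS

steps_per_epoch = num_samples // effective_batch_size
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)   # ~5703 steps/epoch, ~17109 total, close to checkpoint-17200
```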

xienan0326 commented 1 year ago

> The parameters are the same as ours. With 3 epochs the final step should be around checkpoint-17200. Could you check whether your merge.json data has any problems? Also, is your batch size 128?

merge.json is from this link: https://pan.baidu.com/s/1WSxuhSAotl14ifaAiz5eKw?pwd=b4kb password: b4kb [image]

xienan0326 commented 1 year ago

> The parameters are the same as ours. With 3 epochs the final step should be around checkpoint-17200. Could you check whether your merge.json data has any problems? Also, is your batch size 128?

Hello, could you tell me which datasets all_data refers to? The final version performs quite well in my tests, and I would like to reproduce it to learn from it, please. [image]

Facico commented 1 year ago

all_data is the belle 0.5M data plus the guanaco data. Actually, your problem looks more like the model simply hasn't been trained enough: our checkpoint-final is at step 17200, while what you tested above is at step 2600.
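As an aside, if you want to rebuild merge.json yourself from those two sources, a minimal sketch is below. The input file names and the alpaca-style instruction/input/output layout are assumptions for illustration, not the repo's exact preprocessing code.

```python
import json

# Sketch: concatenate two alpaca-style instruction datasets into merge.json.
# The input file names are hypothetical; point them at wherever the
# belle 0.5M and guanaco JSON files live on disk.
def load_json(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

belle = load_json("belle_0.5M.json")    # hypothetical file name
guanaco = load_json("guanaco.json")     # hypothetical file name

merged = belle + guanaco                # both are lists of {instruction, input, output} dicts
with open("merge.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, ensure_ascii=False, indent=2)

print(f"merged {len(merged)} samples")
```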

xienan0326 commented 1 year ago

> The parameters are the same as ours. With 3 epochs the final step should be around checkpoint-17200. Could you check whether your merge.json data has any problems? Also, is your batch size 128?

These are the parameters in finetune.py. One question: how long did the 3 epochs take you? [image] [image]

grantchenhuarong commented 1 year ago

A Chinese Vicuna based on the 32-layer LLaMA-7B network, with a corpus of 430k instruction question-answer pairs that I put together. Starting from the checkpoint-11600 LoRA model, I am training on a 2080 Ti with 11 GB of VRAM. It uses about 10 GB of VRAM and takes roughly 75 seconds per step, and 17298 - 11600 = 5698 steps remain, so training should finish in about 5 days.
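A quick check of that estimate, using only the numbers quoted above:

```python
# Sanity check of the time estimate above, using the numbers from the comment.
remaining_steps = 17298 - 11600      # 5698 steps left to reach checkpoint-final
seconds_per_step = 75                # ~75 s per step on a 2080 Ti (11 GB)
total_days = remaining_steps * seconds_per_step / 86_400
print(round(total_days, 1))          # ~4.9 days, i.e. roughly 5 days
```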

MICRO_BATCH_SIZE = 8
BATCH_SIZE = 256
MAX_STEPS = None
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
EPOCHS = 3  # we don't always need 3 tbh
LEARNING_RATE = 3e-4  # the Karpathy constant
CUTOFF_LEN = 256  # 256 accounts for about 96% of the data
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
VAL_SET_SIZE = args.test_size  # 1000
TARGET_MODULES = [
    "q_proj",
    "v_proj",
]
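For readers trying to map these constants onto code, here is a minimal sketch of how hyperparameters like these are typically passed to peft's LoraConfig and transformers' TrainingArguments. It is an illustration under those assumptions, not the repo's actual finetune.py; the output path is hypothetical.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Sketch: how the constants above typically map onto peft / transformers objects.
lora_config = LoraConfig(
    r=8,                                  # LORA_R
    lora_alpha=16,                        # LORA_ALPHA
    lora_dropout=0.05,                    # LORA_DROPOUT
    target_modules=["q_proj", "v_proj"],  # TARGET_MODULES
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="lora-out",                 # hypothetical output path
    per_device_train_batch_size=8,         # MICRO_BATCH_SIZE
    gradient_accumulation_steps=256 // 8,  # BATCH_SIZE // MICRO_BATCH_SIZE
    num_train_epochs=3,                    # EPOCHS
    learning_rate=3e-4,                    # LEARNING_RATE
)
```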