TencentARC / LLaMA-Pro

[ACL 2024] Progressive LLaMA with Block Expansion.
https://tencentarc.github.io/LLaMA-Pro/
Apache License 2.0

Thanks for the wonderful project! Why do I always get an apparent loss of the original ability? #25

Open hzgdeerHo opened 2 months ago

hzgdeerHo commented 2 months ago

After fine-tuning llama-3-8B-instruct with the same configuration as the code from https://github.com/hiyouga/LLaMA-Factory/tree/3df986c6793a51ec2cb5f31fd1808cd3a9883bc4/examples/extras/llama_pro, I always get an apparent loss of the original ability. I only used the "Identity" training dataset. Can you help? Thanks!
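
For context, my understanding is that the llama_pro recipe should only train the newly inserted blocks and keep every original weight frozen. Below is a minimal sketch of that freeze step; the helper name and indices are illustrative, not my exact LLaMA-Factory config:

```python
# Minimal sketch of the freeze step: only parameters inside the newly added
# blocks stay trainable; all original weights (plus embeddings and the head)
# are frozen. `expanded_indices` is illustrative and must match the positions
# of the inserted blocks in the expanded model.
def freeze_original_blocks(model, expanded_indices):
    for name, param in model.named_parameters():
        param.requires_grad = any(f".layers.{i}." in name for i in expanded_indices)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable / total:.1%} of all parameters")
```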

hzgdeerHo commented 2 months ago

The final training loss is about 0.1-0.05, so I think it might not be caused by overfitting?

hills-code commented 2 months ago

Hi! Have you tried directly fine-tuning llama-3-8B-instruct? What happens in that setting? I did not run experiments with llama-3, so I may not be very familiar with its behavior. I think you can also try changing the position of the added blocks. The recent Yi tech report and some llama3-120B models suggest that keeping the first few layers fixed may be important. Hope this is helpful!
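
To be concrete, here is a rough sketch (not the official expansion script) of how I would move the added blocks so the first few layers stay untouched. Each copied block has its output projections zeroed so it starts as an identity mapping; the block counts and positions below are examples only:

```python
# Rough sketch of LLaMA-Pro-style block expansion that skips the first few
# layers. Copied blocks have o_proj/down_proj zeroed so they are identity
# mappings at initialization; only these new blocks would then be trained.
import copy
import torch
from transformers import AutoModelForCausalLM

def expand_blocks(model, num_new_blocks=8, skip_first=4):
    layers = model.model.layers
    stride = max(1, (len(layers) - skip_first) // num_new_blocks)
    expanded, added = torch.nn.ModuleList(), 0
    for i, layer in enumerate(layers):
        expanded.append(layer)
        if i >= skip_first and (i - skip_first + 1) % stride == 0 and added < num_new_blocks:
            block = copy.deepcopy(layer)
            torch.nn.init.zeros_(block.self_attn.o_proj.weight)  # attention output -> 0
            torch.nn.init.zeros_(block.mlp.down_proj.weight)     # MLP output -> 0
            expanded.append(block)
            added += 1
    model.model.layers = expanded
    model.config.num_hidden_layers = len(expanded)
    # NOTE: recent transformers versions track self_attn.layer_idx for the KV
    # cache; re-numbering the layers may be needed before generation.
    return model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16
)
model = expand_blocks(model, num_new_blocks=8, skip_first=4)
```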

hzgdeerHo commented 2 months ago

OK, thanks! Could you give me some links as references to help me figure out the problem?

hills-code commented 2 months ago

Certainly! Here is the link to Yi-9B (https://huggingface.co/01-ai/Yi-9B) and its tech report (https://arxiv.org/pdf/2403.04652); you can find the depth-upscaling figure in Sec. 7.3. There is also LLaMa3-120B: https://huggingface.co/alpindale/goliath-120b

hzgdeerHo commented 2 months ago

Thanks!

hzgdeerHo commented 2 months ago

I have posted a new issue: https://github.com/hiyouga/LLaMA-Factory/issues/3811. Could you please help explain it? Thanks!

hiyouga commented 2 months ago

Training on a small dataset for many epochs can easily lead to overfitting.
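
For example (values are illustrative, not an official recommendation): keep the epoch count low and stop on a held-out eval split instead of driving the training loss toward zero.

```python
# Illustrative settings only: fewer epochs, a modest learning rate, and early
# stopping on a held-out eval split rather than chasing a near-zero train loss.
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="out-llama3-pro-identity",  # placeholder path
    num_train_epochs=1,                    # small dataset: avoid many epochs
    learning_rate=1e-5,
    evaluation_strategy="steps",
    eval_steps=20,
    save_strategy="steps",
    save_steps=20,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
# Pass both to Trainer(..., args=args, callbacks=[early_stop]) together with a
# held-out eval_dataset so eval_loss is actually computed.
```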

hzgdeerHo commented 2 months ago

Thanks!