Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model —— a low-resource Chinese llama+lora approach, with the structure modeled on alpaca
https://github.com/Facico/Chinese-Vicuna
Apache License 2.0
4.14k stars 425 forks

Not an issue but a question for going forwards #227

Open thusinh1969 opened 1 year ago

thusinh1969 commented 1 year ago

Hi,

I found that this repo focuses ONLY on fine-tuning (with LoRA) for the Chinese language. However, LLaMA was trained mostly on an English corpus, with a vocabulary of only about 32,000 tokens, which is VERY small and heavily English-focused.

How would you describe the quality / perplexity of the result (7B or 13B) with LoRA alone, without expanding the Chinese vocabulary before fine-tuning? Would you suggest that full fine-tuning, or LoRA fine-tuning on a large (non-instruct) corpus, is a better way to go?

I am about to train LLaMA for Vietnamese, hence would like to know more about your experience. I am also referring to https://github.com/ymcui/Chinese-LLaMA-Alpaca, which says that vocabulary expansion plus LoRA pre-training on a large corpus should be done first, so I am a bit confused.
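For context, this is the kind of quick check I was planning to run (my own sketch, not from either repo) to see how efficiently the stock LLaMA tokenizer handles Vietnamese compared to English; the model path is just a placeholder:

```python
# Rough check of token counts with the unmodified LLaMA tokenizer.
# Requires `transformers` and `sentencepiece`; the model path is a placeholder.
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-7b-hf")  # placeholder path

samples = {
    "English": "How are you today?",
    "Vietnamese": "Hôm nay bạn thế nào?",
    "Chinese": "你今天怎么样?",
}
for lang, text in samples.items():
    ids = tokenizer.encode(text, add_special_tokens=False)
    print(f"{lang}: {len(ids)} tokens -> {tokenizer.convert_ids_to_tokens(ids)}")
# If non-English text consistently needs several times more tokens per word,
# that is the gap vocabulary expansion is meant to close.
```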

Thanks for any input. Steve

Facico commented 1 year ago

Here is a similar issue: #12

Thank you for your interest in our project. LLaMA is a multilingual model and does have some proficiency in Chinese. Given the lack of a strong Chinese base model, we chose LLaMA as the foundation.

Given sufficient hardware resources, full fine-tuning would certainly yield better results than LoRA, as with FastChat's Vicuna.
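For reference, here is a minimal sketch of the LoRA setup we mean, using the `peft` library; it is not taken from this repo's training script, and the hyperparameters are only illustrative:

```python
# Minimal LoRA fine-tuning setup with peft; values are illustrative, not the
# exact configuration used in this repo.
from transformers import LlamaForCausalLM
from peft import LoraConfig, get_peft_model

base_model = LlamaForCausalLM.from_pretrained("path/to/llama-7b-hf")  # placeholder path

lora_config = LoraConfig(
    r=8,                                   # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are updated
# Full fine-tuning would instead update all parameters of `base_model`,
# which needs far more GPU memory but removes the adapter bottleneck.
```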

Chinese-LLaMA-Alpaca's approach of expanding the vocabulary also requires extensive pretraining, which is feasible if the hardware conditions are adequate. LLaMA's tokenizer can already encode many Chinese characters (its SentencePiece model falls back to byte-level pieces), but relatively few of them map to a single token, hence the need for vocabulary expansion.
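To illustrate what vocabulary expansion involves, here is a rough sketch (my own, not Chinese-LLaMA-Alpaca's actual script) of merging pieces from a separately trained SentencePiece model into LLaMA's tokenizer; the paths are placeholders, and the model's embedding matrix must be resized afterwards:

```python
# Sketch of merging new SentencePiece pieces (e.g. trained on a Chinese or
# Vietnamese corpus) into LLaMA's tokenizer; paths are placeholders.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaTokenizer

llama_tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-7b-hf")

llama_proto = sp_pb2.ModelProto()
llama_proto.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())

new_proto = sp_pb2.ModelProto()
new_proto.ParseFromString(open("path/to/new_language_sp.model", "rb").read())

# Add only pieces that the LLaMA tokenizer does not already contain.
existing = {p.piece for p in llama_proto.pieces}
for p in new_proto.pieces:
    if p.piece not in existing:
        llama_proto.pieces.append(
            sp_pb2.ModelProto.SentencePiece(piece=p.piece, score=0.0)
        )

with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
# After loading the merged tokenizer, call model.resize_token_embeddings(len(tokenizer))
# and continue pretraining so the new embeddings become meaningful.
```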