AndrewZhe / lawyer-llama

Chinese Legal LLaMA (LLaMA for the Chinese legal domain)
Apache License 2.0

On the choice not to expand the Chinese vocabulary #16

Open bytes-lost opened 1 year ago

bytes-lost commented 1 year ago

Section 3.1 of the paper says:

To improve the decoding efficiency of Chinese sentences, Cui et al. (2023) expand the vocabulary by adding common Chinese characters and re-training these newly added word embeddings along with the model parameters. However, our prior study shows that expanding the vocabulary does not seem to bring further improvement on downstream Chinese NLU tasks. We therefore choose to keep LLaMA’s vocabulary unchanged during the training.

Could you give a more detailed description of this prior study?

AndrewZhe commented 1 year ago

We ran tests on several Chinese NLU tasks. Compared with whether or not the vocabulary is expanded, the number of training tokens likely has a significantly larger impact.

Heepo commented 1 year ago

> We ran tests on several Chinese NLU tasks. Compared with whether or not the vocabulary is expanded, the number of training tokens likely has a significantly larger impact.

Same question here, and a follow-up: since the token count matters more, that means we should train on more tokens, right? But without a Chinese vocabulary, isn't training very inefficient? After all, for the same amount of Chinese text, the token count without vocabulary expansion is at least three times higher.
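The "at least three times" estimate above can be illustrated with a small sketch (my own example, not from the thread): LLaMA's SentencePiece vocabulary covers relatively few Chinese characters, and out-of-vocabulary characters fall back to UTF-8 bytes, where every CJK character occupies exactly 3 bytes. So a Chinese character not in the vocabulary costs roughly 3 byte-level tokens instead of 1.

```python
# Illustrative sketch (assumption: out-of-vocab Chinese characters are
# tokenized via UTF-8 byte fallback, as in LLaMA's SentencePiece setup).
# Each CJK character is 3 bytes in UTF-8, hence ~3 tokens per character.
text = "中文法律"  # 4 Chinese characters

n_chars = len(text)
n_bytes = len(text.encode("utf-8"))  # upper bound on byte-fallback tokens

print(n_chars)            # 4 characters
print(n_bytes)            # 12 bytes -> ~12 byte-fallback tokens
print(n_bytes / n_chars)  # 3.0 tokens per character
```

An expanded vocabulary that includes these characters (or common multi-character words) would encode the same text in 4 or fewer tokens, which is the decoding-efficiency argument cited in the paper quote above.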