DachengLi1 / LongChat

Official repository for LongChat and LongEval
Apache License 2.0

unsupervised pre-training on the model #2

Closed wqn1 closed 1 year ago

wqn1 commented 1 year ago

Before fine-tuning the model, did you perform unsupervised pre-training on the model? Can you provide the script for unsupervised pre-training and the required training resources?

ahkimkoo commented 1 year ago

Does the release of this project mean that vicuna-13b-16k is coming out soon?

DachengLi1 commented 1 year ago

@wqn1 No, we simply apply the condensing RoPE technique, which is a monkey patch in the code. Then we just fine-tune on conversation data.
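
For reference, and only as an illustration rather than the exact monkey patch used in this repo, condensing RoPE can be sketched as position interpolation: position indices are divided by a ratio (e.g. 8 for 2K -> 16K) before the rotary angles are computed. The function names and the `ratio` parameter below are hypothetical:

```python
import torch

def condensed_rope_freqs(seq_len, dim, ratio=8.0, base=10000.0):
    # Standard RoPE inverse frequencies for each channel pair.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Condensing step: scale positions down so a long context fits into the
    # position range the base model saw during pre-training.
    positions = torch.arange(seq_len).float() / ratio
    freqs = torch.outer(positions, inv_freq)          # (seq_len, dim/2)
    return torch.cos(freqs), torch.sin(freqs)

def apply_rope(q, cos, sin):
    # q: (seq_len, dim). Treat consecutive channel pairs as 2-D vectors and
    # rotate each pair by its position-dependent angle.
    q1, q2 = q[..., 0::2], q[..., 1::2]               # (seq_len, dim/2) each
    rotated = torch.stack([q1 * cos - q2 * sin,
                           q1 * sin + q2 * cos], dim=-1)
    return rotated.flatten(-2)                        # back to (seq_len, dim)
```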

The pre-training directory was there for a historical reason. We tried some unsupervised pre-training before without the condensing RoPE technique, but realized that it is not needed anymore. I will probably refactor that directory out. Thanks for the question!

DachengLi1 commented 1 year ago

@ahkimkoo Vicuna is another model targeting a 2K sequence length, and LongChat is our new series of models targeting longer sequence lengths. Basically, Vicuna = LLaMA-2K + ShareGPT, and LongChat = LLaMA-nK + ShareGPT. So there won't be a Vicuna-13b-16k, but LongChat-13b-16k is conceptually the same as a "Vicuna-13b-16k".

wqn1 commented 1 year ago

@DachengLi1 Thanks for your answer. I also have a follow-up question. If I want to train a long-context chat model for a specific domain, but I don't have much supervised dialogue data available, would unsupervised pre-training help improve the model's performance on domain-specific question answering? Could you give me some advice on this?

DachengLi1 commented 1 year ago

@wqn1 I think there are two things to adapt: (1) context length, 2K -> nK, and (2) general -> domain-specific. Right now I just throw 18K conversations in, and LLaMA adapts quite well to both with this many conversations. If you have less conversation data, which is really only a problem for (2), I think mixing in some other long-context data (e.g. books) to help LLaMA learn (1) should work. But you could probably start by checking whether LLaMA already does well with the amount of data you have.
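
For what it's worth, a minimal sketch of that data-mixing idea (the file names, record format, and the 4x oversampling ratio below are assumptions for illustration, not something from this repo):

```python
import json
import random

def load_jsonl(path):
    # Each line is one training example (e.g. a conversation or a long text chunk).
    with open(path) as f:
        return [json.loads(line) for line in f]

domain_convs = load_jsonl("domain_conversations.jsonl")  # few, domain-specific
book_chunks = load_jsonl("book_chunks_16k.jsonl")        # many, generic long-context

# Oversample the scarce domain data so it is not drowned out, while the book
# chunks carry most of the long-context signal for objective (1).
mixed = book_chunks + domain_convs * 4
random.shuffle(mixed)

with open("mixed_train.jsonl", "w") as f:
    for example in mixed:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```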