artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License

How to pretrain "raw" text? #205

Open SinanAkkoyun opened 1 year ago

SinanAkkoyun commented 1 year ago

Hi! I would like to use QLoRA to "pretrain" a model and wanted to ask if that is possible. Around the release of qlora I heard that a 'raw' mode did not exist yet.

For example, say I had a big dataset in the style of 'the pile' but in another language. How can I pretrain a llama model on it without constructing complete prompt-response pairs? Or is QLoRA only designed for full prompt-response pairs?

I am looking forward to any help!

nerusskikh commented 1 year ago

Hi, I've run into the same task. Are there any suggestions on how to approach it?

artidoro commented 1 year ago

As you point out, pretraining and finetuning are similar concepts. In fact, the way we load the Guanaco Open Assistant dataset is similar to how you would load an unlabeled dataset: just leave the input field blank and put your unlabeled data directly in the output field of the dataset. You will also need to adjust the number of tokens you accept in the source/target.
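To illustrate the suggestion above, here is a minimal sketch of converting a raw-text corpus into a JSON dataset with an empty input field and the unlabeled text in the output field. The file name and the example documents are placeholders, and the exact field names expected depend on the dataset format you configure in your training script:

```python
import json

# Placeholder raw-text corpus; in practice this would be your
# Pile-style dataset in another language.
raw_documents = [
    "First raw document of unlabeled text ...",
    "Second raw document of unlabeled text ...",
]

# Leave "input" empty and put the unlabeled text in "output",
# mirroring the structure of a labeled instruction dataset.
records = [{"input": "", "output": doc} for doc in raw_documents]

with open("raw_corpus.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```

The resulting file can then be loaded the same way a labeled dataset would be; the model simply trains on the output text with no conditioning input.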

SinanAkkoyun commented 1 year ago

Oh, so I could for example just provide data like 'the stack' in the output only? Would that be computationally equivalent to randomly splitting a 'page' of data into input and output multiple times? (In other words: is the input/output split computationally irrelevant, in the sense that putting unlabeled data in the output is the same as mixing and matching input and output?)

Thank you very much for your answer :)

nerusskikh commented 1 year ago

@artidoro Thank you very much for the clarification! If I understood correctly, we should put the raw text solely in the "output" field of the JSON. That pretty much means no system command and no context (input) is provided to the LLM, which is the same as plain causal language model training. We should, however, take care of 'pagifying' the data so each example fits the model's context length.
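One way to handle the 'pagification' mentioned above is to pack each tokenized document into fixed-length windows before writing it into the output field. A minimal sketch (the function name and the `stride` parameter are illustrative, not part of the qlora codebase):

```python
def chunk_tokens(token_ids, max_len, stride=None):
    """Split a long token-id sequence into chunks of at most max_len.

    A stride smaller than max_len produces overlapping windows, so no
    token is only ever seen at a chunk boundary; the default is
    non-overlapping chunks (stride == max_len).
    """
    if stride is None:
        stride = max_len
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunk = token_ids[start:start + max_len]
        if chunk:
            chunks.append(chunk)
    return chunks


# Example: a 10-token document split into windows of 4 tokens.
pages = chunk_tokens(list(range(10)), max_len=4)
# pages == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each chunk would then be decoded back to text (or kept as token ids, depending on your pipeline) and stored as one training example in the "output" field.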