SinanAkkoyun opened this issue 1 year ago
Hi, I've run into the same task. Are there any suggestions on how to approach it?
As you point out, pretraining and finetuning are similar concepts. In fact, the way we load the Guanaco Open Assistant dataset is similar to how you would load an unlabeled dataset: just leave the input field blank and put your unlabeled data directly in the output field of the dataset. You will also need to adjust the number of tokens you accept in the source/target.
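A minimal sketch of what that dataset could look like, assuming an Alpaca-style JSON layout with "input"/"output" fields as described above (the file name and example texts are made up):

```python
import json

# Hypothetical raw corpus documents (unlabeled text).
raw_documents = [
    "First raw document from the corpus...",
    "Second raw document from the corpus...",
]

# "input" is left blank; the raw text goes directly into "output".
records = [{"input": "", "output": doc} for doc in raw_documents]

with open("unlabeled_dataset.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```

The training loop then sees an empty source and learns to model the output text causally, which is effectively plain language-model training.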
Oh, so I could, for example, just provide data like The Stack in the output field only? Would that be computationally the same as randomly splitting a 'page' of data into input and output multiple times? (In other words: is the input/output split computationally irrelevant, in the sense that putting unlabeled data in the output is equivalent to mixing and matching input/output?)
Thank you very much for your answer :)
@artidoro Thank you very much for the clarification! If I understood everything correctly, we should put the raw text solely in the "output" field of the JSON. That pretty much means no system command is provided to the LLM and there is no context (input), which is the same as plain causal language model training. Though we should still take care of the 'pagification' of the data.
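The 'pagification' step could be sketched as follows. This is only an illustration: a real pipeline should count model tokens (e.g. with the LLaMA tokenizer) rather than whitespace-split words, and the function name, chunk size, and overlap are all made-up choices, not anything from the qlora codebase:

```python
def pagify(text: str, max_tokens: int = 512, overlap: int = 32):
    """Split a long raw text into fixed-size overlapping 'pages'.

    Whitespace words are used as a stand-in for real tokenizer
    tokens; the overlap gives each page some shared context with
    the previous one.
    """
    words = text.split()
    step = max_tokens - overlap
    pages = []
    for start in range(0, len(words), step):
        pages.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return pages

# Each page then becomes one record with a blank "input" field.
long_text = "word " * 2000  # placeholder corpus
records = [{"input": "", "output": page} for page in pagify(long_text)]
```

Replacing the word count with a `len(tokenizer(chunk)["input_ids"])` check would make the page sizes match the source/target token limits mentioned above.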
Hi! I would like to use QLoRA to "pretrain" a model and wanted to ask whether that is possible. Around the time QLoRA was released, I heard something about a 'raw' text mode not existing yet.
For example, let's say I had a big dataset in the style of The Pile but in another language. How can I pretrain a LLaMA model with that without constructing complete prompt-response pairs? Or is QLoRA designed only for full prompt-response pairs?
I am very much looking forward to any help!