leoplusx opened 1 year ago
Might not be this simple, but you could try just feeding in your samples as `{'input': '', 'output': '<the text>'}`. So basically "given nothing, predict the whole thing". However, if your samples are longer than the sequence length, they'll be truncated rather than windowed by this code.
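A minimal sketch of that conversion, assuming an alpaca-style record shape with `input`/`output` keys (adapt the field names to whatever your copy of `make_data_module` actually expects):

```python
def to_completion_format(texts):
    """Wrap raw documents so the model is asked to predict the whole text
    given an empty prompt ("given nothing, predict the whole thing")."""
    return [{"input": "", "output": t} for t in texts]

samples = to_completion_format(["First domain document.", "Second one."])
print(samples[0])  # {'input': '', 'output': 'First domain document.'}
```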
So you'd need to either make sure all your samples tokenise to no more than (and ideally exactly) `self.target_max_len`, or add windowing yourself. Since you're going to need to modify `make_data_module` anyhow, you could do it there.
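One way to do the windowing inside the data module: tokenise each document once, then slice the token ids into fixed-size chunks. This is a generic sketch, not qlora's code; it operates on a plain list of token ids, so you would call your model's tokenizer first and `decode` (or keep the ids) per window:

```python
def window_tokens(token_ids, max_len, stride=None):
    """Split a token id sequence into windows of at most max_len tokens.

    With the default stride == max_len the windows are non-overlapping;
    a smaller stride gives overlapping windows, which can help the model
    see context that would otherwise be cut at chunk boundaries.
    """
    stride = stride or max_len
    return [token_ids[i:i + max_len]
            for i in range(0, len(token_ids), stride)]

# Illustration with dummy token ids; in practice these come from the tokenizer.
chunks = window_tokens(list(range(10)), max_len=4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```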
As for `do_mmlu_eval`, you could just disable that with `--do_mmlu_eval=False`, right?
I tested this and it seems to work. At least eval perplexity goes down over time, and when I load the LoRA in textgen the results look OK at first glance. Make sure to use `CUDA_VISIBLE_DEVICES` to force single-GPU; I get device-side assertions otherwise, although this was also true with just `load_in_8bit`, and I think it might be a current bug in bitsandbytes.
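Putting those two tips together, an invocation might look like this (the dataset path and any other hyperparameters are placeholders, not taken from the repo):

```shell
# Pin the run to a single GPU and skip the MMLU callback.
CUDA_VISIBLE_DEVICES=0 python qlora.py \
    --dataset my_domain_corpus.json \
    --do_mmlu_eval False
```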
I'd like to fine-tune using unlabelled data, i.e. causal language modeling, for instance to adapt a model to a new domain or language.
Which parts of the training code need to be changed to use such a data source?
From what I can tell, it would probably be these:

- `DataCollatorForCausalLM` (perhaps use `DataCollatorForLanguageModeling` from transformers)
- `make_data_module()`
- `MMLUEvalCallback`
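For the collator piece: transformers' `DataCollatorForLanguageModeling` with `mlm=False` essentially pads the batch and copies `input_ids` into `labels`, masking pad positions to `-100` so the loss ignores them. A dependency-free sketch of that idea (lists instead of tensors, and the real collator also handles tokenizer specifics):

```python
def causal_lm_collate(batch, pad_id):
    """Pad a batch of token id lists for causal LM training.

    Labels are a copy of input_ids with padding replaced by -100,
    the ignore index used by the cross-entropy loss.
    """
    max_len = max(len(ids) for ids in batch)
    input_ids, labels, attention_mask = [], [], []
    for ids in batch:
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad)
        labels.append(ids + [-100] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids,
            "labels": labels,
            "attention_mask": attention_mask}
```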
Is that correct? Anything else?
Is there perhaps code from this or another repo that I can use?
Thanks!
Edit: Replaced "masked language modeling" with "causal language modeling".