leoplusx opened 1 year ago
Might not be this simple, but you could try just feeding in your samples as `{'input': '', 'output': '<the text>'}`. So basically "given nothing, predict the whole thing". However, if your samples are longer than the sequence length, they'll be truncated rather than windowed by this code.
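A minimal sketch of that conversion, assuming an alpaca-style record shape with `input`/`output` keys (adapt the field names to whatever your copy of `make_data_module` actually expects):

```python
def to_completion_format(texts):
    """Wrap raw documents so the model is asked to predict the whole text
    given an empty prompt ("given nothing, predict the whole thing")."""
    return [{"input": "", "output": t} for t in texts]

samples = to_completion_format(["First domain document.", "Second one."])
print(samples[0])  # {'input': '', 'output': 'First domain document.'}
```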
So you'd need to either make sure all your samples tokenise to no more than (and ideally exactly) `self.target_max_len`, or add windowing yourself. Since you're going to need to modify `make_data_module` anyhow, you could do it there.
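One way to do the windowing inside the data module: tokenise each document once, then slice the token ids into fixed-size chunks. This is a generic sketch, not qlora's code; it operates on a plain list of token ids, so you would call your model's tokenizer first and `decode` (or keep the ids) per window:

```python
def window_tokens(token_ids, max_len, stride=None):
    """Split a token id sequence into windows of at most max_len tokens.

    With the default stride == max_len the windows are non-overlapping;
    a smaller stride gives overlapping windows, which can help the model
    see context that would otherwise be cut at chunk boundaries.
    """
    stride = stride or max_len
    return [token_ids[i:i + max_len]
            for i in range(0, len(token_ids), stride)]

# Illustration with dummy token ids; in practice these come from the tokenizer.
chunks = window_tokens(list(range(10)), max_len=4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```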
As for `do_mmlu_eval`, you could just disable that with `--do_mmlu_eval=False`, right?
I tested this and it seems to work. At least eval perplexity goes down over time, and when I load the LoRA in textgen the results look OK at first glance. Make sure to use `CUDA_VISIBLE_DEVICES` to force single-GPU; I get device-side assertions otherwise, although this was also true with just `load_in_8bit`, and I think it might be a current bug in bitsandbytes.
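Putting those two tips together, an invocation might look like this (the dataset path and any other hyperparameters are placeholders, not taken from the repo):

```shell
# Pin the run to a single GPU and skip the MMLU callback.
CUDA_VISIBLE_DEVICES=0 python qlora.py \
    --dataset my_domain_corpus.json \
    --do_mmlu_eval False
```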
I'd like to fine-tune using unlabelled data, i.e. causal language modeling, for instance to adapt a model to a new domain or language.
Which parts of the training code need to be changed to use such a data source?
From what I can tell, it would probably be these:

- `DataCollatorForCausalLM` (perhaps use `DataCollatorForLanguageModeling` from transformers)
- `make_data_module()`
- `MMLUEvalCallback`
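For the collator piece: transformers' `DataCollatorForLanguageModeling` with `mlm=False` essentially pads the batch and copies `input_ids` into `labels`, masking pad positions to `-100` so the loss ignores them. A dependency-free sketch of that idea (lists instead of tensors, and the real collator also handles tokenizer specifics):

```python
def causal_lm_collate(batch, pad_id):
    """Pad a batch of token id lists for causal LM training.

    Labels are a copy of input_ids with padding replaced by -100,
    the ignore index used by the cross-entropy loss.
    """
    max_len = max(len(ids) for ids in batch)
    input_ids, labels, attention_mask = [], [], []
    for ids in batch:
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad)
        labels.append(ids + [-100] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids,
            "labels": labels,
            "attention_mask": attention_mask}
```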
Is that correct? Anything else?
Is there perhaps code from this or another repo that I can use?
Thanks!
Edit: Replaced "masked language modeling" with "causal language modeling".