We used fairseq directly to train the language model. To pretrain from scratch you'll need a large sequence database. See #33 for some pointers.
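In case it's useful, here is a rough sketch of what a from-scratch fairseq pretraining run could look like. This is my assumption of a generic fairseq RoBERTa-style masked-LM setup, not the exact recipe used for this model; all file names, paths, and hyperparameters are placeholders.

```bash
# Binarize raw sequence data (one sequence per line, tokens separated by
# spaces). Without --srcdict, fairseq builds a dictionary from the data.
fairseq-preprocess \
    --only-source \
    --trainpref train.txt \
    --validpref valid.txt \
    --destdir data-bin/my_seqs \
    --workers 8

# Pretrain a RoBERTa-style masked language model from scratch.
fairseq-train data-bin/my_seqs \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base \
    --sample-break-mode eos --tokens-per-sample 1024 \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 1.0 \
    --lr-scheduler polynomial_decay --lr 1e-4 \
    --warmup-updates 1000 --total-num-update 125000 --max-update 125000 \
    --batch-size 16 --update-freq 8 \
    --save-dir checkpoints/pretrain
```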
Hi Tom,
How large should the sequence database be for retraining the model? I have a sequence database consisting of around 3,000 sequences.
For pretraining you'd typically want to use very large sequence databases, since a language model of this size would easily memorize smaller datasets. With 3,000 sequences, you'd typically want to finetune a pre-trained language model instead.
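To make that concrete, here is a hedged sketch of finetuning in fairseq by continuing the masked-LM objective from an existing checkpoint on your own binarized data. Again an assumption rather than an official recipe: the checkpoint path and hyperparameters are placeholders, and your data must be binarized with the same dictionary the pretrained model used (pass it via --srcdict at the fairseq-preprocess step).

```bash
# Finetune: restore the pretrained weights, reset the optimizer state,
# and train for a small number of updates at a low learning rate.
fairseq-train data-bin/my_seqs \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base \
    --restore-file /path/to/pretrained_checkpoint.pt \
    --reset-optimizer --reset-dataloader --reset-meters \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 \
    --lr-scheduler polynomial_decay --lr 1e-5 \
    --warmup-updates 100 --total-num-update 2000 --max-update 2000 \
    --batch-size 8 \
    --save-dir checkpoints/finetune
```

With only ~3,000 sequences, keeping the update count small and the learning rate low helps avoid the memorization problem mentioned above.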
I just want to know how to train this model using only my own sequences. Thank you very much!