facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

retrain esm #137

Closed poppy-yebo closed 2 years ago

poppy-yebo commented 2 years ago

I just want to know how to train this model using only my own sequences. Thank you very much.

tomsercu commented 2 years ago

We used fairseq directly to train the language model. To pretrain from scratch you'll need a large sequence database; see #33 for some pointers.
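For context on what pretraining involves: ESM models are trained with a BERT-style masked-language-modeling objective over protein sequences. Below is a minimal, stdlib-only sketch of the masking step under the usual BERT recipe (15% of positions selected; of those, 80% replaced by a mask token, 10% by a random residue, 10% left unchanged). This is an illustration of the objective, not the actual fairseq/ESM masking code; the function name and token are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "<mask>"

def mask_sequence(seq, p=0.15, rng=None):
    """BERT-style masking sketch: select ~p of positions as prediction
    targets; of those, 80% become <mask>, 10% a random residue, 10%
    stay unchanged. Returns (masked tokens, {position: true residue})."""
    rng = rng or random.Random(0)
    tokens, targets = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < p:
            targets[i] = aa          # the model must predict this residue
            r = rng.random()
            if r < 0.8:
                tokens.append(MASK)  # 80%: replace with mask token
            elif r < 0.9:
                tokens.append(rng.choice(AMINO_ACIDS))  # 10%: random residue
            else:
                tokens.append(aa)    # 10%: keep original
        else:
            tokens.append(aa)
    return tokens, targets
```

The loss is then computed only at the target positions, which is why pretraining needs a large, diverse sequence database: the model sees each masked context essentially once.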

shikhar249 commented 2 years ago

Hi Tom,

How large should the sequence database be for retraining the model? I have a sequence database of around 3,000 sequences.

tomsercu commented 2 years ago

For pretraining you'd typically want very large sequence databases, since a language model of this size would easily memorize a small dataset. With 3,000 sequences, you'd typically want to finetune a pre-trained language model instead.
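A common finetuning pattern for a small dataset like this is to freeze the pre-trained trunk and train only a lightweight task head on top of its representations. The sketch below uses a tiny stand-in encoder so it is self-contained; in practice you would load an actual ESM checkpoint in its place (the model, head, and hyperparameters here are illustrative assumptions, not the ESM API).

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a pre-trained protein LM trunk."""
    def __init__(self, vocab=33, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
             for _ in range(2)]
        )

    def forward(self, x):
        h = self.embed(x)
        for layer in self.layers:
            h = layer(h)
        return h  # per-residue representations: (batch, length, dim)

encoder = TinyEncoder()

# Freeze the pre-trained trunk so only the task head is updated.
for p in encoder.parameters():
    p.requires_grad = False

head = nn.Linear(64, 2)  # e.g. a per-sequence binary classification head
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Toy batch: 4 integer-encoded sequences of length 10, with toy labels.
tokens = torch.randint(0, 33, (4, 10))
labels = torch.randint(0, 2, (4,))

reps = encoder(tokens).mean(dim=1)  # mean-pool residues into one vector
loss = nn.functional.cross_entropy(head(reps), labels)
loss.backward()                      # gradients flow only into the head
optimizer.step()
```

With the trunk frozen, the number of trainable parameters is small enough that a few thousand sequences can be enough; if more capacity is needed, one can unfreeze the top layer or two with a reduced learning rate.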