facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

About pretraining data amount #130

Closed YijiaXiao closed 3 years ago

YijiaXiao commented 3 years ago

Hi, thank you for your great work! I have read the related paper *MSA Transformer*. In the paper, the authors mention that they used 26 million MSAs, which is a large amount of data, and trained for 100k updates. I wonder whether the amount of data and the number of training steps matter a lot. If I want to pretrain an MSA Transformer model on a much smaller dataset, say 1M MSAs, will the performance (e.g. contact prediction) of the MSA model drop a lot? Thank you!

tomsercu commented 3 years ago

Most likely yes, it will be affected by the amount of training data. Other important factors are the diversity of the training data and the overlap between the pretraining set and the structures on which you perform the downstream contact prediction.
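As a point of reference before committing to a smaller pretraining run, one could first check contact-prediction quality of the released MSA Transformer checkpoint via the `esm` package. This is a minimal sketch: the MSA contents below are placeholders, and the checkpoint name assumes the public `esm_msa1b_t12_100M_UR50S` weights.

```python
import torch
import esm

# Load the released MSA Transformer checkpoint (placeholder choice; any MSA model works)
model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model = model.eval()
batch_converter = alphabet.get_batch_converter()

# A toy MSA: a list of (label, aligned_sequence) pairs.
# These sequences are placeholders; all rows must share the same aligned length.
msa = [
    ("query", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
    ("hom_1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
    ("hom_2", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
]

# The MSA batch converter expects a batch of MSAs (here a batch of one)
_, _, tokens = batch_converter([msa])

with torch.no_grad():
    # Predicted (num_res x num_res) contact probabilities for the query sequence
    contacts = model.predict_contacts(tokens)[0]

print(contacts.shape)
```

Comparing these predictions against a model pretrained on the smaller 1M-MSA set would give a direct measure of how much contact precision drops.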

YijiaXiao commented 3 years ago

Hi @tomsercu, thank you for your timely reply. It seems that pretraining does indeed need large quantities of diverse data. Thank you :)