facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

How much pre-training data is needed to get a decent Long_P@L score for unsupervised contact prediction? #270

Closed zhenyuhe00 closed 2 years ago

zhenyuhe00 commented 2 years ago

Hi, congrats again on your great work!

I used the pre-trained BERT-base checkpoint esm1_t12_85M_UR50S that you released (pretrained on over 20 million sequences) and tested its unsupervised contact prediction performance. Its Long_P@L is about 0.20–0.30.
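For reference, this is roughly how I run the evaluation. It is a minimal sketch rather than an official evaluation script: contacts are taken from the model's attention-based contact head, long-range means sequence separation ≥ 24, and the ground-truth contact map (here just a placeholder) is assumed to come from the structure (e.g. Cβ–Cβ distance < 8 Å).

```python
# Sketch: load esm1_t12_85M_UR50S, predict contacts, and score long-range precision at L.
import torch
import esm
import numpy as np

model, alphabet = esm.pretrained.esm1_t12_85M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"  # example sequence
_, _, tokens = batch_converter([("query", sequence)])

with torch.no_grad():
    results = model(tokens, return_contacts=True)
pred = results["contacts"][0].numpy()  # L x L predicted contact probabilities

def long_p_at_l(pred, true, min_sep=24):
    """Precision of the top-L predicted pairs with sequence separation >= min_sep."""
    L = pred.shape[0]
    pairs = [(i, j) for i in range(L) for j in range(i + min_sep, L)]
    pairs.sort(key=lambda ij: pred[ij], reverse=True)
    top = pairs[:L]
    return sum(true[i, j] for i, j in top) / len(top)

# `true` should be the binary L x L contact map from the ground-truth structure;
# a zero placeholder is used here so the snippet is self-contained.
true = np.zeros_like(pred)
print("Long_P@L:", long_p_at_l(pred, true))
```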

However, I also pre-trained my own BERT-base with 85M parameters, and its Long_P@L on the same test set is less than 0.05. The differences between my BERT-base and your esm1_t12_85M_UR50S are: mine uses post-norm, a crop size of 384, and 0.5 million pretraining sequences, whereas esm1_t12_85M_UR50S uses pre-norm, a crop size of 1024, and over 20 million pretraining sequences (the pre-norm vs. post-norm difference is sketched below).
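For anyone unfamiliar with the terminology, a rough sketch of the two residual orderings; the module structure is illustrative (attention sublayer only, the feed-forward sublayer follows the same pattern) and is not ESM's actual implementation:

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Post-norm (original Transformer): apply LayerNorm to the residual sum."""
    def __init__(self, d_model, nhead):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.attn(x, x, x)[0])

class PreNormBlock(nn.Module):
    """Pre-norm (what the comparison above describes for esm1_t12_85M_UR50S):
    apply LayerNorm before the sublayer, keep the residual path unnormalized."""
    def __init__(self, d_model, nhead):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h, h, h)[0]
```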

I wonder why my BERT-base is so much worse than yours. Is it because of the different amount of pretraining data?

Thanks in advance!

naailkhan28 commented 2 years ago

How are you sampling those 500k sequences? Are they drawn from UR50 clusters or chosen arbitrarily? If you've trained on a much smaller, less diverse set of sequences, that could explain the worse performance. One diversity-preserving approach is sketched below.
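For example, one way to preserve diversity is to keep one representative per UniRef50 cluster rather than sampling sequences uniformly. A rough sketch follows; the tab-separated cluster-membership file and its column layout are hypothetical, not a file shipped with the esm repo:

```python
# Sketch: pick one representative per UniRef50 cluster, then subsample 500k clusters.
# Assumes a TSV mapping file with columns: cluster_id, member_accession (hypothetical layout).
import csv
import random

def sample_cluster_representatives(mapping_path, n_clusters=500_000, seed=0):
    reps = {}  # cluster_id -> one member accession
    with open(mapping_path) as f:
        for cluster_id, member in csv.reader(f, delimiter="\t"):
            reps.setdefault(cluster_id, member)  # keep the first member seen per cluster
    random.seed(seed)
    chosen = random.sample(list(reps), k=min(n_clusters, len(reps)))
    return {c: reps[c] for c in chosen}

# representatives = sample_cluster_representatives("uniref50_members.tsv")
```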