how much pre-training data is needed to get a decent Long_P@L score for unsupervised contact prediction?

Hi, Congrats again on your great work!

I use the pre-trained Bert base checkpoint esm1_t12_85M_UR50S pretrained on over 20 million sequences you guys released and test its unsupervised contact prediction performance. The Long_P@L of it is about 0.20~0.30.

However, I also pre-trained a Bert-base of 85M parameters but the Long_P@L of it on the same test dataset is less than 0.05. The difference between my Bert base and your esm1_t12_85M_UR50S is that my Bert base is post norm, crop size 384, pretrained on 0.5 million sequences (esm1_t12_85M_UR50S is pre norm, crop size 1024, pretrained on over 20million sequences).

I wonder why my Bert base is much worse than your Bert base, is it because of the different amount of pretraining sequences?

Thanks in advance!

facebookresearch / esm

how much pre-training data is needed to get a decent Long_P@L score for unsupervised contact prediction? #270