facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

"unsupervised" contact prediction #96

Closed: intersun closed this issue 3 years ago

intersun commented 3 years ago

Thanks for the awesome work!

I do have one question after reading the paper; there might be some misunderstanding on my side. For the "unsupervised" contact prediction proposed in the paper, labeled data is used to fit the logistic regression, so why is it called unsupervised contact prediction? Since Potts models (e.g. CCMpred) do not use labels at all, is the comparison unfair?

Thanks again.

alexrives commented 3 years ago

Thank you for your interest in our work! We use the term unsupervised because the contacts are learned directly via the unsupervised language modeling objective. The point we are making is that the logistic regression is not learning the contacts, the language modeling is.
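For concreteness, here is a minimal sketch of pulling both the contact predictions and the raw attention maps from a pretrained model, assuming the `esm.pretrained` API shown in this repo's README:

```python
import torch
import esm

# Load a pretrained ESM-1b model and its alphabet (API as in the repo README).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")]
_, _, batch_tokens = batch_converter(data)

with torch.no_grad():
    results = model(batch_tokens, return_contacts=True)

# "contacts" are the L x L probabilities from the regression head over attention;
# "attentions" are the raw per-layer, per-head maps that the regression is fit on
# (attention over tokens, so special token positions are still included).
contacts = results["contacts"][0]
attentions = results["attentions"][0]
```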

Note that the attention heads predict contacts directly without using the regression weights. Results for averaging the top 1, 5, and 10 heads are shown in Table 2. Simply averaging the top 5 heads already performs better than training a Potts model using the same sequence database used for training ESM (in Table 2 compare Gremlin on ESM data to the lines for top-5 and top-10 heads). When logistic regression weights are used, they are fit with just 20 proteins. This improves performance further over averaging the heads. Figure 12 shows bootstrap results indicating that any randomly selected 20 proteins produce similar results. In the Low-N supervision section we show that the regression can be fit with even a single example.
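To make the head-averaging comparison in Table 2 concrete, here is a rough sketch (not the paper's exact code) of scoring contacts from a chosen set of heads. It assumes `attentions` is a (layers, heads, L, L) tensor for a single sequence and `top_heads` is a hypothetical list of (layer, head) indices ranked by contact precision:

```python
import torch

def apc(x):
    """Average product correction, as commonly applied to coevolution/attention maps."""
    a1 = x.sum(-1, keepdim=True)
    a2 = x.sum(-2, keepdim=True)
    a12 = x.sum((-1, -2), keepdim=True)
    return x - (a1 * a2) / a12

def contacts_from_heads(attentions, top_heads):
    """Score contacts by averaging symmetrized, APC-corrected attention maps.

    attentions: tensor of shape (layers, heads, L, L) for one sequence.
    top_heads:  hypothetical list of (layer, head) pairs, e.g. the 5 heads
                with the highest contact precision.
    """
    maps = torch.stack([attentions[l, h] for l, h in top_heads])  # (k, L, L)
    maps = apc(maps + maps.transpose(-1, -2))  # symmetrize, then APC, per head
    return maps.mean(dim=0)                    # (L, L) contact score
```

The logistic-regression variant described above simply replaces this unweighted average over a few selected heads with learned weights over the heads, fit on residue pairs from roughly 20 proteins.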

All of this is evidence that the contacts are learned by the unsupervised pre-training, which makes Potts models the natural comparison.