facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Discrepancy between ESMFold predicted `distogram_logits` and final output atomic coordinates #635

Open Hejl19 opened 7 months ago

Hejl19 commented 7 months ago

Thank you for your excellent work! I've run into an issue while using ESMFold (esm2_650M) and hope you can help clarify it.

First, I used ESMFold to predict the structure of the following sequence: "MTILSKDRISLQTSASDKADAIRKAGQLLVATGCVLPEYVDGMLAREQSMSTSLGNGVAIPHGVYENRGHILKTGISVLQLPAGVDWDEGE". The prediction (result.pdb) was as expected, with an average pLDDT of around 80. I then tried to compare `structure['distogram_logits']`, as predicted by ESMFold, with the final output atomic coordinates:

I mainly used `openfold.np.protein.from_pdb_string`, `openfold.data.data_pipeline.make_pdb_features`, and `openfold.data.feature_pipeline.np_example_to_feature` to calculate the C-beta distances (`true_distance`) between residues in the result.pdb file produced by ESMFold. On checking, this step appears correct: the `pseudo_beta` values returned by openfold match the coordinates in result.pdb.
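For readers who want to reproduce this step without the openfold pipeline, here is a minimal NumPy sketch of the same idea. It assumes the usual pseudo-beta convention (C-beta for every residue, C-alpha for glycine, which has no C-beta); the helper names and the toy coordinates are hypothetical and are not openfold's API.

```python
import numpy as np


def pseudo_beta(ca, cb, is_gly):
    """Pick CB for each residue, falling back to CA for glycine.

    ca, cb: [L, 3] coordinate arrays; is_gly: [L] boolean mask.
    """
    return np.where(is_gly[:, None], ca, cb)


def pairwise_distances(points):
    """Return the [L, L] matrix of Euclidean distances between rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Toy input (hypothetical coordinates): three residues, the second a glycine.
ca = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]])
cb = np.array([[0.0, 1.5, 0.0], [0.0, 0.0, 0.0], [7.6, 1.5, 0.0]])
is_gly = np.array([False, True, False])

true_distance = pairwise_distances(pseudo_beta(ca, cb, is_gly))
```

In a real run, `ca`, `cb`, and `is_gly` would come from parsing result.pdb rather than from hand-written arrays.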

Then I calculated `true_bins` with min_bin=2.3125, max_bin=21.6875, num_bins=64, following `openfold.utils.loss.distogram_loss` (this also seems correct), and compared it with `structure['distogram_logits'].argmax(dim=-1)` from the same ESMFold prediction. However, I found a significant discrepancy:

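[image: plot comparing `true_bins[0,:]`, `disto_logits.argmax(dim=-1)[0,0,:]`, and `recycle_bins[0,0,:]`]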

In the image above, `true_bins[0,:]` and `disto_logits.argmax(dim=-1)[0,0,:]` differ substantially. The former gives the binned distances between residue 0 and every other residue, computed from result.pdb; the latter gives the predicted bins for the same residue pairs, taken from `structure['distogram_logits']`.
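The binning and the comparison described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the openfold or ESMFold code: it assumes the bin index is the number of boundaries a distance exceeds, with `num_bins - 1` evenly spaced boundaries between `min_bin` and `max_bin`. Comparing plain distances is equivalent to comparing squared distances against squared boundaries. The names `distance_bins` and `bin_agreement` and the tolerance `tol` are hypothetical.

```python
import numpy as np


def distance_bins(dist, min_bin=2.3125, max_bin=21.6875, num_bins=64):
    """Map distances (in angstroms) to integer bin indices in [0, num_bins - 1]."""
    boundaries = np.linspace(min_bin, max_bin, num_bins - 1)
    # Count how many boundaries each distance exceeds.
    return (dist[..., None] > boundaries).sum(axis=-1)


def bin_agreement(logits, true_bins, tol=1):
    """Return argmax bins and the fraction of pairs within `tol` bins of truth."""
    pred_bins = logits.argmax(axis=-1)
    frac = float((np.abs(pred_bins - true_bins) <= tol).mean())
    return pred_bins, frac


# Toy distance matrix (hypothetical values) for three residues.
dist = np.array([[0.0, 2.5, 10.0],
                 [2.5, 0.0, 25.0],
                 [10.0, 25.0, 0.0]])
true_bins = distance_bins(dist)

# One-hot stand-in for logits that agree perfectly with true_bins.
logits = np.eye(64)[true_bins]
pred_bins, frac = bin_agreement(logits, true_bins)
```

With real data, `logits` would be `structure['distogram_logits'][0]` converted to a NumPy array, and a low `frac` would quantify the discrepancy visible in the plot.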

Following the approach in `esmfold.v1.trunk`, I also calculated bins from the predicted positions after recycling: `recycle_bins = FoldingTrunk.distogram(structure["positions"][-1][:, :, :3], 2.3125, 21.6875, 64)`. The result is shown as `recycle_bins[0,0,:]` in the image above, and it is very close to `true_bins`.
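To make clear what a backbone-based distogram of this kind computes, here is a hedged NumPy sketch. It assumes the input holds N, CA, and C coordinates per residue, and that a virtual C-beta is placed with the idealized-geometry coefficients commonly used in structure-prediction code. Whether these match `FoldingTrunk.distogram` term for term is an assumption, so check against the esm source before relying on it. The function names are hypothetical.

```python
import numpy as np


def virtual_cb(n, ca, c):
    """Place an idealized C-beta from backbone N, CA, C coordinates ([L, 3] each)."""
    b = ca - n
    c_vec = c - ca
    a = np.cross(b, c_vec)
    # Commonly used ideal-geometry coefficients (assumed, see lead-in).
    return -0.58273431 * a + 0.56802827 * b - 0.54067466 * c_vec + ca


def backbone_distogram(coords, min_bin=2.3125, max_bin=21.6875, num_bins=64):
    """coords: [L, 3, 3] array of (N, CA, C) positions. Returns [L, L] bin indices."""
    cb = virtual_cb(coords[:, 0], coords[:, 1], coords[:, 2])
    sq_dist = ((cb[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1, keepdims=True)
    boundaries = np.linspace(min_bin, max_bin, num_bins - 1) ** 2
    return (sq_dist > boundaries).sum(axis=-1)


# Toy backbone (hypothetical): one residue, plus a copy shifted 10 angstroms along x.
residue = np.array([[-0.525, 1.363, 0.0],   # N
                    [0.0, 0.0, 0.0],        # CA
                    [1.526, 0.0, 0.0]])     # C
coords = np.stack([residue, residue + np.array([10.0, 0.0, 0.0])])

recycle_bins = backbone_distogram(coords)
```

Because the virtual C-beta depends only on the backbone, bins computed this way should track bins computed from the final coordinates closely, which is consistent with the observation above.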

It makes sense that `recycle_bins` is very close to `true_bins`, since ESMFold builds result.pdb mainly from `structure['positions']`. However, I'm curious why the predicted `distogram_logits` differ so much from `true_bins`. Since the training code has not been released, I'm unsure how the `distogram_head` parameters are optimized. I would appreciate your clarification on this. Thanks!