Thank you for your excellent work! However, I've encountered an issue while using ESMFold (esm2_650M) and I hope you can help clarify.
First, I used ESMFold to predict the structure of a specific sequence ("MTILSKDRISLQTSASDKADAIRKAGQLLVATGCVLPEYVDGMLAREQSMSTSLGNGVAIPHGVYENRGHILKTGISVLQLPAGVDWDEGE"). The prediction result (result.pdb) was as expected, with an average pLDDT of around 80. I then tried to compare structure['distogram_logits'], as predicted by ESMFold, with the final output atomic coordinates:
I mainly used the openfold.np.protein.from_pdb_string, openfold.data.data_pipeline.make_pdb_features, and openfold.data.feature_pipeline.np_example_to_feature methods to calculate the C-beta distances (true_distance) between residues in the "result.pdb" file produced by ESMFold. I verified this step, and it appears correct (the pseudo_beta values provided by openfold match the coordinates in "result.pdb").
Then, I calculated true_bins using parameters min_bin=2.3125, max_bin=21.6875, num_bins=64 (I followed openfold.utils.loss.distogram_loss to calculate true_bins and it seems correct) and tried to compare it with structure['distogram_logits'].argmax(dim=-1) from the ESMFold prediction process. However, I found a significant discrepancy:
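For reference, the binning step can be sketched roughly as follows. This is a minimal NumPy sketch of the logic as I understand it from openfold.utils.loss.distogram_loss, not the OpenFold code itself; the function name distance_bins and the toy coordinates are hypothetical.

```python
import numpy as np

def distance_bins(coords, min_bin=2.3125, max_bin=21.6875, num_bins=64):
    """Assign each residue pair to a distance bin (hypothetical helper).

    coords: array of shape (L, 3), one representative atom per residue.
    Returns an integer array of shape (L, L) with values in [0, num_bins - 1].
    """
    # num_bins - 1 boundaries, squared so they compare against squared distances.
    boundaries = np.linspace(min_bin, max_bin, num_bins - 1) ** 2
    diff = coords[:, None, :] - coords[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1, keepdims=True)
    # The bin index is the number of boundaries the squared distance exceeds.
    return np.sum(sq_dist > boundaries, axis=-1)

# Toy coordinates (made up for illustration), in Angstroms.
toy_coords = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [30.0, 0.0, 0.0]])
toy_bins = distance_bins(toy_coords)
```

With these parameters the bin width is 0.3125 Å, bin 0 collects everything at or below 2.3125 Å, and bin 63 collects everything beyond 21.6875 Å.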
In the above image, there is a substantial difference between true_bins[0,:] and disto_logits.argmax(dim=-1)[0,0,:]. The former represents the actual distances between residue 0 and all other residues according to result.pdb, while the latter represents the distances for the same residue pairs as predicted by structure['distogram_logits'].
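To put a number on the discrepancy rather than judging it by eye, a comparison along these lines could be used. This is only a sketch: the helper name bin_agreement, the tolerance parameter, and the toy arrays are hypothetical, and the logits are assumed to have shape (L, L, num_bins) after dropping the batch dimension.

```python
import numpy as np

def bin_agreement(logits, reference_bins, tolerance=1):
    """Compare argmax distogram bins against reference bins (hypothetical helper).

    logits: array of shape (L, L, num_bins).
    reference_bins: integer array of shape (L, L).
    Returns the predicted bins and the fraction of residue pairs whose
    predicted bin lies within `tolerance` bins of the reference.
    """
    predicted_bins = np.argmax(logits, axis=-1)
    diff = np.abs(predicted_bins - reference_bins)
    return predicted_bins, float(np.mean(diff <= tolerance))

# Toy example with 2 residues and 4 bins (values made up for illustration).
toy_logits = np.zeros((2, 2, 4))
toy_logits[0, 0, 0] = 1.0
toy_logits[0, 1, 3] = 1.0
toy_logits[1, 0, 3] = 1.0
toy_logits[1, 1, 0] = 1.0
toy_reference = np.array([[0, 2], [1, 0]])
toy_predicted, toy_fraction = bin_agreement(toy_logits, toy_reference)
```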
Following the approach in esmfold.v1.trunk, I calculated the bins based on predicted positions using recycle_bins = FoldingTrunk.distogram(structure["positions"][-1][:, :, :3], 2.3125, 21.6875, 64,) after recycling. The result is recycle_bins[0,0,:] shown in the above image. I found that recycle_bins is very close to true_bins.
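As far as I can tell, FoldingTrunk.distogram does not read a C-beta atom directly. It rebuilds a virtual C-beta from the N, CA, and C backbone atoms and then bins the pairwise distances. Below is a minimal NumPy sketch of that reconstruction; the coefficients are the ones I believe esmfold.v1.trunk uses (please treat them as an assumption), and the function name virtual_cbeta and the idealized backbone coordinates are hypothetical.

```python
import numpy as np

def virtual_cbeta(backbone):
    """Rebuild a virtual C-beta from backbone atoms (hypothetical helper).

    backbone: array of shape (..., 3, 3) holding N, CA, C coordinates.
    Returns an array of shape (..., 3).
    """
    n, ca, c = backbone[..., 0, :], backbone[..., 1, :], backbone[..., 2, :]
    b = ca - n
    c_vec = c - ca
    a = np.cross(b, c_vec)
    # Coefficients assumed to match those used in esmfold.v1.trunk.
    return -0.58273431 * a + 0.56802827 * b - 0.54067466 * c_vec + ca

# One residue with idealized backbone geometry (illustrative values, in Angstroms).
toy_backbone = np.array([[[-0.525, 1.363, 0.0],   # N
                          [0.0, 0.0, 0.0],        # CA
                          [1.526, 0.0, 0.0]]])    # C
toy_cb = virtual_cbeta(toy_backbone)
```

Since true_bins comes from openfold's pseudo_beta (the actual C-beta, or CA for glycine) and recycle_bins comes from this virtual C-beta, small differences between the two are to be expected, which is consistent with what I observe.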
It makes sense that recycle_bins is very close to true_bins since ESMFold constructs "result.pdb" mainly based on structure['positions']. However, I'm curious why predicted distogram_logits differs greatly from true_bins. Since you have not released your training code, I'm uncertain about how you optimize the distogram_head parameters. I would appreciate your clarification on this matter. Thanks!