PaddlePaddle / PaddleHelix

Bio-Computing Platform Featuring Large-Scale Representation Learning and Multi-Task Deep Learning (“螺旋桨” bio-computing toolkit)

HelixFold3 predicts unreasonable structures #343

Open Garhorne0813 opened 2 months ago

Garhorne0813 commented 2 months ago

Hello! I noticed that HelixFold3 tends to predict unreasonable structures for regions without templates (shown as dashed lines in the image), both in the online service and in the open-source version. Do you have a solution for this issue?

Below is the amino acid sequence of the protein:

MANQALSVSVGNALRRVRSYLFLVRGMGQLLRRRLDPTVRAQPAVIVLSLGSKGSSARVAAAARARGYRVVVFCAELPFAEARYMDHYHRIDCVTDFDKALETARGYAPEAILLEGKNRLLPMQNNLAQTLGVTAVGNAAVKSSNSKIDLHASLDRAGLANLPWEILPEDGRSKLSFPVVSKPDVGTSSMGVQYLDSLDTFRNDKAYWDKVAQDTDIDGQIMLESYIDGRQFDVEGVARDGAFHILTVVEEYYQNAAPYFPPSWFLFNPPIPEEQRARLEKRVEEALKAFGVTVGGWHCESRFSDEKYGDGSLRPGIAGNEIYVLDYANRMGYNQLVSESCGADFAGAYVDTMLPRPFSPPQITRRSVLQIMIRDTETLRRAKALAQARPDVVHRGAFVPFEFSAHTYFGHIVLSCPDFETLRDALAAHDLIPDTWAGFYPDAMAGA

Here is the visualization of HelixFold3's prediction:

[Image: 20240905151055]

magnusbauer commented 2 months ago

I get similar results when the MSAs are shallow. It is a strange issue: the sequence order is preserved in the PDB/CIF indexing, but the atom positions of residues are swapped in the structure. This is why PyMOL has trouble displaying it correctly (the dashed lines). Essentially, a residue is placed at the correct position but carries the wrong residue identity. You won't see this in the sequence order of the PDB file, only by following the amino acid chain through the structure residue by residue, as shown below.

[Image: Screenshot 2024-09-08 at 3 29 07 PM]
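
One way to find such spots without inspecting the structure by eye is to measure the distance between the CA atoms of residues that are adjacent in the file's numbering. This is a minimal sketch, not part of HelixFold3: it assumes a PDB-format file with standard fixed-width ATOM records, the function names are made up for illustration, and the 4.2 Å cutoff is an assumed threshold (consecutive CA atoms in a trans peptide sit roughly 3.8 Å apart).

```python
import math


def ca_atoms(pdb_text):
    """Collect (chain, resseq, (x, y, z)) for every CA atom, in file order."""
    atoms = []
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name 13-16, chain 22, resSeq 23-26,
        # x/y/z in columns 31-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((line[21], int(line[22:26]), xyz))
    return atoms


def find_breaks(pdb_text, max_ca_dist=4.2):
    """Return (chain, resseq_i, resseq_j, distance) for sequence-adjacent
    residues whose CA atoms are farther apart than max_ca_dist (assumed cutoff)."""
    atoms = ca_atoms(pdb_text)
    breaks = []
    for (c1, i1, p1), (c2, i2, p2) in zip(atoms, atoms[1:]):
        # Only compare residues that the file numbers as direct neighbours.
        if c1 != c2 or i2 != i1 + 1:
            continue
        d = math.dist(p1, p2)
        if d > max_ca_dist:
            breaks.append((c1, i1, i2, round(d, 2)))
    return breaks
```

Calling `find_breaks` on the text of a predicted PDB file lists the residue pairs where the numbering says "neighbours" but the coordinates disagree, which is where a viewer would draw dashed lines.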

Fairly commented 2 months ago

@Garhorne0813 @magnusbauer Thank you for the feedback. We are currently investigating the issue. We have observed that this is more likely to occur with shallower MSAs and shorter sequences, where the predicted pLDDT for part of the structure tends to be very low and the model connects atoms incorrectly. We are not yet certain, but the problem may be due to underfitting of the model. We are actively working on improving it.
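
To see which regions fall into the low-confidence range mentioned above, the per-residue scores can be read back from the output file. The sketch below is an illustration under stated assumptions rather than a HelixFold3 API: it assumes the prediction is available in PDB format and that pLDDT (0 to 100) is written to the B-factor column, as AlphaFold-style outputs do; the thread does not confirm this for HelixFold3. The function name and the cutoff of 50 are also assumptions.

```python
def low_confidence_residues(pdb_text, cutoff=50.0):
    """Return (chain, resseq, score) for residues whose CA B-factor is below cutoff.

    Assumption: the B-factor column (PDB columns 61-66) holds pLDDT on a 0-100 scale.
    """
    flagged = []
    for line in pdb_text.splitlines():
        # One score per residue: read it from the CA atom only.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            score = float(line[60:66])
            if score < cutoff:
                flagged.append((line[21], int(line[22:26]), score))
    return flagged
```

If the flagged residues coincide with the stretches drawn as dashed lines, that would be consistent with the low-pLDDT observation in this comment.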