aqlaboratory / openfold

Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
Apache License 2.0
2.74k stars 520 forks source link

Strong inaccuracy in calculation of atom positions from frames&torsion-angles #237

Closed moritzschaefer closed 1 year ago

moritzschaefer commented 1 year ago

OpenFold/AF2 provide extensive and welll-documented code bases for working with protein structures, beyond structure prediction purposes. Thanks a lot for this work!

I wanted to make use of OpenFold's implementation for converting betwene atom positions and frames/torsion-angles. To test this, I wrote a little test script that

  1. loads atom positions
  2. calculates AA-based frames and torsion angles
  3. calculates new atom positions from the frames and torsion angles
  4. checks whether the two sets of atom positions are identical (as one would expect)

Although I would expect some numerical errors from angle-based positions calculations, the error is much larger than what seems reasonable: Figure_1

I commented and uploaded the script here along with the example protein (the script is on the bottom): https://gist.github.com/moritzschaefer/2b2c97c266b93d0df2559ab1b9ef512f

I (unsuccesfully) checked multiple things to pinpount the error:

Note: the atom error seems to increase with additional angles (e.g. Chi4-based atoms tend to be more off, as they depend on chi3, chi2 and chi1). The C-alpha atoms show no deviation (RMSD < 1e-8)

Is there something I missed?

gahdritz commented 1 year ago

Have you checked whether this also occurs w/ AF2 code?

moritzschaefer commented 1 year ago

@gahdritz, I now also tried it with the AF2 code and came to a similar result. Here the maximum distance/RMSD between two atoms was 0.56 (so a bit less than in the AF2code).

Would be great if you could have a look if my code looks alright. It's hard to tell from my perspective whether I made a mistake or whether this is a bug

test_alphafold.py

jonathanking commented 1 year ago

I am not very familiar with AlphaFold/OpenFold implementation details, so I could be mistaken here, but I wonder if this issue results from the fact that the structure module uses idealized bond lengths and angles to construct sidechain atoms (see Section 1.8.4 AlphaFold Supplementary Material).

For instance, I see that you are using what seems to be the PDB file as downloaded from the RCSB website. This file likely does not have idealized bond lengths or angles. When converting to frames/angles and then back to coordinates, I am guessing that the new structure is different because it was rebuilt using the idealized angles.

If you repeat the process by taking your resultant structure (after rebuilding), converting it to angles, and converting back to structure, I hope that the final structure will be identical to the first rebuilt structure. I am not quite familiar with the code enough to test this myself.

Of course, perhaps you've already considered this! I am chiming in because I have encountered this in my own work - idealized bond lengths and angles can significantly impact reconstruction loss when using an angle-only representation of protein structure.

moritzschaefer commented 1 year ago

Hey @jonathanking, indeed I overlooked this. I'll check this and report back here. Thank you for the reply!

Edit:

Indeed, when I use my script to create and store an idealized-length structure, and then analyze that idealized-length-structure in a second run (coordinates->angles->coordinates), I do get a much-much lower error (95% of RMSDs < 0.002).

Do you know at which step in their dataset generation pipeline AlphaFold converts their PDBs towards idealized bond lengths? Their Supplementar Materials does not seem to provide any information on this..

jonathanking commented 1 year ago

I'm glad that seemed to explain some of the discrepancies!

No, I don't know when or even if AlphaFold handles idealized bond lengths or angles in the structures they train on. A quick search in their repo for "ideal/idealized" didn't turn up anything particularly telling either.

It's my guess that they simply trained the model to predict the crystal structure/existing PDB files without strictly accounting for the differences between the bond lengths/angles in the target structure and the angles/bonds in their prediction. In this manner, the model can certainly be trained but would have some limitations due to these differences. Perhaps they hoped relaxation would resolve the remaining differences.