biotite-dev / biotite

A comprehensive library for computational molecular biology
https://www.biotite-python.org
BSD 3-Clause "New" or "Revised" License
641 stars 101 forks source link

[BUG] Biotite's bond inference seems broken for some structures with specific insertion codes #592

Closed ljarosch closed 3 months ago

ljarosch commented 3 months ago

I tried parsing the .cif file for structure 6evv in Biotite, but the molecule iterator and bond perception seem to be broken. To reproduce this behavior:

from biotite.structure.io import pdbx
import biotite.database.rcsb as rcsb
from biotite.structure import molecule_iter

file = rcsb.fetch("6evv", "cif")
cif_file = pdbx.CIFFile.read(file)
atom_arr = pdbx.get_structure(cif_file, model=1, include_bonds=True)

it = molecule_iter(atom_arr)
print(next(it))
print("---------")
print(next(it))

Output:

    L       1H THR N      N         3.990  -31.682  -62.902
    L       1H THR CA     C         4.747  -30.489  -63.429
    L       1H THR C      C         3.744  -29.388  -63.820
    L       1H THR O      O         2.661  -29.662  -64.349
    L       1H THR CB     C         5.656  -30.853  -64.649
    L       1H THR OG1    O         6.737  -29.921  -64.765
    L       1H THR CG2    C         4.873  -30.909  -65.973
---------
    L       1G PHE N      N         4.112  -28.144  -63.555
    L       1G PHE CA     C         3.253  -27.001  -63.843
    L       1G PHE C      C         2.819  -26.919  -65.318
    L       1G PHE O      O         1.637  -26.760  -65.619
    L       1G PHE CB     C         4.007  -25.739  -63.432
    L       1G PHE CG     C         3.189  -24.485  -63.466
    L       1G PHE CD1    C         3.059  -23.757  -64.631
    L       1G PHE CD2    C         2.597  -24.005  -62.314
    L       1G PHE CE1    C         2.319  -22.587  -64.656
    L       1G PHE CE2    C         1.860  -22.837  -62.329
    L       1G PHE CZ     C         1.723  -22.126  -63.502

The molecule iterator returns the first residues as individual molecules even though they're all part of the same chain (the same problem occurs for many other not shown residues too). Also checking the bond list on the C atom of the first residue shows no associated peptide bond to the N atom of the next residue:

print(atom_arr.bonds.get_bonds(2))
(array([1, 3], dtype=uint32), array([1, 2], dtype=uint8))

Maybe a potential reason for this could be the strange insertion code format of this structure which seems to count backwards from H to A(?) This is a relatively serious bug as it would result in parsing way too many disconnected molecules for this entry.

I'm still relatively new to Biotite so would appreciate any advice if there is a better way to infer bonds from PDB entries.

padix-key commented 3 months ago

I can confirm this bug, thanks for reporting. The problem is not directly the insertion code but the fact the those residues with insertion code have the same residue ID. I will create a fix for this.

For now you can set use_author_fields=False in get_structure() to get the expected behavior, as the residue IDs are unique for each residue in a chain in this case.

ljarosch commented 3 months ago

Hi @padix-key, thanks for the quick fix! Biotite seems like a great package and we look forward to giving it a try within the OpenFold project.

padix-key commented 3 months ago

Always happy to help 😃