biotite-dev / biotite

A comprehensive library for computational molecular biology
https://www.biotite-python.org
BSD 3-Clause "New" or "Revised" License
660 stars 101 forks source link

missing bonds if use_author_fields=False when two same CCD together in wwPDB .cif file #553

Open 0ut0fcontrol opened 6 months ago

0ut0fcontrol commented 6 months ago
5hu8 use_author_fields=True use_author_fields=False
/tmp/5hu8.cif /tmp/chain_j_author.mol /tmp/chain_j_label.mol
# biotite=0.40.0
import biotite.structure.io.pdbx as pdbx
from biotite.database import rcsb
import biotite.structure as struc
import biotite.structure.io as strucio

try:
    cif_file = pdbx.CIFFile.read(f"/tmp/5hu8.cif")
except:
    rcsb.fetch("5hu8", "cif", target_path="/tmp")
    cif_file = pdbx.CIFFile.read("/tmp/5hu8.cif")

# label_asym_id = author_asym_id = J
# NAG-NAG-FUC
structue_author = pdbx.get_structure(cif_file, model=1, use_author_fields=True, include_bonds=True)
chain_J_author = structue_author[structue_author.chain_id == "J"]
residue_num = struc.get_residue_count(chain_J_author)
print("author:", residue_num)
# output author: 3
strucio.save_structure("/tmp/chain_j_author.mol", chain_J_author)

structue = pdbx.get_structure(cif_file, model=1, use_author_fields=False, include_bonds=True)
chain_J = structue[structue.chain_id == "J"]
residue_num = struc.get_residue_count(chain_J)
print("mmcif label:", residue_num)
# mmcif label: 2
strucio.save_structure("/tmp/chain_j_label.mol", chain_J)
image

How to get bonds right when use_author_fields=False?

0ut0fcontrol commented 6 months ago

I want to use label_seq_id and bond order from mmcif.

replacing res_id with auth_seq_id in get_structure(), make bonds right when set use_author_fields=False.

image

It is ugly 😂.

Better way to handle this?

padix-key commented 6 months ago

Hi, thanks for the report. I dived a bit into this problem and my conclusion is that if use_author_fields=False both NAGs are considered as the same residue: The start of a new residue is automatically detected when one of chain_id, res_id, ins_code or res_name changes. As you see in the CIF nothing of these points are true in this edge case. Hence, both NAGS are considered the same residue. As a consequence, only the first occurrence of each atom is connected to its corresponding bond partner.

I would consider this a bug in get_residue_starts(). However, this one is rather cursed to fix: We could add an additional condition, which checks if the same atom name appears again in a residue. However, this check would be rather computationally expensive, so I assume it would slow down a number of functionalities in Biotite, such as adding bonds to structures.

For now a solution for your concrete problem would be to use author_fields=True, and add the label_* fields of interest to extra_fields. If you want you can later overwrite the annotation arrays with the label_* ones, e.g.

atoms.chain_id = atoms.label_asym_id
0ut0fcontrol commented 6 months ago

Your solution appears to be cleaner (with no biotite source code alterations). Thank you!