biotite-dev / biotite

A comprehensive library for computational molecular biology
https://www.biotite-python.org
BSD 3-Clause "New" or "Revised" License

Omit 'standard' bonds when writing `struct_conn` category #678

padix-key closed 1 month ago

padix-key commented 1 month ago

This PR speeds up reading of PDBx files containing a structure with bonds that were written by Biotite.

Currently, `pdbx.set_structure(include_bonds=True)` writes all inter-residue bonds to the `struct_conn` category to make sure that no required bond is missing, because the PDBx dictionary describes rather vaguely which bonds belong in this category:

> Nonstandard residue linkage. The LINK records specify connectivity between residues that is not implied by the primary structure.

This means that the `struct_conn` category can become much larger than in the original file from the PDB.

When the `struct_conn` category is parsed again, each of its rows is matched against each row in `atom_site`. While this can be done in a fast, vectorized manner using boolean matrices of shape (n_atoms x n_bonds), it does not scale well when `struct_conn` is 'verbose' as described above: the matrices then effectively have a shape of approximately (n_atoms x n_residues), because there is one bond for each residue linkage. As n_residues is approximately proportional to n_atoms, the size of the boolean matrices, and thus the time complexity, becomes O(n^2) in the number of atoms n. Therefore, the time for parsing inter-residue bonds grows quadratically and becomes prohibitive for larger structures.
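The matching step can be illustrated with a toy NumPy sketch. This is not Biotite's actual parser: the column arrays and the helper `find_atom_indices()` are hypothetical and only show how a boolean matrix with one row per bond and one column per atom arises.

```python
import numpy as np

# Toy 'atom_site' columns: three residues with four backbone atoms each
res_id = np.repeat([1, 2, 3], 4)
atom_name = np.tile(["N", "CA", "C", "O"], 3)

# Toy 'struct_conn' columns: one peptide bond (C -> N) per residue linkage
conn_res_id_1 = np.array([1, 2])
conn_name_1 = np.array(["C", "C"])
conn_res_id_2 = np.array([2, 3])
conn_name_2 = np.array(["N", "N"])


def find_atom_indices(conn_res_id, conn_atom_name):
    """Match each bond partner against every atom (hypothetical helper)."""
    # Broadcasting yields a boolean matrix of shape (n_bonds, n_atoms)
    mask = (conn_res_id[:, np.newaxis] == res_id[np.newaxis, :]) & (
        conn_atom_name[:, np.newaxis] == atom_name[np.newaxis, :]
    )
    # The column of the first True value in each row is the atom index
    return mask, np.argmax(mask, axis=1)


mask_1, index_1 = find_atom_indices(conn_res_id_1, conn_name_1)
mask_2, index_2 = find_atom_indices(conn_res_id_2, conn_name_2)
# Each row holds the atom indices of the two bond partners
bonds = np.stack([index_1, index_2], axis=1)
```

With one bond per residue linkage, the number of rows in `mask_1` grows with the number of residues and the number of columns with the number of atoms, which is the quadratic growth described above.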

The solution in this PR is to write fewer inter-residue bonds to `struct_conn`: while the specification of the category is not very precise, backbone bonds between adjacent canonical amino acids or nucleotides can definitely be excluded from it. Filtering these out makes `struct_conn` much smaller and, more importantly, its size no longer scales with the number of atoms.
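A minimal sketch of such a filter is shown below. It is an illustration under simplifying assumptions, not the code from this PR: atoms are modelled as plain `(chain_id, res_id, res_name, atom_name)` tuples, adjacency is assumed to mean consecutive residue IDs in the same chain, and the function name `is_standard_backbone_bond()` is hypothetical.

```python
# Canonical residue names (three-letter amino acids, RNA and DNA nucleotides)
AMINO_ACIDS = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}
NUCLEOTIDES = {"A", "C", "G", "U", "DA", "DC", "DG", "DT"}


def is_standard_backbone_bond(atom_1, atom_2):
    """Check whether a bond is implied by the primary structure.

    Each atom is a (chain_id, res_id, res_name, atom_name) tuple.
    Assumption: adjacent residues have consecutive residue IDs.
    """
    # Try both orientations, as the bond partners may be given in any order
    for first, second in ((atom_1, atom_2), (atom_2, atom_1)):
        chain_1, res_id_1, res_name_1, name_1 = first
        chain_2, res_id_2, res_name_2, name_2 = second
        if chain_1 != chain_2 or res_id_2 != res_id_1 + 1:
            continue
        # Peptide bond between two canonical amino acids
        if (name_1, name_2) == ("C", "N") and {res_name_1, res_name_2} <= AMINO_ACIDS:
            return True
        # Phosphodiester bond between two canonical nucleotides
        if (name_1, name_2) == ("O3'", "P") and {res_name_1, res_name_2} <= NUCLEOTIDES:
            return True
    return False


bonds = [
    (("A", 1, "ALA", "C"), ("A", 2, "GLY", "N")),        # peptide bond
    (("A", 3, "CYS", "SG"), ("A", 10, "CYS", "SG")),     # disulfide bridge
    (("A", 5, "ASN", "ND2"), ("A", 101, "NAG", "C1")),   # glycosylation
    (("B", 2, "DG", "P"), ("B", 1, "DA", "O3'")),        # phosphodiester bond
]
# Only bonds that are not implied by the primary structure remain
kept = [bond for bond in bonds if not is_standard_backbone_bond(*bond)]
```

In this toy input only the disulfide bridge and the glycosylation bond are kept, so the number of written rows depends on the number of nonstandard linkages rather than on the chain length.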