This PR increases the speed at which PDBx files with structure and bonds written by Biotite can be read.
Currently, pdbx.set_structure(include_bonds=True) writes all inter-residue bonds to the struct_conn category to make sure that no required bond is missing, because the PDBx dictionary describes the bonds stored in this category rather vaguely:
Nonstandard residue linkage. The LINK records specify connectivity between residues that is not implied by the primary structure.
This means that the struct_conn category can become much larger than in the original file from the PDB.
When the struct_conn category is parsed again, each of its rows is matched against each row in atom_site. While this can be done in a fast, vectorized manner using (n_atoms x n_bonds) boolean matrices, it does not scale well when struct_conn is 'verbose' as described above: these matrices then have approximately the shape (n_atoms x n_residues), because there is one bond for each residue linkage. As n_residues is approximately proportional to n_atoms, the size of the boolean matrices, and thus the time complexity, becomes O(n^2). Therefore, the time for parsing inter-residue bonds explodes for larger structures.
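The matching step described above can be illustrated with a minimal NumPy sketch. This is not Biotite's actual parsing code: the function name match_bonds_to_atoms and the string keys standing in for atom_site and struct_conn rows are assumptions made purely for illustration. It only shows why the intermediate boolean matrix has the shape (n_atoms x n_bonds).

```python
import numpy as np

def match_bonds_to_atoms(atom_keys, bond_keys):
    """Find, for each bond partner key, the index of the matching atom.

    Hypothetical helper: each key is a string that identifies an atom.
    """
    # Broadcasting compares every atom with every bond partner,
    # giving a boolean matrix of shape (n_atoms, n_bonds)
    matches = atom_keys[:, np.newaxis] == bond_keys[np.newaxis, :]
    # The first True in each column is the index of the matching atom
    return np.argmax(matches, axis=0), matches.shape

# Illustrative keys in the form chain:residue:atom (an assumption)
atom_keys = np.array(["A:1:N", "A:1:CA", "A:1:C", "A:2:N", "A:2:CA", "A:2:C"])
bond_keys = np.array(["A:1:C", "A:2:N"])
indices, shape = match_bonds_to_atoms(atom_keys, bond_keys)
```

If every residue linkage contributes a bond, the number of columns grows together with the number of rows, so both memory and run time of this comparison grow quadratically.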
The solution of this PR is to write fewer inter-residue bonds to struct_conn: while the specification of the category is not very precise, backbone bonds between adjacent canonical amino acids/nucleotides can definitely be excluded from the category. Filtering these out makes struct_conn much smaller and, more importantly, its size no longer scales with the number of atoms.
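A rough sketch of such a filter is shown below. It is not the implementation in this PR: the bond representation (a pair of (residue index, residue name, atom name) tuples with the lower residue first), the residue name sets and the function names are all assumptions chosen for illustration, and chain boundaries are ignored for brevity.

```python
# Assumed residue name sets, for illustration only
CANONICAL_AMINO_ACIDS = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}
CANONICAL_NUCLEOTIDES = {"A", "C", "G", "U", "DA", "DC", "DG", "DT"}

def is_implied_backbone_bond(bond):
    """Check whether a bond is already implied by the primary structure."""
    (res_i, name_i, atom_i), (res_j, name_j, atom_j) = bond
    # Only bonds between directly adjacent residues can be implied
    if res_j - res_i != 1:
        return False
    if name_i in CANONICAL_AMINO_ACIDS and name_j in CANONICAL_AMINO_ACIDS:
        # Peptide bond
        return (atom_i, atom_j) == ("C", "N")
    if name_i in CANONICAL_NUCLEOTIDES and name_j in CANONICAL_NUCLEOTIDES:
        # Phosphodiester bond
        return (atom_i, atom_j) == ("O3'", "P")
    return False

def filter_bonds_to_write(bonds):
    """Keep only the bonds that would still be written explicitly."""
    return [bond for bond in bonds if not is_implied_backbone_bond(bond)]

bonds = [
    ((0, "ALA", "C"), (1, "GLY", "N")),     # canonical peptide bond
    ((1, "GLY", "C"), (2, "MSE", "N")),     # linkage to a non-canonical residue
    ((3, "CYS", "SG"), (10, "CYS", "SG")),  # disulfide bridge
]
kept = filter_bonds_to_write(bonds)
```

In this example only the canonical peptide bond is dropped, while the linkage to the non-canonical residue and the disulfide bridge are kept, so the number of remaining bonds depends on the number of unusual linkages rather than on the number of residues.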