Closed rwxayheee closed 3 weeks ago
This is awesome
I will merge a little later today, after some visual inspection. Thanks for the approval! ^^
Some additional notes for future reference:
Since bond order and formal charge can't be parsed from an Amber OFF lib, it's not guaranteed that the SMILES strings in the chemical templates are in the preferred/intended resonance form. For example, Amber residue 1MA: https://github.com/forlilab/Meeko/blob/ae23f949c642a17291972e70f779ea3d4ebfe8c3/meeko/data/residue_chem_templates.json#L1304-L1307
When guessing a conjugated bond system, we consider a connected graph of atoms that still need valence without a change in formal charge, and the double bonds are first placed on the longest Eulerian path with an even number of edges. This is impossible when there are more than 2 odd (1-degree) nodes. 1MA is an example that contains a subgraph with 3 odd nodes. The current compromise strategy is to remove the leaf node closest to a high-degree node. In 1MA, the valence of the removed node is compensated by increasing the order of its bond to the nitrogen in -NH2 and raising that nitrogen's formal charge. This process doesn't pick a particular nitrogen.
In short, it's understood that the SMILES could sometimes be written with a different resonance form. But this doesn't seem to really affect the matching we do in Meeko.
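To make the Eulerian-path constraint above concrete, here is a minimal sketch of the textbook check: a connected undirected graph has an Eulerian path only if it has zero or two odd-degree nodes. This is not the code in chemtempgen; the function names and the toy atom labels are assumptions made for illustration, and the branched example is a generic three-leaf graph, not the actual 1MA subgraph.

```python
from collections import deque


def odd_degree_nodes(adj):
    """Return the nodes with an odd number of neighbours, sorted by name."""
    return sorted(n for n, nbrs in adj.items() if len(nbrs) % 2 == 1)


def is_connected(adj):
    """Breadth-first search from an arbitrary node; True if all nodes are reached."""
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return len(seen) == len(adj)


def has_eulerian_path(adj):
    """Textbook condition: connected, with exactly 0 or 2 odd-degree nodes."""
    return is_connected(adj) and len(odd_degree_nodes(adj)) in (0, 2)


# Hypothetical linear chain of four atoms: two leaves, so a path exists.
chain = {"C1": {"C2"}, "C2": {"C1", "C3"}, "C3": {"C2", "C4"}, "C4": {"C3"}}

# Hypothetical branched subgraph: one centre bonded to three leaves.
branched = {"C": {"N1", "N2", "N3"}, "N1": {"C"}, "N2": {"C"}, "N3": {"C"}}

print(has_eulerian_path(chain))
print(has_eulerian_path(branched))
```

In the branched case no single path can cover every edge, which is the situation where a fallback such as removing one leaf becomes necessary.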
This is for #210. It only adds chemical templates and does not change the matching of existing residue names. In this PR, 107 new ambiguous residue names and 567 unique templates are added to the default chemical template file, residue_chem_templates.json. It's also possible to distribute the new templates by libraries, or as a separate file.
The technical details of the additional templates are as follows:
Disambiguation
The purpose of putting the additional templates into the default template file is disambiguation in case of conflicts between Amber residue names and CCD names. All possible matches and their variants are registered under the same parent (an ambiguous residue name).
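As a rough illustration of the parent/variant idea, the sketch below maps one ambiguous residue name to the list of templates to try. The dictionary layout and the helper name are assumptions for illustration only; they are not copied from residue_chem_templates.json, and the variant names simply follow the NLE suffix examples in this description.

```python
# Hypothetical layout: one ambiguous parent name mapped to its candidate templates.
ambiguous = {
    "NLE": ["NLE", "NLE_N", "NLE_C", "NLE_fl-ccd", "NLE_C-ccd"],
}


def candidate_templates(residue_name, ambiguous_map):
    """Return every template name to try for an input residue name.

    Names that are not registered as ambiguous map to themselves.
    """
    return ambiguous_map.get(residue_name, [residue_name])


print(candidate_templates("NLE", ambiguous))
print(candidate_templates("ALA", ambiguous))
```

Keeping all candidates under one parent means the matcher can decide between the Amber and CCD variants from the actual atoms present, rather than from the residue name alone.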
Source of additional residues
The following Amber OFF lib files in Amber24 were picked as the source of additional residues:
The outcomes are combined without overwriting: when a residue name has already been added, later residues with the same name are skipped. 14 residues with unsupported elements and corrupt atomic numbers were discarded.
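The non-overwriting combination can be sketched as a first-seen-wins merge. The function name, the source labels, and the placeholder template values below are all hypothetical; the real lib file names and template objects are not reproduced here.

```python
def merge_without_overwrite(sources):
    """Merge residue dictionaries in order; the first occurrence of a name wins.

    `sources` is a list of (source_label, {residue_name: template}) pairs.
    Returns the merged dictionary and a list of skipped (source_label, name) pairs.
    """
    merged = {}
    skipped = []
    for label, residues in sources:
        for name, template in residues.items():
            if name in merged:
                # Duplicate name: keep the earlier template, record the skip.
                skipped.append((label, name))
                continue
            merged[name] = template
    return merged, skipped


# Hypothetical inputs standing in for two parsed lib files.
sources = [
    ("lib_a", {"NLE": "template from lib_a", "ABC": "template from lib_a"}),
    ("lib_b", {"NLE": "template from lib_b", "XYZ": "template from lib_b"}),
]

merged, skipped = merge_without_overwrite(sources)
print(sorted(merged))
print(skipped)
```

With this scheme the order in which the lib files are processed determines which duplicate survives, so the order is worth recording alongside the output.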
Processing of Amber residues
The lib files are parsed with ParmEd, and the RDKit molecule is created by a constructor in chemtempgen. It should be noted that Amber OFF lib is not the ideal file type for generating a residue's SMILES, as valence and formal charge are not available from the files. A graph-based method is used to guess double bonds, with some hints from atom types. In contrast to the processing of chemical components from CCD, no deprotonation is performed when processing Amber residues.
Suffix explained
NLE: Residue NLE from Amber lib, forged into the embedded linking fragment
NLE_N: Residue NLE from Amber lib, as an N-term residue
NLE_C: Residue NLE from Amber lib, as a C-term residue. This variant usually has C(=O)[H], because residues from Amber usually do not have a full carboxylate/phosphate group.
NLE_fl-ccd: Residue NLE from CCD, in the free-ligand form. No embedding was made
NLE_C-ccd: Residue NLE from CCD, as a C-term residue. This variant usually has C(=O)[O-]
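The naming scheme above can be decoded mechanically. The helper below is a hypothetical sketch, not part of Meeko; it assumes that base residue names never contain an underscore and that the only source suffix is -ccd, which holds for the five NLE examples but may not hold in general.

```python
def parse_template_name(name):
    """Split a template name into base residue, form, and source.

    Assumed convention: an optional "-ccd" suffix marks a CCD-derived template,
    and an optional "_X" part marks the form (N, C, fl). No "_X" part means the
    embedded linking fragment.
    """
    source = "amber"
    if name.endswith("-ccd"):
        source = "ccd"
        name = name[: -len("-ccd")]
    base, sep, variant = name.partition("_")
    form = variant if sep else "embedded"
    return {"residue": base, "form": form, "source": source}


for template_name in ["NLE", "NLE_N", "NLE_C", "NLE_fl-ccd", "NLE_C-ccd"]:
    print(template_name, parse_template_name(template_name))
```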