Open Croydon-Brixton opened 2 months ago
For reference this would be the corrected structure:
Hey @Croydon-Brixton, thanks for using datamol.
Your specific question is a bit tricky. There are 2 reasons why the "fixing" fails on your molecule:
import datamol as dm
from rdkit.Chem.MolStandardize import rdMolStandardize
vm = rdMolStandardize.RDKitValidation()
smi = "c1cc(c[n](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"
mol = dm.to_mol(smi, sanitize=False)
vm.validate(mol) # this is empty, so nothing technically wrong with the molecule in terms of connections.
In other words, the issue comes from RDKit's aromaticity perception and the kekulization failure that follows from it.
from rdkit import Chem
smi = "c1cc(c[n](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"
mol = dm.to_mol(smi, sanitize=False)
mol.UpdatePropertyCache(strict=False)
Chem.KekulizeIfPossible(mol) # Can't kekulize mol. Unkekulized atoms: 0 1 2 3 5
# which is the pyridine ring: atom 4 (the uncharged nitrogen) makes it impossible to assign double bonds.
datamol depends on the perceived current valence of the atoms versus the number of connections each atom can theoretically make. However, when loading the molecule, the valence returned for atom 4 is:
Therefore the algorithm skips over the atom, since everything looks fine. This is a direct consequence of each "aromatic" bond being perceived as a single connection at this point.
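To see what RDKit reports for that atom, a minimal inspection sketch with plain RDKit follows. It is not datamol's internal check. It only loads the same SMILES without sanitization and prints the properties a valence-based fix would look at; the choice of `GetTotalValence` as the quantity to print is an assumption made for illustration.

```python
from rdkit import Chem

smi = "c1cc(c[n](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"

# Parse without sanitization so the problematic ring does not raise.
mol = Chem.MolFromSmiles(smi, sanitize=False)

# Compute valences leniently; strict=False avoids valence exceptions.
mol.UpdatePropertyCache(strict=False)

atom = mol.GetAtomWithIdx(4)
print("symbol:        ", atom.GetSymbol())
print("formal charge: ", atom.GetFormalCharge())
print("degree:        ", atom.GetDegree())        # number of heavy-atom neighbours
print("total valence: ", atom.GetTotalValence())  # as perceived before sanitization
print("aromatic flag: ", atom.GetIsAromatic())
```

Comparing the printed valence with the degree shows whether the atom looks "over-connected" to a valence-based check at this stage.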
A naive fix for your specific case would be:
import datamol as dm
rxn = "[#7;X3;H0;r:1]>>[n+:1]"
rxn = dm.reactions.rxn_from_smarts(rxn)
smi = "c1cc(c[n](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"
mol = dm.to_mol(smi, sanitize=False)
mol = dm.sanitize_first(dm.reactions.apply_reaction(rxn, (mol,), product_index=0))
dm.to_image(mol, mol_size=(400, 200), indices=True, use_svg=False)
I doubt however that this solves the main issue. If you can share your goal and how you load the original structure, I am sure there are better and more systematic approaches (including perhaps not using RDKit here) that I can point you towards.
Thank you for the quick and detailed answer, @maclandrol!
Yes, I'm looking for a solution for this general problem:
Problem statement: Given a molecule with the following bits of information: (1) heavy atoms (but no hydrogens, these are implicit), (2) bonded structure (single/double/triple/aromatic), is there a way to infer the formal charges and valence states that make it 'valid' (= pass RDKit sanitization) whilst preserving (1) and (2)?
The reason I am asking is that when retrieving ligands from the PDB the most reliable bits of information are the bonded structures of heavy atoms and the hybridization (which translates into the single/double/... flags), but formal charge is often entirely unspecified. I would like to have a way to turn these molecules into valid ones while preserving this information. Does that make sense?
I would need to do this programmatically, as it will apply to many structures. I had a look at ChEMBL's pipeline, but it was not able to do the above task either for the example I gave.
Thank you for your input!
Yeah, the SMARTS patterns they have do not cover your case:
Quaternary N [N;X4;v4;+0:1]>>[*+1:1]
is the closest, but unfortunately the N atom does not have "4 visible" connections for RDKit.
If you are loading from PDB and PDB only, then you should consider this:
Alternatively,
This would probably be something useful for the community so I can definitely help implement this.
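As a possible starting point for that discussion, here is a minimal pure-Python sketch of the general idea. It is not datamol's or RDKit's algorithm. It assumes every bond already carries an integer order (aromatic rings written in Kekulé form), and the names `NEUTRAL_VALENCE`, `CHARGE_WHEN_ONE_OVER`, and `infer_formal_charges` are hypothetical. An atom whose bond-order sum exceeds its usual neutral valence by exactly one receives the charge that makes that valence legal.

```python
from collections import defaultdict

# Usual neutral valence per element (deliberately tiny, illustrative table).
NEUTRAL_VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2}
# Charge that legalises exactly one bond more than the neutral valence.
CHARGE_WHEN_ONE_OVER = {"B": -1, "N": +1, "O": +1}


def infer_formal_charges(elements, bonds):
    """Return one formal charge per heavy atom.

    elements: list of element symbols, indexed by atom.
    bonds: list of (i, j, order) tuples with integer bond orders.
    """
    bond_order_sum = defaultdict(int)
    for i, j, order in bonds:
        bond_order_sum[i] += order
        bond_order_sum[j] += order

    charges = []
    for idx, element in enumerate(elements):
        total = bond_order_sum[idx]
        # Unknown elements are left untouched (excess treated as zero).
        excess = total - NEUTRAL_VALENCE.get(element, total)
        if excess == 1 and element in CHARGE_WHEN_ONE_OVER:
            charges.append(CHARGE_WHEN_ONE_OVER[element])
        else:
            charges.append(0)
    return charges


# N-methylpyridinium heavy atoms: ring N (0), ring C (1-5), methyl C (6).
elements = ["N", "C", "C", "C", "C", "C", "C"]
bonds = [(0, 1, 2), (1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 0, 1), (0, 6, 1)]
charges = infer_formal_charges(elements, bonds)
print(charges)
```

Because hydrogens are implicit, this only detects atoms with too many bonds. An atom with fewer bonds than usual is indistinguishable from one carrying hydrogens, so negative charges on O or N cannot be recovered this way.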
Thank you for this nice library!
I have a question about fixing 'broken' Mols by inferring the correct valences and charges, which I was hoping datamol could do for me.
If I load NAP structures from examples in the PDB (e.g. 5ocm) and simply transfer over bond annotations and atoms (formal charge is not specified in this PDB entry, so I'm assuming 0 charge), I end up with a structure like this:
RDKit then fails to load this due to sanitization problems.
This molecule can be 'rescued' by assigning a positive charge to nitrogen number 4, but the datamol pipeline unfortunately fails to do this:
Is there a way to fix this structure computationally with datamol?