datamol-io / datamol

Molecular Processing Made Easy.
https://docs.datamol.io
Apache License 2.0
475 stars 51 forks source link

Fixing does not appear to work for inferring valence & formal charge states from molecules from some PDB files #231

Open Croydon-Brixton opened 2 months ago

Croydon-Brixton commented 2 months ago

Thank you for this nice library!

I'm have a question re fixing 'broken' Mols by inferring the correct valences and charges that I was hoping datamol could fix for me.

If I load NAP structures from examples in the pdb (e.g. 5ocm) and simply transfer over bond annotations and atoms (formal charge is not specified in this PDB, so I'm assuming 0 charge) I end up with a structure like this:

smi = "c1cc(c[n](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"

# The correct smiles would be:
smi_correct = "c1cc(c[n+](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O-])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"

Screenshot 2024-09-02 at 19 19 10

RDkit then fails to load this due to sanitization problems

mol = Chem.MolFromSmiles(smi)  # < fails
mol = Chem.MolFromSmiles(smi, sanitize=False)  # <works and produces the structure above, which is an invalid molecule

This molecule can be 'rescued' by assigning a positive charge to nitrogen number 4, but the datamol pipeline unfortunately fails to do this:

import datamol as dm

# Standardize and sanitize
mol = Chem.MolFromSmiles(smi, sanitize=False)
mol = dm.fix_mol(mol)
mol = dm.sanitize_mol(mol)
mol = dm.standardize_mol(mol)
Chem.SanitizeMol(mol)

Is there a way to fix this structure computationally with datamol?

Croydon-Brixton commented 2 months ago

For reference this would be the corrected structure: Screenshot 2024-09-02 at 19 24 03

maclandrol commented 2 months ago

Hey @Croydon-Brixton, thanks for using datamol.

Your specific question is a bit tricky. There are 2 reasons why the "fixing" fail on your molecule:

  1. RDKit Validation says the molecule is fundamentally fine, while sanitization fails. In fact the only reason why the molecule fails to parse is because of kekulization.
import datamol as dm
from rdkit.Chem.MolStandardize import rdMolStandardize

vm = rdMolStandardize.RDKitValidation()
smi = "c1cc(c[n](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"
mol = dm.to_mol(smi, sanitize=False)
vm.validate(mol) # this is empty, so nothing technically wrong with the molecule in terms of connections. 

In other words, the issue is because of the aromaticity perception of RDKit and kekulization failing.

smi = "c1cc(c[n](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"
mol = dm.to_mol(smi, sanitize=False)
mol.UpdatePropertyCache(strict=False)
Chem.KekulizeIfPossible(mol) #  Can't kekulize mol.  Unkekulized atoms: 0 1 2 3 5
# which is the pyridine rings, as atom 4 (uncharged Nitrogen) makes it impossible to assign double bonds. 
  1. Charge correction as done by datamol depends on the perceived current valence of the atoms vs the number of connections the atom can theoretically make. However when loading the molecules, the valence returned for atom 4 is:

Therefore the algorithm skip over the atom, since everything looks fine, this is a direct consequence of each "aromatic" bond being perceived as a single connection at this point.

A naive fix for your specific case would be:

import datamol as dm
rxn = "[#7;X3;H0;r:1]>>[n+:1]"
rxn  = dm.reactions.rxn_from_smarts(rxn)
smi = "c1cc(c[n](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"
mol = dm.to_mol(smi, sanitize=False)
mol = dm.sanitize_first(dm.reactions.apply_reaction(rxn, (mol,), product_index=0))
dm.to_image(mol, mol_size=(400, 200), indices=True, use_svg=False)

image

I doubt however that this solves the main issue. If you can share your goal and how you load the original structure, I am sure there are better and more systematic approaches (including perhaps not using RDKit here) that I can point you towards.

Croydon-Brixton commented 2 months ago

Thank you for the quick and detailed answer @maclandrol !

Yes, I'm looking for a solution for this general problem:

Problem statement: Given a molecule with the following bits of information: (1) heavy atoms (but no hydrogens, these are implicit), (2) bonded structure (single/double/triple/aromatic) is there a way to infer the formal charges and valence states that make it 'valid' (= pass RDKit sanitization) whilst preserving (1) and (2)?

The reason I am asking is that when retrieving ligands from the PDB the most reliable bits of information are the bonded structures of heavy atoms and the hybridization (which translates into the single/double/... flags), but formal charge is often entirely unspecified. I would like to have a way to turn these molecules into valid ones while preserving this information. Does that make sense?

I would need to do this programatically, as it will apply to many structures. I had a look at ChEMBL's pipeline, but this was not able to do the above task either for the example I gave.

Thank you for your input!

maclandrol commented 2 months ago

Yeah, the SMARTS patterns they have does not cover your case:

Quaternary N [N;X4;v4;+0:1]>>[*+1:1] is the closest, but unfortunately the N atom does not have "4 visible" connections for RDKit.

If you are loading from PDB and PDB only, then you should consider this:

  1. Try: https://github.com/jensengroup/xyz2mol to check if you could leverage that.

Alternatively,

  1. Fetch PDB
  2. Separate ligand from target (you need some flexible logic here)
  3. Parse ligand name or ID then make a request to ligand expo to get information about the ligand and its structure. Note, people mis-annotate ligand names all the time in RCSB however.
  4. Align the ligand PDB template to the structure provided by ligand expo
  5. Get the ligand structure and conformation.

This would probably be something useful for the community so I can definitely help implement this.