choderalab / espaloma

Extensible Surrogate Potential of Ab initio Learned and Optimized by Message-passing Algorithm 🍹https://arxiv.org/abs/2010.01196
https://docs.espaloma.org/en/latest/
MIT License
212 stars 23 forks source link

Node featurization does not properly assign hybridization using OpenEye toolkit #143

Closed kntkb closed 1 year ago

kntkb commented 1 year ago

Problem

  1. Current implementation does not properly assign atom hybridization using openeye toolkit. I think you need to first run oechem.OEAssignHybridization() to assign the atom hybridization. However, I could not find it in read_homogeneous_graph.py where I assume is responsible for creating the node features.

  2. Regarding the discussions in espaloma_charge issue #18 should we exclude atom.GetValence(), atom.GetExplicitValence(), and perhaps atom.GetFormalCharge() from the input features? Also, do we need atom.GetIsotope()?

  3. RDKit and OpenEye toolkit gives different node features. Also the HybridizationType in RDKit is not fully supported which will raise an error if hydrogen atom is passed.

Reproduce (atom hybridization):

from rdkit import Chem
from openeye import oechem

HYBRIDIZATION_RDKIT = {
    Chem.rdchem.HybridizationType.SP: [1, 0, 0, 0, 0],
    Chem.rdchem.HybridizationType.SP2: [0, 1, 0, 0, 0],
    Chem.rdchem.HybridizationType.SP3: [0, 0, 1, 0, 0],
    Chem.rdchem.HybridizationType.SP3D: [0, 0, 0, 1, 0],
    Chem.rdchem.HybridizationType.SP3D2: [0, 0, 0, 0, 1],
    Chem.rdchem.HybridizationType.S: [0, 0, 0, 0, 0],
    Chem.rdchem.HybridizationType.UNSPECIFIED: [0, 0, 0, 0, 0],
    Chem.rdchem.HybridizationType.OTHER: [0, 0, 0, 0, 0],
}

HYBRIDIZATION_OE = {
    oechem.OEHybridization_sp: [1, 0, 0, 0, 0],
    oechem.OEHybridization_sp2: [0, 1, 0, 0, 0], 
    oechem.OEHybridization_sp3: [0, 0, 1, 0, 0],
    oechem.OEHybridization_sp3d: [0, 0, 0, 1, 0],
    oechem.OEHybridization_sp3d2: [0, 0, 0, 0, 1],
    oechem.OEHybridization_Unknown: [0, 0, 0, 0, 0],
}

# define smiles and add hydrogen
smi = '[O-]C(c1ccccc1)=O'
mol = Chem.MolFromSmiles(smi)
mol = Chem.AddHs(mol)
smi = Chem.MolToSmiles(mol)
print('# Smiles with hydrogen')
print(smi)
print('---')

# rdkit
print('# RDKIT')
rdmol = mol
for atom in rdmol.GetAtoms():
    print(atom.GetAtomicNum(), atom.GetHybridization(), HYBRIDIZATION_RDKIT[atom.GetHybridization()])
print('---')

# openeye: w/o hybridization assignment
oemol = oechem.OEGraphMol()
oechem.OESmilesToMol(oemol, smi)
print('# OpenEye (w/o hybridization assignment)')
for atom in oemol.GetAtoms():
    print(atom.GetAtomicNum(), atom.GetHyb(), HYBRIDIZATION_OE[atom.GetHyb()])
print('---')

# openeye: w/ assign hybridization
oechem.OEAssignHybridization(oemol)
print('# OpenEye: w hybridization assignment')
for atom in oemol.GetAtoms():
    print(atom.GetAtomicNum(), atom.GetHyb(), HYBRIDIZATION_OE[atom.GetHyb()])

output:

# Smiles with hydrogen
[H]c1c([H])c([H])c(C(=O)[O-])c([H])c1[H]
---
# RDKIT
8 SP2 [0, 1, 0, 0, 0]
6 SP2 [0, 1, 0, 0, 0]
6 SP2 [0, 1, 0, 0, 0]
6 SP2 [0, 1, 0, 0, 0]
6 SP2 [0, 1, 0, 0, 0]
6 SP2 [0, 1, 0, 0, 0]
6 SP2 [0, 1, 0, 0, 0]
6 SP2 [0, 1, 0, 0, 0]
8 SP2 [0, 1, 0, 0, 0]
1 UNSPECIFIED [0, 0, 0, 0, 0]
1 UNSPECIFIED [0, 0, 0, 0, 0]
1 UNSPECIFIED [0, 0, 0, 0, 0]
1 UNSPECIFIED [0, 0, 0, 0, 0]
1 UNSPECIFIED [0, 0, 0, 0, 0]
---
# OpenEye (w/o hybridization assignment)
1 0 [0, 0, 0, 0, 0]
6 0 [0, 0, 0, 0, 0]
6 0 [0, 0, 0, 0, 0]
1 0 [0, 0, 0, 0, 0]
6 0 [0, 0, 0, 0, 0]
1 0 [0, 0, 0, 0, 0]
6 0 [0, 0, 0, 0, 0]
6 0 [0, 0, 0, 0, 0]
8 0 [0, 0, 0, 0, 0]
8 0 [0, 0, 0, 0, 0]
6 0 [0, 0, 0, 0, 0]
1 0 [0, 0, 0, 0, 0]
6 0 [0, 0, 0, 0, 0]
1 0 [0, 0, 0, 0, 0]
---
# OpenEye: w hybridization assignment
1 0 [0, 0, 0, 0, 0]
6 2 [0, 1, 0, 0, 0]
6 2 [0, 1, 0, 0, 0]
1 0 [0, 0, 0, 0, 0]
6 2 [0, 1, 0, 0, 0]
1 0 [0, 0, 0, 0, 0]
6 2 [0, 1, 0, 0, 0]
6 2 [0, 1, 0, 0, 0]
8 2 [0, 1, 0, 0, 0]
8 0 [0, 0, 0, 0, 0]
6 2 [0, 1, 0, 0, 0]
1 0 [0, 0, 0, 0, 0]
6 2 [0, 1, 0, 0, 0]
1 0 [0, 0, 0, 0, 0]

rdkit version: '2022.03.4' openeye version: '2022.1.1'

kntkb commented 1 year ago

I think fixing the atom hybridization using OpenEye toolkit is straight forward. You just need to add oechem.OEAssignHybridization(mol) somewhere before creating the input features. For example, right after you initialize the graph (read_homogeneous_graph.py#L177) assuming that the mol in def from_oemol() is a oechem.OEMol object.

kntkb commented 1 year ago

I'm closing this issue because espaloma does not use openeye toolkit to create the input features. It uses rdkit in the backend.