connorcoley / rdchiral

Wrapper for RDKit's RunReactants to improve stereochemistry handling
MIT License
151 stars 50 forks source link

Wrong outcome of RunReaction depending on atom numbering #39

Open hesther opened 1 year ago

hesther commented 1 year ago

Hi, I just noticed that the outcome of RunReaction depends on the atom ordering of the input molecule for some edge cases.

For example, applying the template '[C:1]/[C:2](-[C;D1;H3:3])=[C:4]/[CH2;D2;+0:5]-[OH;D1;+0:6]>>[C:1]/[C:2](-[C;D1;H3:3])=[C:4]/[CH;D2;+0:5]=[O;H0;D1;+0:6]' (a simple reduction of an aldehyde to an alcohol next to a double bond, which does not change the double bond or its stereochemistry) to the molecule '[CH3:1][C:2]1=[C:3](/[CH:4]=[CH:5]/[C:6]([CH3:7])=[CH:8]/[CH:9]=[CH:10]/[C:11]([CH3:12])=[CH:13]/[CH2:14][OH:15])[C:16]([CH3:17])([CH3:18])[CH2:19][CH2:20][CH2:21]1' correctly produces 'CC1=C(/C=C/C(C)=C/C=C/C(C)=C/C=O)C(C)(C)CCC1'

while applying the same template to the molecule '[CH3:1][C:13](=[CH:28]\\[CH:48]=[CH:8]\\[C:34](=[CH:21]\\[CH2:49][OH:6])[CH3:35])/[CH:37]=[CH:15]/[C:29]1=[C:2]([CH3:44])[CH2:11][CH2:16][CH2:41][C:39]1([CH3:12])[CH3:24]' (exactly the same as above, only different atom numbering) wrongly produces 'CC1=C(/C=C/C(C)=C/C=C/C(C)=C\\C=O)C(C)(C)CCC1' (i.e. flipped a trans to a cis bond)

Code to reproduce:

import random
from rdkit import Chem
from rdchiral.main import rdchiralRun, rdchiralReaction, rdchiralReactants

smi = '[CH3:1][C:2]1=[C:3](/[CH:4]=[CH:5]/[C:6]([CH3:7])=[CH:8]/[CH:9]=[CH:10]/[C:11]([CH3:12])=[CH:13]/[CH2:14][OH:15])[C:16]([CH3:17])([CH3:18])[CH2:19][CH2:20][CH2:21]1'
template = '[C:1]/[C:2](-[C;D1;H3:3])=[C:4]/[CH2;D2;+0:5]-[OH;D1;+0:6]>>[C:1]/[C:2](-[C;D1;H3:3])=[C:4]/[CH;D2;+0:5]=[O;H0;D1;+0:6]'

outcomes = rdchiralRun(rdchiralReaction(template), rdchiralReactants(smi), combine_enantiomers=False)
print("Case 1, right outcome:",outcomes)

#Change atom map numbers (but not map numbers 0)
mol = Chem.MolFromSmiles(smi)
x=[x for x in range(1,50)]
random.Random(0).shuffle(x)
x=[0]+x
for a in mol.GetAtoms():
    a.SetAtomMapNum(x[a.GetAtomMapNum()])
smi=Chem.MolToSmiles(mol)

outcomes = rdchiralRun(rdchiralReaction(template), rdchiralReactants(smi), combine_enantiomers=False)
print("Case 2, wrong outcome:",outcomes)

Is there any way this could be fixed in the near future? (Am relying on rdchiral for a big project)

connorcoley commented 1 year ago

That is a very interesting edge case, and an unfortunate bug. It's not immediately clear to me why this would be the case since the bond direction map is built-up from the atom mapping (here). I don't believe RDKit would be making any assumptions or errors about atom map numbers and the directionality of the bond in terms of ENDDOWNRIGHT and ENDUPRIGHT flags. It seems like something must be wrong in my possible_defs definition that is only triggered when the atom mapping is "unusual".

I'm afraid that we don't have anyone actively working on RDChiral at the moment. I will tag @ljn917 in case this is something he encountered when working on the C++ version. My suggestion would otherwise be to drop in some debugging statements into the function that restores bond stereochemistry in that bonds.py file to start to narrow down where the bug is coming from

ljn917 commented 1 year ago

I confirm that the C++ version also has this bug. I encountered an rdkit bug where renumbering changes double bond directions, though I don't think they are related.

I tried to narrow it down a bit by simplifying the reactant structure, but find more interesting behaviors. Please see the script below. It was run with the C++ version. In summary, to trigger this bug, a second double bond (between mapno 9 and 10) must exist, and the atom 9 (mapno) must be substituted, no matter the substituents are symmetric or not.

Update: the two double bonds must be next to each other (in conjugation).

#!/usr/bin/env python3

import random

from rdkit import Chem
from rdchiral.main import rdchiralRun, rdchiralReaction, rdchiralReactants

random.seed(0)

def randomize_mapno(s: str):
    mol = Chem.MolFromSmiles(s)
    x=[x for x in range(1,50)]
    random.shuffle(x)
    x=[0]+x
    for a in mol.GetAtoms():
        a.SetAtomMapNum(x[a.GetAtomMapNum()])
    return Chem.MolToSmiles(mol)

# smi1 = r'[CH3:9][CH2:10]/[C:11]([CH3:12])=[CH:13]/[CH2:14][OH:15]' # case 5: correct behavior
# smi1 = r'[CH2:9]=[CH:10]/[C:11]([CH3:12])=[CH:13]/[CH2:14][OH:15]' # case 4: correct behavior
smi1 = r'[CH3:8]/[CH:9]=[CH:10]/[C:11]([CH3:12])=[CH:13]/[CH2:14][OH:15]' # case 3: incorrect behavior
#smi1 = r'[CH3:8][CH:9]=[CH:10]/[C:11]([CH3:12])=[CH:13]/[CH2:14][OH:15]' # case 3-1: incorrect behavior
# smi1 = r'[CH2:4]=[CH:5]/[C:6]([CH3:7])=[CH:8]/[CH:9]=[CH:10]/[C:11]([CH3:12])=[CH:13]/[CH2:14][OH:15]' # case 2: incorrect behavior
# smi1 = '[CH3:1][C:2]1=[C:3](/[CH:4]=[CH:5]/[C:6]([CH3:7])=[CH:8]/[CH:9]=[CH:10]/[C:11]([CH3:12])=[CH:13]/[CH2:14][OH:15])[C:16]([CH3:17])([CH3:18])[CH2:19][CH2:20][CH2:21]1' # case 1: incorrect behavior

template = r'[C:1]/[C:2](-[C;D1;H3:3])=[C:4]/[CH2;D2;+0:5]-[OH;D1;+0:6]>>[C:1]/[C:2](-[C;D1;H3:3])=[C:4]/[CH;D2;+0:5]=[O;H0;D1;+0:6]'

outcomes1 = rdchiralRun(rdchiralReaction(template), rdchiralReactants(smi1), combine_enantiomers=False)
print("right outcome:",outcomes1)

#smi2 = r'[CH3:48][CH2:8]\[C:34](=[CH:21]\[CH2:49][OH:6])[CH3:35]'
for cnt in range(100000):
    smi2 = randomize_mapno(smi1)
    outcomes2 = rdchiralRun(rdchiralReaction(template), rdchiralReactants(smi2), combine_enantiomers=False)
    # print(smi2, outcomes2[0])
    if outcomes1[0] != outcomes2[0]:
        print(f'wrong, smi2={smi2}, outcomes2={outcomes2[0]}')
    else:
        # print('correct')
        pass
hesther commented 1 year ago

That's interesting ... I will run a couple of tests tonight, let's see whether we can narrow down which part of the code it comes from (shouldn't be too hard with a couple of print statements)

hesther commented 1 year ago

Update: Identified the issue. If a template specifies one or multiple double bonds that are in a conjugated system and reverses a bond direction (which happens based on the atom order which in turn depends on the map number) we do not reverse all other bond directions in the conjugated system (which would be the correct thing to do). So the copied over bond directions in unchanged parts of the molecule might not match the bond directions set by the template if the system is conjugated. I am now working on a fix (check whether a conjugated system exists and whether changes were made to any bond, if so, correct all bond directions). Will post a PR in the next days

connorcoley commented 1 year ago

Very interesting & good catch! While the test cases are not that exhaustive, would you also be able to add this example into the list?