connorcoley / rexgen_direct

Template-free prediction of organic reaction outcomes
GNU General Public License v3.0
150 stars, 68 forks

Atom-mapping from Lowe's Data #10

Closed WesleyyC closed 4 years ago

WesleyyC commented 4 years ago

Hi, I was looking at the original data from Daniel Lowe, but realized that the data you folks processed has a full atom mapping rather than a mapping only for the atoms in the product. Could you share the code for this preprocessing? Thanks!

connorcoley commented 4 years ago

The "extra" atom map numbers are just unique numbers that make sure a map number is defined for every atom; the numbering itself has no meaning. A code snippet like the following would suffice:

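import rdkit.Chem as Chem
from itertools import chain

def complete_mapping(rxn_smi):
    # Split 'reactants>agents>products'; the agent field may be empty
    r_smi, agent, p_smi = rxn_smi.split('>')
    r = Chem.MolFromSmiles(r_smi)
    p = Chem.MolFromSmiles(p_smi)
    # MolFromSmiles returns None when RDKit cannot parse the input
    if r is None or p is None:
        raise ValueError('Could not parse reaction SMILES: ' + rxn_smi)
    # Start numbering above the largest existing map number
    max_map = max(a.GetAtomMapNum() for a in chain(r.GetAtoms(), p.GetAtoms()))
    for a in chain(r.GetAtoms(), p.GetAtoms()):
        # GetAtomMapNum() is 0 for an unmapped atom
        if not a.GetAtomMapNum():
            max_map += 1
            a.SetAtomMapNum(max_map)
    return '>'.join((Chem.MolToSmiles(r), agent, Chem.MolToSmiles(p)))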

complete_mapping('[CH3:1][OH:2].[CH3:3][CH2:4]Cl>>[CH3:1][O:2][CH2:4][CH3:3]')
# returns [CH3:1][OH:2].[CH3:3][CH2:4][Cl:5]>>[CH3:1][O:2][CH2:4][CH3:3]

WesleyyC commented 4 years ago

Got it, thanks!

ahseena96 commented 4 years ago

@connorcoley : Any way we could know the source of data, w.r.t. the patent info, for the reactions you used? As in, which year and patent number? We have used the USPTO data for one of our projects - trying to map these reactions to the ones we used (not a simple reaction smiles match, as the atom mappings have changed)

connorcoley commented 4 years ago

Unfortunately that information wasn’t carried through the pipeline. Would it be possible to do that comparison with the atom mapping stripped?
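
A minimal sketch of what comparing "with the atom mapping stripped" could look like, assuming RDKit is available. The helper name strip_atom_maps is hypothetical and not part of rexgen_direct; it clears every atom map number and rewrites each side of the reaction as canonical SMILES, so two records of the same reaction with different numbering reduce to the same string. It does not account for differences in reagents, stereochemistry, or salts between the two datasets.

```python
from rdkit import Chem

def _canonical_unmapped(smi):
    """Canonical SMILES of `smi` with all atom map numbers cleared."""
    if not smi:
        return ''  # e.g. an empty agent field in 'A>>B'
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        raise ValueError('Could not parse SMILES: ' + smi)
    for atom in mol.GetAtoms():
        atom.SetAtomMapNum(0)  # 0 means "no atom map"
    # Round-trip through SMILES once so the result does not depend on
    # how the mapped input happened to be written.
    unmapped = Chem.MolFromSmiles(Chem.MolToSmiles(mol))
    if unmapped is None:
        raise ValueError('Could not re-parse SMILES derived from: ' + smi)
    return Chem.MolToSmiles(unmapped)

def strip_atom_maps(rxn_smi):
    """Hypothetical helper: map-free, canonical form of a reaction SMILES."""
    parts = rxn_smi.split('>')
    if len(parts) != 3:
        raise ValueError('Expected reactants>agents>products: ' + rxn_smi)
    return '>'.join(_canonical_unmapped(part) for part in parts)

# Two mappings of the same reaction should reduce to the same key.
a = '[CH3:1][OH:2].[CH3:3][CH2:4]Cl>>[CH3:1][O:2][CH2:4][CH3:3]'
b = '[CH3:7][OH:9].[CH3:2][CH2:5][Cl:11]>>[CH3:7][O:9][CH2:5][CH3:2]'
print(strip_atom_maps(a) == strip_atom_maps(b))
```

The stripped string can then serve as a dictionary key for joining the two reaction sets.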

ahseena96 commented 4 years ago

Maybe - will try that out. Thank you.

YH-88 commented 3 years ago

Hi, I used the method you mentioned to complete the atom mapping so that all the atoms of the product are mapped. However, when I run it, I get:

raise Exception(smiles)
Exception: CH2:1[OH:40])[OH:39])[OH:38])[OH:12].NH:32[CH3:91])C:90CH:66[O:69]CH:70[CH:72]1[NH:73]C:74=[O:76])[CH3:77])=[O:78])=[O:99])C:82=[O:84])=[O:98])[CH2:93][CH2:94][CH2:95][CH2:96][NH2:97])=[O:92]

After debugging, I traced the cause to the code in rexgen_direct/core_wln_global/mol_graph (the part boxed in the attached image). Can you tell me what I should do about it? Thank you.
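
The image is not reproduced here, so the following is only a sketch under an assumption: that the failing check uses each reactant atom's map number (minus one) as an array index and therefore needs the reactant map numbers to be exactly 1..N for N reactant atoms. If that is the cause, unique but non-contiguous numbers would trigger the exception, and renumbering after completing the mapping may help. The function name renumber_contiguous is hypothetical and not part of rexgen_direct.

```python
from rdkit import Chem

def renumber_contiguous(rxn_smi):
    """Hypothetical helper: relabel atom maps so reactants use 1..N.

    Assumes every reactant atom already carries a unique map number
    (for example after complete_mapping) and that every mapped product
    atom also appears in the reactants.
    """
    r_smi, agent, p_smi = rxn_smi.split('>')
    r = Chem.MolFromSmiles(r_smi)
    p = Chem.MolFromSmiles(p_smi)
    if r is None or p is None:
        raise ValueError('Could not parse reaction SMILES: ' + rxn_smi)

    old_to_new = {}
    for atom in r.GetAtoms():
        old = atom.GetAtomMapNum()
        if old == 0:
            raise ValueError('Unmapped reactant atom; complete the mapping first')
        if old in old_to_new:
            raise ValueError('Duplicate reactant map number: %d' % old)
        old_to_new[old] = len(old_to_new) + 1  # next number in 1..N
        atom.SetAtomMapNum(old_to_new[old])

    for atom in p.GetAtoms():
        old = atom.GetAtomMapNum()
        if old not in old_to_new:
            raise ValueError('Product map number %d not found in reactants' % old)
        atom.SetAtomMapNum(old_to_new[old])

    return '>'.join((Chem.MolToSmiles(r), agent, Chem.MolToSmiles(p)))

# Example with gaps in the original numbering.
rxn = '[CH3:10][OH:20].[CH3:30][CH2:40][Cl:50]>>[CH3:10][O:20][CH2:40][CH3:30]'
out = renumber_contiguous(rxn)
reactants = Chem.MolFromSmiles(out.split('>')[0])
print(sorted(a.GetAtomMapNum() for a in reactants.GetAtoms()))
```

If the exception persists after renumbering, the assumption above is probably wrong and the check in mol_graph is the place to look.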