chembl / ChEMBL_Structure_Pipeline

ChEMBL database structure pipelines
MIT License
193 stars 38 forks source link

Different parents for same molecule with ring substituents #31

Closed conrad-stork closed 2 years ago

conrad-stork commented 3 years ago

Hi CSP Team (@greglandrum @eloyfelix @apbento),

first of all, a great thanks to your team for providing such a pipeline! I think this is a good step into the right direction to solve the mess in virtual libraries! I hope you will maintain and further develop this project. It is very useful!

I am not sure if I do something wrong, but for this little example with 3 same molecules, only two (the first and the third) get matched by the pipeline:

import chembl_structure_pipeline as csp
from rdkit.Chem import MolFromSmiles, MolToSmiles

def main():
    smis = [
            'c1c(C)c(C)ccc1',
            'c1cc(C)c(C)cc1',
            'c1ccc(C)c(C)c1']

    print('Input SMILES:')
    for smi in smis:
        print(smi)

    mols = [MolFromSmiles(smi) for smi in smis]
    std_mols = [csp.standardize_mol(mol) for mol in mols]
    par_mols = [csp.get_parent_mol(s_mol)[0] for s_mol in std_mols]
    par_smis = [MolToSmiles(p_mol) for p_mol in par_mols]

    print('Parent SMILES:')
    for p_smi in par_smis:
        print(p_smi)

if __name__ == '__main__':
    main()

Perhaps you can provide me some feedback to this example? I have several other examples that get not matched.

Best Conrad

lnrtk commented 2 years ago

Hi Conrad,

without looking deeper into the problem in this case a quick solution would be to canonicalize the smiles, e.g. via

par_smis = [MolToSmiles(MolFromSmiles(MolToSmiles(p_mol))) for p_mol in par_mols]

or something like:

mols = [MolFromSmiles(s) for s in smis]
mols = [MolToSmiles(m) for m in mols]
mols = [MolFromSmiles(s) for s in mols]

But I'm not sure if this should be the expected behaviour. Does this fix apply to your other problems or did you find a better way?

Best regards, Lennart

greglandrum commented 2 years ago

@conrad-stork : the problem here is that the molecules which come back from get_parent_mol() are not fully sanitized. If you sanitize the molecules before generating the SMILES things should work:

In [9]: for m in par_mols:
   ...:     Chem.SanitizeMol(m)
   ...:

In [10]: par_smis = [Chem.MolToSmiles(p_mol) for p_mol in par_mols]

In [11]: par_smis
Out[11]: ['Cc1ccccc1C', 'Cc1ccccc1C', 'Cc1ccccc1C']

@eloyfelix : we should think about adding a call to SanitizeMol() to the pipeline functions which return molecules.

conrad-stork commented 2 years ago

Hi,

thanks to @lnrtk and @greglandrum for the replies.

Both solutions (from Inrtk and greglandrum) are working. I will prefer the one from greglandrum.

From my point of view the issue is solved so I will close it. If you want to add the SanitizeMol greglandrum suggested, you can reopen it again. I would vote for it because I think it would make completely sense to include the "SanitizeMol feature".

Best Conrad

eloyfelix commented 2 years ago

Hi all,

sorry for the silence on this repo. We've been latelly quite involved on infrastructure works at EBI and we are also quite busy on the data side of things. I just want to reassure you that this is and will still be a core piece (hence mantained) of the ChEMBL database.

Thank you @greglandrum for keeping an eye on this. Yes, I think that it makes sense and I think that we could also re-evaluate if using RDKit's sanitization could also be an option for us.

There are few issues that we'd like to sort for ChEMBL31 (v30 to be released in ~1 month) and we'll tackle all of them at once since testing changes on the standardiser on the whole ChEMBL database usually takes us a bit of time.

greglandrum commented 2 years ago

@eloyfelix, just to confirm: the ChEMBL team is just using the _molblock() forms of the functions in your pipeline, right? So if we decided to change some of the behavior of the _mol() functions it wouldn't affect you?

eloyfelix commented 2 years ago

yes, @greglandrum, you're right. We are only using standardizer.get_parent_molblock and standardizer.standardize_molblock in our ingestion pipeline so changing the _mol() functions should be OK for us.