TUCAN-nest / TUCAN

A molecular identifier and descriptor for all domains of chemistry.
https://tucan-nest.github.io
GNU General Public License v3.0

graph_from_smiles() missing #85

Open schatzsc opened 1 year ago

schatzsc commented 1 year ago

In tests_chembl.ipynb the function graph_from_smiles() is used to convert a SMILES string to a graph, which is useful when processing ChEMBL and PubChem structures.

It used to be part of tucan.io and could be imported with from tucan.io import graph_from_smiles

However, the function seems to have been lost somewhere along the way and is no longer included in molfile_reader.py (although, admittedly, it would not make much sense under that name).

I would very much like to have it back in tucan.io

I can also provide a graph_from_csd routine (although it requires a local installation of the CSD database), and I am currently working on a graph_from_pubchem function. As with the CSD, PubChem lets one read the atoms and bonds directly, which makes the detour via the SMILES string unnecessary, see:

PubChemPy Dictionary representation


flange-ipb commented 1 year ago

graph_from_smiles was removed in commit af7420be39059e7fd05c08b6cf0704e0d385ccb9 due to its use of RDKit.

schatzsc commented 1 year ago

Thank you very much for pointing to the commit where this was removed. I actually did not remember that it was one of the RDKit-based parts that we decided to kick out due to problems with metal complex handling.

In the meantime, I also figured out how to access ChEMBL and PubChem directly, without the "detour" via molfile or SMILES.

Interestingly, PubChem returns a data structure with explicit hydrogens that is extremely easy to convert to a graph, see graph_from_pubchem()
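To illustrate why a record with explicit hydrogens is easy to convert, here is a minimal sketch. It is not the graph_from_pubchem() mentioned above. The function name graph_data_from_pubchem_dict is hypothetical, and the key names ("atoms", "bonds", "aid", "element", "aid1", "aid2") are assumptions modelled on the dictionary representation that PubChemPy produces for a compound. The sample record is hand-written, not downloaded.

```python
from typing import Any


def graph_data_from_pubchem_dict(record: dict[str, Any]) -> tuple[list[str], list[tuple[int, int]]]:
    """Turn a PubChemPy-style dict into element symbols and 0-based bond pairs.

    Hypothetical helper; the key names are assumptions about the input shape.
    """
    atoms = sorted(record["atoms"], key=lambda atom: atom["aid"])
    # Map PubChem atom ids (assumed unique) to 0-based node indices.
    index_of = {atom["aid"]: i for i, atom in enumerate(atoms)}
    element_symbols = [atom["element"] for atom in atoms]
    bonds = [
        (index_of[bond["aid1"]], index_of[bond["aid2"]])
        for bond in record.get("bonds", [])
    ]
    return element_symbols, bonds


# Hand-written record for water, with explicit hydrogens.
water = {
    "atoms": [
        {"aid": 1, "element": "O"},
        {"aid": 2, "element": "H"},
        {"aid": 3, "element": "H"},
    ],
    "bonds": [
        {"aid1": 1, "aid2": 2, "order": 1},
        {"aid1": 1, "aid2": 3, "order": 1},
    ],
}

element_symbols, bonds = graph_data_from_pubchem_dict(water)
```

The two returned lists have the same shape as the element_symbols and bonds arguments that graph_from_moldata takes in the snippet quoted further down, so no molfile or SMILES round trip is involved.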

ChEMBL, on the other hand, returns a data structure without explicit hydrogens, with only a few exceptions needed to handle tautomers, so it is basically the "H-pruned" heavy-atom core. An implicit_to_explicit_hydrogen preprocessor is therefore needed here; some initial code for it is in implicit_to_explicit_hydrogen_preprocessor(), as found in my "TUCAN playground".
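As a rough idea of what such a preprocessor has to do, here is a minimal sketch that fills each heavy atom up to a default valence. It is not the code from the "TUCAN playground". The name add_explicit_hydrogens and the DEFAULT_VALENCE table are hypothetical, and the approach assumes neutral atoms with integer bond orders, so charges, aromatic bonds, hypervalent atoms and metals are not handled.

```python
# Typical valences of neutral atoms; an illustrative assumption, not a complete table.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2, "S": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def add_explicit_hydrogens(element_symbols, bonds):
    """Append H atoms until each known atom reaches its default valence.

    bonds is a list of (i, j, order) tuples with 0-based atom indices.
    Returns new (element_symbols, bonds) lists; the inputs are not modified.
    """
    used = [0] * len(element_symbols)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order

    symbols = list(element_symbols)
    new_bonds = list(bonds)
    for i, symbol in enumerate(element_symbols):
        valence = DEFAULT_VALENCE.get(symbol)
        if valence is None:
            continue  # element not in the table (e.g. a metal or H): leave untouched
        for _ in range(max(valence - used[i], 0)):
            symbols.append("H")
            new_bonds.append((i, len(symbols) - 1, 1))
    return symbols, new_bonds


# Heavy-atom core of ethanol: C-C-O with single bonds.
symbols, bonds = add_explicit_hydrogens(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
```

Hydrogens that are already explicit in the input count towards the used valence of their neighbour, so they are not added a second time.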

schatzsc commented 1 year ago

Still, it seems to come down to only these few lines from the code section above:

from rdkit import Chem

def graph_from_smiles(smiles: str):
    molfile = _molfile3000_from_smiles(smiles)
    element_symbols, bonds = _parse_molfile3000(molfile)
    return graph_from_moldata(element_symbols, bonds)

def _molfile3000_from_smiles(smiles: str):
    m = Chem.MolFromSmiles(smiles, sanitize=False)
    return Chem.MolToMolBlock(m, forceV3000=True, includeStereo=False, kekulize=False)

Even if they "inherit" RDKit's issues with metals, I would possibly argue for having such a function, for people to use at their own risk?
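The _parse_molfile3000 helper is not shown in the snippet above. As a minimal sketch of what such a parsing step might look like, the following reads the atom and bond blocks of a V3000 connection table. The function name parse_v3000_atoms_and_bonds is hypothetical and this is not TUCAN's implementation. It assumes atom indices run consecutively from 1 and that no line uses the "-" continuation marker.

```python
def parse_v3000_atoms_and_bonds(molfile: str):
    """Extract element symbols and 0-based bond pairs from a V3000 molfile string."""
    element_symbols = []
    bonds = []
    section = None
    for line in molfile.splitlines():
        if not line.startswith("M  V30 "):
            continue  # header lines and "M  END" carry no V3000 data
        fields = line[7:].split()
        if not fields:
            continue
        if fields[0] == "BEGIN":
            section = fields[1]
        elif fields[0] == "END":
            section = None
        elif section == "ATOM":
            # Atom line: index, type (element symbol), x, y, z, aamap, ...
            element_symbols.append(fields[1])
        elif section == "BOND":
            # Bond line: index, type (bond order), atom1, atom2, ...
            bonds.append((int(fields[2]) - 1, int(fields[3]) - 1))
    return element_symbols, bonds


# Hand-written V3000 block for the heavy atoms of methanol.
METHANOL_V3000 = """\

  hypothetical example

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 2 1 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C 0 0 0 0
M  V30 2 O 1.2 0 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 1 1 2
M  V30 END BOND
M  V30 END CTAB
M  END
"""

element_symbols, bonds = parse_v3000_atoms_and_bonds(METHANOL_V3000)
```

A parser along these lines has no RDKit dependency of its own; only the SMILES-to-molfile conversion in _molfile3000_from_smiles needs RDKit.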

schatzsc commented 1 year ago

This is one of the things I forgot in the recent discussion: it would be nice to also have SMILES as input for TUCAN, which can be done with the code fragment above, using RDKit to convert the SMILES string to a V3000 molfile.

rapodaca commented 1 year ago

I'm curious: aside from the issue with metal complexes, why was RDKit removed?

schatzsc commented 1 year ago

Good question. I don't really remember the answer anymore since this modification was made more than 6 months ago, but maybe Jan can give feedback.

My best guess is that it was the last dependency on RDKit that we had in TUCAN. On the one hand, we did not need it for anything else anymore, so removing it simplified the dependencies; on the other hand, "the issue with the metal complexes" is of course not a minor one.

We had a long developers' meeting today with some new people joining, and we will further formalize and harmonize the different input variants in the upcoming months. We also plan PubChem, ChEMBL and CSD interfaces, as well as ORCA and Gaussian computational chemistry file formats as input, plus a lot of other interesting stuff (-:=

On that occasion: you are really missed on Twitter. I just realized today that there were new posts in your blog ...