dan2097 / opsin

Open Parser for Systematic IUPAC Nomenclature. Chemical name to structure conversion
https://opsin.ch.cam.ac.uk
MIT License

1-butanol smiles code #34

Closed dan2097 closed 7 years ago

dan2097 commented 8 years ago

Original report by dbgerhard (Bitbucket: dbgerhard, GitHub: dbgerhard).


Hi folks, thanks for this amazing site.

I entered 1-butanol and expected the SMILES string CCCCO, but your system produced C(CCC)O. Can you check why your SMILES writer is inserting a branch? C(CCC)O would suggest a substituent where none exists. Thanks.

dan2097 commented 8 years ago

Original comment by Daniel Lowe (Bitbucket: dan2097, GitHub: dan2097).


The reason for this is that OPSIN's SMILES writer starts from the first atom in the molecule, which, because of the order in which OPSIN creates groups, is the 1 position of the butane. From this atom it can write out either C(CCC)O or the equally ugly C(O)CCC.

While I agree that CCCCO is prettier, C(CCC)O is still the same structure. Hence I'm not sure the added complication and slight computational cost in the SMILES writer is a good trade-off, as SMILES are primarily intended to be read by machines. The hypothetical fix, I think, would be to start on the first terminal atom (if such an atom exists).

On opsin.ch.cam.ac.uk, the primary purposes of the outputs are: the depiction for humans; SMILES for input to other software; StdInChI for checking structure identity (OPSIN's SMILES are NOT canonical, and even if they were, StdInChI is better at handling mesomers/tautomers); and StdInChIKey for easily searching for documents mentioning that structure.
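To make the starting-atom effect concrete, here is a minimal sketch of a toy depth-first SMILES writer. It is not OPSIN's actual writer: the names `ATOMS`, `BONDS`, `write_smiles` and `first_terminal_atom` are hypothetical, the atom order (C-1 first) is an assumption chosen to mirror the explanation above, and the writer only handles acyclic, single-bonded heavy-atom graphs.

```python
from typing import Dict, List

# Hypothetical toy molecule: the heavy atoms of 1-butanol, indexed in the
# order a name parser might create them (atom 0 is C-1, which bears the OH).
ATOMS: List[str] = ["C", "C", "C", "C", "O"]
BONDS: Dict[int, List[int]] = {
    0: [1, 4],  # C-1 is bonded to C-2 and to O
    1: [0, 2],
    2: [1, 3],
    3: [2],     # C-4 is a terminal atom
    4: [0],     # O is a terminal atom
}


def write_smiles(start: int) -> str:
    """Depth-first SMILES for an acyclic, single-bonded heavy-atom graph."""

    def visit(atom: int, parent: int) -> str:
        children = [n for n in BONDS[atom] if n != parent]
        out = ATOMS[atom]
        # Every child except the last is written as a parenthesised branch.
        for child in children[:-1]:
            out += "(" + visit(child, atom) + ")"
        if children:
            out += visit(children[-1], atom)
        return out

    return visit(start, -1)


def first_terminal_atom() -> int:
    """Index of the first atom with exactly one neighbour, else atom 0."""
    for idx in range(len(ATOMS)):
        if len(BONDS[idx]) == 1:
            return idx
    return 0


print(write_smiles(0))                      # start at the first atom
print(write_smiles(first_terminal_atom()))  # start at a terminal atom
```

With this toy graph, starting at atom 0 yields C(CCC)O, while starting at the first terminal atom yields CCCCO. Both strings describe the same structure; only the traversal start differs.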

dan2097 commented 8 years ago

Original comment by dbgerhard (Bitbucket: dbgerhard, GitHub: dbgerhard).


Hi Daniel, thanks for getting back to me and I appreciate your response. Let me explain my application and see if there's any possible solution.

We were hoping to use your system, in combination with a bunch of other stuff, to produce a large set (thousands) of automatically generated multiple-choice and JSME questions for our students learning organic chemistry nomenclature and reactions. Our system begins by randomly generating a correct reaction based on a database of IUPAC names, removes one component, and then generates a series of plausible incorrect answers the student can choose from. Our JSME questions require the student to draw the missing molecule, and the question then checks the SMILES generated by JSME against the SMILES generated by OPSIN from the IUPAC name. I know we're cobbling bits together, but given our limited development time and budget we really wanted to avoid writing our own IUPAC-to-SMILES parser, so when we discovered OPSIN could do this, we were very excited.

The question system itself is based on Moodle, and as such requires a perfect match between the student's response (the SMILES produced by JSME) and the SMILES representing the molecule, since Moodle isn't smart enough to know that CCCCO is the same as C(CCC)O.

So we're hoping to match JSME-generated SMILES to OPSIN-generated SMILES, and since the two produce different variants, the systems can't talk to each other. Do you know of any way to start from a SMILES string (or IUPAC name) and generate all the possible SMILES variants? We would then need to hard-code each possible variant into the Moodle questions to ensure a match happens when the student draws a molecule in JSME.

dan2097 commented 8 years ago

Original comment by Daniel Lowe (Bitbucket: dan2097, GitHub: dan2097).


It isn't going to be possible to get OPSIN to generate the same SMILES as JSME. As well as differing in atom ordering, SMILES can represent aromatic systems in two ways: OPSIN will give C1=CC=CC=C1 for benzene, but I think JSME will give you c1ccccc1; again, both are the same structure. Canonical SMILES algorithms have been developed to address these issues, but almost all implementations differ from each other, so canonical SMILES generated by one implementation can be compared with each other but not with those generated by another! (This is part of the reason why OPSIN doesn't even try to produce canonical SMILES.)

If it's possible to hook in another web service, then my recommendation would be to convert the SMILES from JSME to StdInChI using the NCI's resolver (OPSIN can produce StdInChI directly). It can be used RESTfully to perform such conversions, e.g.:
https://cactus.nci.nih.gov/chemical/structure/C(CCC)O/stdinchi
https://cactus.nci.nih.gov/chemical/structure/CCCCO/stdinchi
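As a rough sketch of how such a lookup might be scripted, the snippet below builds the resolver URL for a SMILES string and compares two structures by their StdInChI. The function names are hypothetical, and it assumes the resolver accepts a percent-encoded SMILES in the URL path and returns the StdInChI as plain text; error handling for unresolvable input is left out.

```python
from urllib.parse import quote
from urllib.request import urlopen

BASE = "https://cactus.nci.nih.gov/chemical/structure"


def stdinchi_url(smiles: str) -> str:
    """Build the resolver URL, percent-encoding the SMILES for use in a path."""
    # safe="" also encodes characters such as '/', '#' and parentheses,
    # which would otherwise be misread as URL syntax.
    return f"{BASE}/{quote(smiles, safe='')}/stdinchi"


def fetch_stdinchi(smiles: str, timeout: float = 10.0) -> str:
    """Ask the resolver for the StdInChI (assumes a plain-text response)."""
    with urlopen(stdinchi_url(smiles), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def same_structure(smiles_a: str, smiles_b: str) -> bool:
    """Treat two SMILES as equivalent if their StdInChI strings match."""
    return fetch_stdinchi(smiles_a) == fetch_stdinchi(smiles_b)


if __name__ == "__main__":
    # Requires network access to the resolver.
    print(same_structure("C(CCC)O", "CCCCO"))
```

In a Moodle setting the expected StdInChI could be stored with the question, so only the student's JSME output would need to be converted at answer time.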

Although probably out of scope for a system where you are either right or wrong, InChIs have the nice property of being layered, so if they are not identical you can work out at what point two compounds differ, e.g. stereochemistry, isotopes, hydrogen positions, connectivity or atomic composition.
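A minimal sketch of what such a layer-by-layer comparison could look like is below. The helper names and the `LAYERS` table are hypothetical, and the parsing is a simplification: it assumes the formula is the second field and that each later layer prefix appears only once, which is not guaranteed for every InChI. The example strings in the usage note were typed in by hand for illustration and should be checked against a real InChI generator.

```python
from typing import Dict, Optional

# Layer prefixes in the order they are compared, with rough descriptions.
LAYERS = [
    ("version", "InChI version"),
    ("formula", "atomic composition"),
    ("c", "connectivity"),
    ("h", "hydrogen positions"),
    ("q", "charge"),
    ("p", "protonation"),
    ("b", "double-bond stereochemistry"),
    ("t", "tetrahedral stereochemistry"),
    ("m", "tetrahedral stereochemistry"),
    ("s", "stereochemistry type"),
    ("i", "isotopes"),
]


def inchi_layers(inchi: str) -> Dict[str, str]:
    """Split an InChI string into a {layer prefix: content} mapping."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:  # skip empty segments
            layers[part[0]] = part[1:]
    return layers


def first_difference(inchi_a: str, inchi_b: str) -> Optional[str]:
    """Describe the first layer at which two InChIs differ, or None."""
    layers_a, layers_b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    for key, description in LAYERS:
        if layers_a.get(key) != layers_b.get(key):
            return description
    return None
```

For instance, comparing `InChI=1S/C4H10O/c1-2-3-4-5/h5H,2-4H2,1H3` (intended as 1-butanol) with `InChI=1S/C4H10O/c1-3-4(2)5/h4-5H,3H2,1-2H3` (intended as 2-butanol) reports a difference in connectivity, since the formula layers match but the `/c` layers do not.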

The same service can also convert a SMILES string to canonical SMILES. Some example differences between canonical SMILES and InChI: InChI will consider a nitro group represented as N(=O)=O equivalent to one represented as [N+](=O)[O-], and InChI will consider common tautomers of a compound to be equivalent.

To give an idea of why enumerating the possible SMILES for a structure is not tractable, the following might be useful: https://nextmovesoftware.com/blog/2014/07/15/how-do-i-write-thee-let-me-count-the-ways/
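As a small illustration of the growth, the sketch below enumerates the strings a toy depth-first writer can emit for 1-butanol by varying only the starting atom and the order in which neighbours are visited. The graph, `variants` and `all_smiles` are hypothetical names for this example, and this is only one restricted family of spellings: it ignores bracket atoms, ring-closure choices and aromatic versus Kekulé forms, all of which multiply the count further.

```python
from itertools import permutations
from typing import Dict, List, Set

# Hypothetical toy graph: heavy atoms of 1-butanol (atom 0 bears the OH).
ATOMS: List[str] = ["C", "C", "C", "C", "O"]
BONDS: Dict[int, List[int]] = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0]}


def variants(atom: int, parent: int) -> Set[str]:
    """All strings for the subtree at `atom`, over every neighbour ordering."""
    children = [n for n in BONDS[atom] if n != parent]
    if not children:
        return {ATOMS[atom]}
    results: Set[str] = set()
    for order in permutations(children):
        partials = {ATOMS[atom]}
        for position, child in enumerate(order):
            is_last = position == len(order) - 1
            subtrees = variants(child, atom)
            # Every child except the last is written as a parenthesised branch.
            partials = {
                p + (s if is_last else f"({s})")
                for p in partials
                for s in subtrees
            }
        results |= partials
    return results


def all_smiles() -> Set[str]:
    """Union of the variants obtained from every possible starting atom."""
    found: Set[str] = set()
    for start in range(len(ATOMS)):
        found |= variants(start, -1)
    return found


print(sorted(all_smiles()))
```

Even for this five-atom chain the toy writer produces eight distinct strings, including both CCCCO and C(CCC)O, and the number climbs quickly with branching and rings.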

The NCI's chemical identifier resolver is more fully documented here: https://cactus.nci.nih.gov/chemical/structure. It actually uses OPSIN for converting systematic chemical names, although I think it might still be using quite an old version.