dan2097 / opsin

Open Parser for Systematic IUPAC Nomenclature. Chemical name to structure conversion
https://opsin.ch.cam.ac.uk
MIT License

Doc bug: When to use lowercase for an OPSIN smiles in an XML file #53

Closed dan2097 closed 6 years ago

dan2097 commented 6 years ago

Original report by Noel O'Boyle (Bitbucket: baoilleach, GitHub: baoilleach).


Hi Daniel,

I'm trying to add an entry to one of the XML files and I'm wondering when to use lowercase. E.g. in arylgroups.xml, there's:

cytosin

This has a ring where only 5 of the 6 atoms are lowercase. How do I decide when to use lowercase in these circumstances?

dan2097 commented 6 years ago

Original comment by Daniel Lowe (Bitbucket: dan2097, GitHub: dan2097).


I think that's actually a mistake and that first N should really be [nH]. Lowercase means that an atom can form a double bond to another lowercase atom, provided that doing so does not violate valency. In IUPAC parlance, it means that the atom is part of a ring system with the maximum number of noncumulative double bonds.

An example of a cytosine that probably should have a double bond to that first atom is 5H-cytosine, which is probably misinterpreted at present. I think I used N because, historically, n and nH were not distinguished, which could cause hydrogens to move. In the current code, the H is used as a hint for where to put the hydrogen if one ends up with an odd number of atoms that are eligible for double bonds (cf. the SMILES for pyrrol). Note, though, that being an nH doesn't guarantee that the atom will have a hydrogen in the final molecule, e.g. 2H-pyrrole.

The general rule of thumb is that if your tool generates aromatic SMILES, OPSIN should accept them.

dan2097 commented 6 years ago

Original comment by Noel O'Boyle (Bitbucket: baoilleach, GitHub: baoilleach).


Ok, got it. If these are errors, I can probably write a Pybel script to find such cases (i.e. rings where only some of the atoms are marked as aromatic), if that would be helpful. And maybe even correct them. :-)
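For illustration, the check described above (rings where only some of the atoms are lowercase) could be sketched roughly as follows. This is not the Pybel script mentioned in the thread: it is a pure-Python approximation with a deliberately simplified SMILES tokenizer, and the function names (`parse`, `ring_path`, `mixed_case_rings`) and the cytosine-like test string are hypothetical, not taken from OPSIN or from arylgroups.xml.

```python
import re
from collections import deque

# Simplified SMILES tokens: bracket atoms, organic-subset atoms (uppercase =
# aliphatic, lowercase = aromatic), ring-closure labels, branches, bond symbols.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[()]|[-=#:/\\.]")
BOND_CHARS = set("-=#:/\\")

def parse(smiles):
    """Return (atoms, bonds, closures); atoms are (token, is_aromatic) pairs."""
    atoms, bonds, closures, stack, open_rings = [], [], [], [], {}
    prev = None
    for tok in TOKEN.findall(smiles):
        if tok == "(":
            stack.append(prev)
        elif tok == ")":
            prev = stack.pop()
        elif tok[0].isdigit() or tok[0] == "%":
            label = tok.lstrip("%")
            if label in open_rings:
                closures.append((open_rings.pop(label), prev))
            else:
                open_rings[label] = prev
        elif tok in BOND_CHARS:
            continue
        elif tok == ".":
            prev = None
        else:
            # Aromatic if the first letter of the atom token is lowercase.
            first_letter = next(ch for ch in tok if ch.isalpha())
            atoms.append((tok, first_letter.islower()))
            if prev is not None:
                bonds.append((prev, len(atoms) - 1))
            prev = len(atoms) - 1
    return atoms, bonds, closures

def ring_path(adj, start, goal):
    """Shortest path from start to goal that avoids the direct start-goal bond."""
    came_from = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for nxt in adj[node]:
            if node == start and nxt == goal:
                continue  # skip the ring-closure bond itself
            if nxt not in came_from:
                came_from[nxt] = node
                queue.append(nxt)
    return None

def mixed_case_rings(smiles):
    """List rings (as atom tokens) that mix lowercase and uppercase atoms."""
    atoms, bonds, closures = parse(smiles)
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds + closures:
        adj[a].append(b)
        adj[b].append(a)
    flagged = []
    for a, b in closures:
        path = ring_path(adj, a, b)
        if path and len({atoms[i][1] for i in path}) == 2:
            flagged.append([atoms[i][0] for i in path])
    return flagged

# Hypothetical cytosine-like SMILES with an uppercase ring N:
print(mixed_case_rings("N1c(=O)nc(N)cc1"))  # [['N', 'c', 'n', 'c', 'c', 'c']]
```

Only one ring is examined per ring-closure bond, and a saturated ring fused to an aromatic one is flagged as well, so any hits are candidates for manual review rather than confirmed errors.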

dan2097 commented 6 years ago

Original comment by Daniel Lowe (Bitbucket: dan2097, GitHub: dan2097).


I've fixed the obvious cases in that file. In integration testing, the only change was in the interpretation of 3,N4-ethenocytosine. I think this now gives a tautomer of the typically given structure, although InChI still considers them to be different. (The integration testing actually said the structure was now wrong, because for some reason the ancient version of ChEBI I use for integration testing was also missing a double bond; the current version of ChEBI has it correct.) I wouldn't worry too much about the choice of n vs N in OPSIN's resources unless it is affecting the output.