dan2097 / opsin

Open Parser for Systematic IUPAC Nomenclature. Chemical name to structure conversion
https://opsin.ch.cam.ac.uk
MIT License

One vs. two word esters #4

Closed dan2097 closed 13 years ago

dan2097 commented 13 years ago

Original report by Steve Chapman (Bitbucket: isomerdesign).


Omitting the space makes a difference:

[9-Hydroxy-6-methyl-3-(5-phenylpentan-2-yl)oxy-5,6,6a,7,8,9,10,10a-octahydrophenanthridin-1-yl]acetate

[9-Hydroxy-6-methyl-3-(5-phenylpentan-2-yl)oxy-5,6,6a,7,8,9,10,10a-octahydrophenanthridin-1-yl] acetate

Simpler cases exhibit the same behaviour: hexylacetate vs hexyl acetate.

dan2097 commented 13 years ago

Original comment by Daniel Lowe (Bitbucket: dan2097, GitHub: dan2097).


In IUPAC nomenclature, formally only the space-separated version is an ester. In the non-space-separated case I do not think there is sufficient information to determine that the ester interpretation was intended. The absence of a counterion does make the non-ester reading suspicious, but ultimately if someone wants to talk about a "[9-Hydroxy-6-methyl-3-(5-phenylpentan-2-yl)oxy-5,6,6a,7,8,9,10,10a-octahydrophenanthridin-1-yl]acetate" ion they should be able to. Hence I'm leaning towards "working as intended". There is probably room for improvement in ester names with more than one substituent, e.g. "ethyl2-aminoacetate", which was clearly intended to be an ester even though the space is missing.

dan2097 commented 13 years ago

Original comment by Steve Chapman (Bitbucket: isomerdesign).


I agree with each of your points. The worry in this case, for instance, is that the substance is listed correctly in the Misuse of Drugs Act but incorrectly in the ACMD report that recommended its addition: http://www.homeoffice.gov.uk/publications/alcohol-drugs/drugs/acmd1/acmd-report-agonists?view=Binary, causing confusion.

I suppose what I'd like is a Google-type intervention, along the lines of "did you mean finite state machine" when one mistypes "finite stale machine", or some indication that the name is suspect.

Another concern is that a missing locant defaults to 2, e.g. phenyldecanoate = 2-phenyldecanoate. Omitting a locant seems increasingly frowned upon by IUPAC unless there is pretty much no possible ambiguity. Not so here. Consider the difference between 3-hexyl decanoate = hex-3-yl decanoate and 3-hexyldecanoate = 3-(hexyl)decanoate. Even if a missing locant does not defeat the parser, couldn't it whine a little about it?
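A minimal sketch of the kind of "did you mean" hint described above, assuming a purely string-level check that knows no chemistry. The function name `suggest_ester_spelling` and the regular expression are hypothetical and not part of OPSIN; the check simply proposes inserting a space after the last "yl" (plus an optional closing square bracket) when the remainder ends in "ate".

```python
import re

# Hypothetical pattern: group 1 is everything up to the last "yl" (plus an
# optional closing square bracket); group 2 is the remainder, which must
# start with a digit, letter or opening bracket and end in "ate".
_ONE_WORD_ESTER = re.compile(r"^(.*yl\]?)([0-9a-z(\[].*ate)$")


def suggest_ester_spelling(name):
    """Return a space-separated alternative for a one-word name, or None."""
    if " " in name:
        # Already written as two (or more) words: nothing to suggest.
        return None
    match = _ONE_WORD_ESTER.match(name)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"


for candidate in ("hexylacetate", "3-hexyldecanoate", "ethyl2-aminoacetate"):
    print(candidate, "-> did you mean:", suggest_ester_spelling(candidate))
```

A name with explicit parentheses such as 3-(hexyl)decanoate yields no suggestion, because the writer has already marked the hexyl group as a substituent. A real implementation would need to work on the parsed name rather than on raw text.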

dan2097 commented 13 years ago

Original comment by Daniel Lowe (Bitbucket: dan2097, GitHub: dan2097).


Adding detection for ambiguity would be nice, although doing so rigorously is not completely straightforward, e.g. hexyl is not ambiguous even though there are non-equivalent carbons from which a hydrogen could be removed. If ambiguity detection were introduced, I would be keen to keep the number of false positives to an absolute minimum. A charge imbalance could be a good reason to produce a warning (although such structures do exist in some databases), but actually suggesting a cause/solution would require adding a rule to detect this particular problem.
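The charge-imbalance idea can be sketched as follows, assuming the parser's output is available as a SMILES string. The helper names `net_formal_charge` and `charge_warning` are hypothetical, and the SMILES strings in the example are hand-written for illustration rather than taken from OPSIN output. Since charged structures are sometimes intended, the sketch returns a warning message rather than rejecting the name.

```python
import re

# Charges in SMILES only appear inside bracket atoms, e.g. [O-], [NH4+], [Fe+3].
_BRACKET_ATOM = re.compile(r"\[([^\]]+)\]")
# A run of signs optionally followed by a magnitude: "-", "--", "+2", ...
_CHARGE = re.compile(r"([+-]+)(\d*)")


def net_formal_charge(smiles):
    """Sum the formal charges written on the bracket atoms of a SMILES string."""
    total = 0
    for atom in _BRACKET_ATOM.findall(smiles):
        charge = _CHARGE.search(atom)
        if charge is None:
            continue  # neutral bracket atom such as [nH] or [13C]
        signs, digits = charge.groups()
        magnitude = int(digits) if digits else len(signs)
        total += magnitude if signs[0] == "+" else -magnitude
    return total


def charge_warning(name, smiles):
    """Return a warning string if the structure is charged, otherwise None."""
    charge = net_formal_charge(smiles)
    if charge == 0:
        return None
    return (f"'{name}' gives a structure with net charge {charge:+d}; "
            "check for a missing space or counterion")


# Hand-written SMILES (assumptions): the one-word name read as a substituted
# acetate anion, and the two-word name read as the ester.
print(charge_warning("hexylacetate", "CCCCCCCC(=O)[O-]"))
print(charge_warning("hexyl acetate", "CCCCCCOC(C)=O"))
```

As noted above, this only flags that something looks odd; it cannot say why without a rule for the specific problem.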

While I would be happy to accept contributions to this area of the project, I don't think I am going to be able to find the time to look into it personally (my PhD is currently focusing on the automatic extraction of chemical reactions). I have started looking into the fused ring numbering problem you brought up and will update your original post when/if my generalisation of the code is successful.

dan2097 commented 13 years ago

Original comment by Steve Chapman (Bitbucket: isomerdesign).


Thank you, Daniel. I agree it's not a pressing issue--I just felt it should be noted, really. The fused ring numbering problem is more important.

dan2097 commented 13 years ago

Original comment by Daniel Lowe (Bitbucket: dan2097, GitHub: dan2097).


I'm not sure whether or not it's more important, but from a completionist point of view the deficiency in fused ring numbering is very annoying. The version of fused ring numbering I am currently playing with works with 3-, 4-, 5- and 6-membered rings in all combinations, and with ring sizes >6 involved in 2 or fewer rings. The code for aligning the ring system in the direction with the most rings in a line does not seem to work quite right yet with 5-membered rings, and considering only two different variants of the 5-membered rings may not be sufficient for systems where the 5-membered ring is not part of the row with the most rings.

dan2097 commented 13 years ago

Original comment by Daniel Lowe (Bitbucket: dan2097, GitHub: dan2097).


I have added heuristics for treating cases where the space is missing as esters. This version is now up on the web service for testing. The heuristics are:

The lattermost rule is required as there is only one possible position for substitution on these structures.

The detection of ambiguity is pretty good although not completely foolproof (due to things like double bonds not having been formally assigned yet, rather than problems with the atom environment perception algorithm). I'm a bit dubious about this heuristic as it can result in different interpretations of otherwise very similar names, e.g. diethylmalonate --> not ester, diethylsuccinate --> ester, but ethylsuccinate --> not ester (as the position for the ethyl is unambiguous).
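The three examples above can be reproduced with a toy model, sketched here under stated assumptions: it is one reading of the behaviour described, not OPSIN's implementation, which works on perceived atom environments. Each acid is described by hand as a table of substitutable carbons with their replaceable hydrogens, plus the site permutations that map the molecule onto itself. The names `distinct_placements` and `treated_as_ester` are hypothetical.

```python
from itertools import combinations_with_replacement


def distinct_placements(free_hydrogens, symmetries, n_substituents):
    """Count symmetry-distinct ways to place n identical substituents.

    free_hydrogens: dict mapping site label -> number of replaceable hydrogens
    symmetries: list of dicts (site -> site); must list every non-identity
                symmetry operation of the hand-built model
    """
    seen = set()
    for combo in combinations_with_replacement(sorted(free_hydrogens), n_substituents):
        # Skip placements that put more substituents on a site than it has hydrogens.
        if any(combo.count(site) > free_hydrogens[site] for site in set(combo)):
            continue
        # Canonical form: the smallest image of the placement under all symmetries.
        images = [tuple(sorted(combo))]
        for perm in symmetries:
            images.append(tuple(sorted(perm[site] for site in combo)))
        seen.add(min(images))
    return len(seen)


def treated_as_ester(free_hydrogens, symmetries, n_substituents):
    """Assumed rule: more than one possible placement -> read the name as an ester."""
    return distinct_placements(free_hydrogens, symmetries, n_substituents) > 1


# Hand-built models (assumptions): malonate has one CH2; succinate has two
# CH2 groups that are interchanged by the molecule's symmetry.
malonate = ({2: 2}, [])
succinate = ({2: 2, 3: 2}, [{2: 3, 3: 2}])

print("diethylmalonate ester?", treated_as_ester(*malonate, 2))
print("diethylsuccinate ester?", treated_as_ester(*succinate, 2))
print("ethylsuccinate ester?", treated_as_ester(*succinate, 1))
```

In this model two ethyl groups on succinate can go on the same carbon or on different carbons, so the substituent reading is ambiguous and the ester reading wins, whereas a single ethyl group has only one distinct position.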