cdk / depict

SMILES Depiction Generator
GNU Lesser General Public License v2.1

How to disable the Thiele ring? #38

Closed nbehrnd closed 2 years ago

nbehrnd commented 2 years ago

This feature request seeks to disable the Thiele ring depicting aromaticity, either a) as an option to toggle-on/off (similar to the CIP labels), or b) to disable this feature for good.

I just submitted the two SMILES strings to CDKDepict's input mask

[S@@](O)(=O)(=O)CC(=O)OC(=O)c1c2[n](C)c3[n]1CC(=O)OC(=O)c4ccccc4.C3OC.C2O
CN1[C]2N3C4=C1CO.C3C(=O)OC(=O)c5ccccc5.C2OC.C4(=O)OC(=O)C[S@](=O)(=O)O

which yields two dots (first structure, vide infra) or one dot (second structure), as drawn for radical compounds, in the structure formulae while the Thiele ring is present at the same time. The Thiele ring is often used to symbolize aromaticity, which for mononuclear aromatic compounds like benzene, the cyclopentadienyl anion, etc. means the presence of six \pi electrons. Despite the N-alkylation, however, it would be misleading to assume that there are six \pi electrons plus one (or even two) radical electrons for a total of seven or eight. In addition to offering an easier electron count, disabling the ring would yield more consistency in the representation of structure formulae (e.g., the first/left structure uses the Thiele ring on the phenyl moiety, the second/right structure does not).

[Image: failing_imidazoles]

johnmay commented 2 years ago

You need to put in a valid SMILES :-). This happens if there is no valid Kekulé assignment - for example pyrrole is [nH]1cccc1 and NOT n1cccc1. There is no way to localise the bonds unambiguously - in your case there is a radical but it could be on either nitrogen. Essentially we reject it as invalid - but rather than display nothing we can do a little better and show we don't know where bonds go.
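To make the pyrrole example concrete, here is a minimal sketch. It assumes the common formulation that localising the bonds amounts to finding a perfect matching over the aromatic atoms that still need a double bond. The ring is encoded by hand, and `has_kekule_assignment` is a hypothetical helper for illustration, not CDK's implementation.

```python
def has_kekule_assignment(needs_double, bonds):
    """Return True if every atom in needs_double can get exactly one double bond.

    needs_double: set of atom indices that must take part in one double bond
    bonds: iterable of (a, b) aromatic bonds between atom indices
    """
    neighbours = {a: set() for a in needs_double}
    for a, b in bonds:
        # only bonds between two "needy" atoms can become double bonds
        if a in needs_double and b in needs_double:
            neighbours[a].add(b)
            neighbours[b].add(a)

    def match(unmatched):
        if not unmatched:
            return True
        atom = min(unmatched)  # deterministic choice of the next atom
        for other in neighbours[atom] & unmatched:
            if match(unmatched - {atom, other}):
                return True
        return False

    return match(frozenset(needs_double))


# Five-membered ring, atom 0 is the nitrogen
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]

# n1cccc1: the nitrogen has no hydrogen, so all five atoms need a double bond
print(has_kekule_assignment({0, 1, 2, 3, 4}, ring))  # False

# [nH]1cccc1: the N-H nitrogen needs none, only the four carbons do
print(has_kekule_assignment({1, 2, 3, 4}, ring))  # True
```

The exhaustive search is only meant for ring systems of this size; real toolkits use more efficient matching algorithms.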

Most likely one of those nitrogens is charged?

[S@@](O)(=O)(=O)CC(=O)OC(=O)c1c2[n+](C)c3[n]1CC(=O)OC(=O)c4ccccc4.C3OC.C2O
[S@@](O)(=O)(=O)CC(=O)OC(=O)c1c2[n](C)c3[n+]1CC(=O)OC(=O)c4ccccc4.C3OC.C2O

Perhaps it was a zwitterion?

[S@@](O)(=O)(=O)CC(=O)OC(=O)c1c2[n+](C)c3n1CC(=O)OC(=O)c4ccccc4.C3OC.C2[O-]

[Image]

It's a very odd SMILES nonetheless with the ring closures - where did you get it from? Stereo on the sulphur is also suspect - it's tautomeric so non-stereogenic!

johnmay commented 2 years ago

To clarify: after the round trip the formula has changed, and whatever did the round trip placed the radical on the carbon.

nbehrnd commented 2 years ago

To your reply, I would like to add two comments. Neither one aims to re-open the issue as a problem of CDKDepict.

Point 1, validity of SMILES. While copy-pasting these SMILES (and others, not depicted) from a round trip with Jmol, I wasn't aware that some of them were, to varying degrees, problematic or incorrect. Submitting the first SMILES string to Open Babel (version 3.1.1 as provided by Linux Debian 12/bookworm (branch testing)) only yields a warning but doesn't stop it from processing. The optional use of --errorlevel (entry documentation) doesn't prevent a conversion either:

~$ obabel -:"[S@@](O)(=O)(=O)CC(=O)OC(=O)c1c2[n](C)c3[n]1CC(=O)OC(=O)c4ccccc4.C3OC.C2O" --errorlevel 5 -ocan
==============================
*** Open Babel Warning  in ParseSmiles
  Failed to kekulize aromatic SMILES

COC[C]1N(C)C(=C(N1CC(=O)OC(=O)c1ccccc1)C(=O)OC(=O)CS(=O)(=O)O)CO    
1 molecule converted

Thus, the visual representations of the formulae as .svg of the first and second SMILES string look identical (both without the wedges in the sulfonic acid). Since e.g. ChemDraw JS processes the problematic SMILES string without warning, I wonder if there is a SMILES checker which would raise a red flag (similar to e.g. checkcif) for severe problems in the syntax of the SMILES string, and/or if future versions of Open Babel should refuse this input altogether. After all, a missing ring closure already is a valid cause to stop working:

$ obabel -:"C1CCCC" -ocan
==============================
*** Open Babel Warning  in ParseSmiles
  Invalid SMILES string: 1 unmatched ring bonds.

0 molecules converted
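A check of this kind is easy to run before handing a list of SMILES to any converter. The sketch below is only a syntax-level test for unmatched ring-closure numbers; `unmatched_ring_closures` is a hypothetical helper, and by construction it cannot detect the kekulization problem of the first SMILES string.

```python
DIGITS = "0123456789"


def unmatched_ring_closures(smiles):
    """Return the sorted ring-closure numbers that are opened but never closed.

    Handles single digits and the two-digit %nn form. Digits inside [...]
    (isotopes, hydrogen counts, charges) are skipped.
    """
    open_closures = set()
    in_bracket = False
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif not in_bracket:
            number = None
            pair = smiles[i + 1:i + 3]
            if ch == "%" and len(pair) == 2 and all(c in DIGITS for c in pair):
                number = int(pair)
                i += 2  # skip the two digits just consumed
            elif ch in DIGITS:
                number = int(ch)
            if number is not None:
                # first occurrence opens the closure, second one closes it
                open_closures ^= {number}
        i += 1
    return sorted(open_closures)


print(unmatched_ring_closures("C1CCCC"))    # [1]
print(unmatched_ring_closures("c1ccccc1"))  # []
```

Ring-closure numbers are tracked across the dot-separated parts, so the first SMILES string of this thread passes the check even though it fails kekulization.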

Point 2, Thiele ring. I agree with your perspective that CDKDepict doesn't iron out imperfections and errors in the SMILES submitted (point 1). Still, by observation, I'm a bit surprised that of the two SMILES initially submitted, one displays the phenyl ring with and the other without the Thiele ring (vide infra, entries 1 and 2), though in both the phenyl ring is at equal distance from the former imidazole ring:

[S@@](O)(=O)(=O)CC(=O)OC(=O)c1c2[n](C)c3[n]1CC(=O)OC(=O)c4ccccc4.C3OC.C2O #1
CN1[C]2N3C4=C1CO.C3C(=O)OC(=O)c5ccccc5.C2OC.C4(=O)OC(=O)C[S@](=O)(=O)O #2

CC(=O)OC(=O)c1c2[n](C)c3[n]1CC(=O)OC(=O)c4ccccc4.C3OC.C2O #3
OC(=O)c1c2[n](C)c3[n]1CC(=O)OC(=O)c4ccccc4.C3OC.C2O #4
c1c2[n](C)c3[n]1CC(=O)OC(=O)c4ccccc4.C3OC.C2O #5
c1c2[n](C)c3[n]1CC(=O)OC(=O)c4ccccc4.C3OC.C2 #6
c1c2[n](C)c3[n]1CC(=O)OC(=O)c4ccccc4.C3.C2 #7

c1c2[n](C)c3[n]1CCCCc4ccccc4.C3.C2 #8
c1c2[n](C)c3[n]1Cc4ccccc4.C3.C2 #9
c1c[n](C)c[n]1Cc5ccccc5 #10

c1c[n]c[n]1Cc5ccccc5 #11

Simplification of SMILES 1 (entries 3 to 7) and further modification (entries 8 to 10) lead CDKDepict to retain the Thiele ring, while removal of the N-alkylation ((C) present in entry 1, and 3 to 10) promptly yields the representation as if there were alternating single and double bonds in benzene. For me, it is surprising because both entry 1 and 2 include the explicit ring closure about the phenyl ring with six carbon atoms, only encoded once by c4ccccc4 and the other time c5ccccc5.

[Image: 2022-03-14_Thiele_ring]

johnmay commented 2 years ago

Yeah, it should not be just a warning if the molecular formula (MF) changes. I was aware of this bug!

See an old slide from 2015 - image

However, I thought newer versions of Open Babel had fixed this, since Noel O'Boyle had done a lot of work on improving kekulisation. On kekulisation failure, one option is to carry on and try to localise the bonds in the other ring systems. However, my view is more or less: garbage in, garbage out.

johnmay commented 2 years ago

Thus -- by observation -- I'm a bit surprised the two SMILES initially submitted once display the phenyl ring with, and once without the Thiele ring (vide infra, entries 1 and 2), though they both appear at equal distance to the once imidazole ring.

One has a broken aromatic system, one does not. When assigning bond orders, the current approach gives up as soon as it sees something dodgy.

Simplification of SMILES 1 (entries 3 to 7) and further modification (entries 8 to 10) lead CDKDepict to retain the Thiele ring, while removal of the N-alkylation ((C) present in entry 1, and 3 to 10) promptly yields the representation as if there were alternating single and double bonds in benzene. For me, it is surprising because both entry 1 and 2 include the explicit ring closure about the phenyl ring with six carbon atoms, only encoded once by c4ccccc4 and the other time c5ccccc5.

Yes, because removing the N-alkylation makes that ring system valid. But you're conflating things here; the issue is valence:

c1c[n]c[n]1Cc5ccccc5 #11
c1c[nH]c[n]1Cc5ccccc5 #12 <-- broken again :-)
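The valence argument for entries 11 and 12 can be spelled out with a small sketch. It uses a deliberately simplified rule for aromatic carbon and nitrogen only (an assumption of this example, not CDK's actual model): an atom needs a double bond if it is exactly one bond short of its usual valence. `needs_double_bond` and `count_needing_double` are hypothetical helpers.

```python
def needs_double_bond(element, sigma_bonds, charge=0):
    """Simplified valence rule for aromatic C and N (an assumption of this sketch).

    sigma_bonds counts every single-bond partner, hydrogens included.
    """
    usual_valence = {"C": 4, "N": 3}[element]
    if element == "N":
        usual_valence += charge  # e.g. [n+] can carry four bonds
    return usual_valence - sigma_bonds == 1


def count_needing_double(ring):
    """Count the ring atoms that must take part in a double bond."""
    return sum(needs_double_bond(*atom) for atom in ring)


# Five-membered ring of each entry as (element, sigma_bonds, charge)
entry_11 = [("C", 3, 0), ("C", 3, 0), ("N", 2, 0), ("C", 3, 0), ("N", 3, 0)]
entry_12 = [("C", 3, 0), ("C", 3, 0), ("N", 3, 0), ("C", 3, 0), ("N", 3, 0)]

for name, ring in [("#11", entry_11), ("#12", entry_12)]:
    n = count_needing_double(ring)
    # an odd count can never be paired up into double bonds
    verdict = "may be localised" if n % 2 == 0 else "cannot be localised"
    print(name, n, verdict)
```

An even count is a necessary condition only, not a sufficient one. Under the same rule, a charged `[n+]` with three sigma bonds needs a double bond again, which is why the charged variants suggested earlier in the thread become valid.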

Further reading: there is a chapter in my thesis, and Noel has a more approachable talk on the topic:
https://www.repository.cam.ac.uk/handle/1810/246652
https://baoilleach.blogspot.com/2017/08/my-acs-talk-on-kekulization-and.html

nbehrnd commented 2 years ago

I agree CDKDepict's focus is the visualization of formulae, without intent to correct flaws (GIGO). Not knowing about Noel O'Boyle's work to improve Open Babel so that it broadcasts warnings or prevents questionable conversions altogether, I filed a ticket with Open Babel on how to adopt a more conservative approach when processing a list of SMILES (cross-link). Thank you for pointing out the literature references.
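Until the converter itself becomes stricter, a conservative policy can be approximated on the user side by treating the kekulization warning as a failure. The sketch below assumes nothing beyond the warning text quoted earlier in this thread; `is_trustworthy` is a hypothetical helper, and the sample log is copied from that transcript rather than produced by running `obabel` here.

```python
KEKULIZE_WARNING = "Failed to kekulize aromatic SMILES"


def is_trustworthy(log_text):
    """Return False if the Open Babel messages report a kekulization failure."""
    return KEKULIZE_WARNING not in log_text


sample_log = """\
==============================
*** Open Babel Warning  in ParseSmiles
  Failed to kekulize aromatic SMILES

1 molecule converted
"""

print(is_trustworthy(sample_log))              # False
print(is_trustworthy("1 molecule converted"))  # True
```

In a batch script, the log text would come from capturing the messages of each `obabel` call; which output stream carries the warning is worth checking on the installed version.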

Since there are checkers that hunt (syntax) errors, inconsistencies, and omissions in small-molecule crystallographic data (e.g., checkcif) and label the problems by severity, I would welcome a similarly well-curated and documented, widely accepted, and easy-to-deploy SMILES syntax checker (or, in the parlance of the last main slide of the presentation you indicated, a validation suite). Like the one to check the assignment of CIP labels mentioned here, or the one by the InChI Trust regarding InChI (in their test.zip here). While I lean much more towards using than writing short scripts that deal with abbreviated structure representations, such a «pedantic automatic SMILES checker» would likely prevent many instances of GIGO by teaching how to use SMILES correctly.

johnmay commented 2 years ago

I am a co-author and wrote the CIP test script :-) - the lone pair stuff has nothing to do with CIP and should be avoided at all costs unless you need to read/write IUPAC names. There are formalised grammars for SMILES, but most of the time that's overkill; in this case it is a semantic and not a syntax issue, so a grammar would not help.