cdk / depict

SMILES Depiction Generator
GNU Lesser General Public License v2.1

Downloaded cdkdepict 0.3 works differently than on the server #8

Closed sauliusg closed 6 years ago

sauliusg commented 6 years ago

Hi,

I'm trying to reproduce the http://www.simolecule.com/cdkdepict/depict.html server action locally on my host, but the local version, both from pre-compiled jars and compiled from sources, fails to parse non-kekulisable SMILES, such as 'n1cccc1' or '[Cu]12(Oc3c(C(=[N]2N=C(O1)c1ccc(O)cc1)C)cc(Br)cc3)[n]1ccccc1' ;).

My locally installed version fails "... with root cause org.openscience.cdk.exception.InvalidSmilesException: could not parse 'n1cccc1', a valid kekulé structure could not be assigned". Compiling CDK and/or Depict from sources works exactly the same (i.e. raises the exception), as does a command-line wrapper around CDK 2.1 or 2.2.

While this behaviour apparently stems from the underlying CDK SmilesParser, and seems to be a feature, not a bug, the on-line 'cdkdepict' version mentioned above does parse these SMILES and depicts them nicely ;) (see the attached screenshot):

[screenshot from 2018-02-27 17-23-24]

Does this mean that the server uses different/newer CDK libraries? If so, would it be possible to have them pushed into the GitHub repo (maybe as an experimental branch)? Exception messages would also be helpful to debug the case, if presented next to the broken image icon.

The platform for running 'cdkdepict' 0.3 was:

saulius@koala depict-0.3/ $ java -version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
saulius@koala depict-0.3/ $ uname -a
Linux koala 4.13.0-32-generic #35~16.04.1-Ubuntu SMP Thu Jan 25 10:13:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
saulius@koala depict-0.3/ $ osname
Ubuntu-16.04

Sincerely, Saulius (join("@", ( "grazulis", join(".", ("ibt","lt")))) to mail me directly ;)

johnmay commented 6 years ago

The structure is not valid; I've relaxed the constraint recently. If you rebuild from source you will get that behavior.

johnmay commented 6 years ago

Further Reading:
http://efficientbits.blogspot.co.uk/2013/12/new-smiles-behaviour-parsing-cdk-154.html
https://baoilleach.blogspot.co.uk/2017/08/my-acs-talk-on-kekulization-and.html

sauliusg commented 6 years ago

Which repo/commit are your relaxed constraints in? Pulling ff5ee4e from https://github.com/cdk/depict.git and the recent pulls+builds from https://github.com/johnmay/cdk.git or https://github.com/cdk/cdk.git do not change the behaviour.

sauliusg commented 6 years ago

The structure is not valid

The 'n1cccc1' is admittedly not valid, although the web version infers aromaticity correctly for some reason :) The problems come with metal organics like '[Cu]12(Oc3c(C(=[N]2N=C(O1)c1ccc(O)cc1)C)cc(Br)cc3)[n]1ccccc1' or 'c1(cc(c2c3cccc(n3[Cu]3(n12)P(c1ccccc1Oc1c(P3(c2ccccc2)c2ccccc2)cccc1)(c1ccccc1)c1ccccc1)c1ccccc1)C(F)(F)F)C(F)(F)F' (the main species in http://www.crystallography.net/cod/4106494.html), where aromaticity assumptions should take the metal into account.

Here, the behaviour of the web site http://www.simolecule.com/cdkdepict/depict.html would be handy; unfortunately, the 'git cloned' version behaves differently. Is there a possibility to obtain the .jars and sources running on http://www.simolecule.com/cdkdepict/depict.html ?

Regards, S.

schymane commented 6 years ago

The dashed lines indicate that the aromaticity state cannot be determined properly as the SMILES is invalid, shown here:

http://www.simolecule.com/cdkdepict/depict/bow/svg?smi=c1(cc(c2c3cccc(n3%5bCu%5d3(n12)P(c1ccccc1Oc1c(P3(c2ccccc2)c2ccccc2)cccc1)(c1ccccc1)c1ccccc1)c1ccccc1)C(F)(F)F)C(F)(F)F&abbr=on&hdisp=bridgehead&showtitle=false&zoom=1.6&annotate=none

http://www.simolecule.com/cdkdepict/depict/bow/svg?smi=n1cccc1&abbr=on&hdisp=bridgehead&showtitle=false&zoom=1.6&annotate=none

See also discussion here before the rcdk depiction was updated: https://github.com/rajarshi/cdkr/issues/49

Here’s another depiction service, where CDK doesn’t depict (not sure what version they are using):

https://apps.ideaconsult.net/ambit2/depict?search=c1(cc(c2c3cccc(n3%5bCu%5d3(n12)P(c1ccccc1Oc1c(P3(c2ccccc2)c2ccccc2)cccc1)(c1ccccc1)c1ccccc1)c1ccccc1)C(F)(F)F)C(F)(F)F+&smarts=#

If you use e.g. OpenBabel to convert your SMILES to SMILES, you get these, which depict fine:

c1(cc(c2C3CCCC(N3[Cu]3(n12)P(c1ccccc1Oc1c(P3(c2ccccc2)c2ccccc2)cccc1)(c1ccccc1)c1ccccc1)c1ccccc1)C(F)(F)F)C(F)(F)F
[nH]1cccc1

http://www.simolecule.com/cdkdepict/depict/bow/svg?smi=c1(cc(c2C3CCCC(N3%5bCu%5d3(n12)P(c1ccccc1Oc1c(P3(c2ccccc2)c2ccccc2)cccc1)(c1ccccc1)c1ccccc1)c1ccccc1)C(F)(F)F)C(F)(F)F&abbr=on&hdisp=bridgehead&showtitle=false&zoom=1.6&annotate=none

http://www.simolecule.com/cdkdepict/depict/bow/svg?smi=%5bnH%5d1cccc1&abbr=on&hdisp=bridgehead&showtitle=false&zoom=1.6&annotate=none
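Editorial aside: the depiction links in this thread carry the SMILES in the `smi` query parameter, with square brackets percent-encoded (`%5b`, `%5d`). Below is a minimal sketch of building such a link with the Python standard library. The endpoint and parameter names are copied from the URLs above; the helper name `depict_url` is hypothetical, and whether the server still answers at that address is not something this sketch can promise.

```python
from urllib.parse import quote, urlencode

# Endpoint and query parameters as they appear in the links quoted in this thread.
BASE = "http://www.simolecule.com/cdkdepict/depict/bow/svg"


def depict_url(smiles: str) -> str:
    """Build a depiction URL, percent-encoding characters such as [ ] + # @."""
    params = {
        "smi": smiles,
        "abbr": "on",
        "hdisp": "bridgehead",
        "showtitle": "false",
        "zoom": "1.6",
        "annotate": "none",
    }
    # Keep ( ) = readable, as in the thread's links; everything else that is
    # not URL-safe (brackets, '+', '#', '@', '/', '\') gets percent-encoded.
    return BASE + "?" + urlencode(params, quote_via=quote, safe="()=")


if __name__ == "__main__":
    print(depict_url("[nH]1cccc1"))
    print(depict_url("[Cu--]12(Oc3c(C(=[N+]2N=C(O1)c1ccc(O)cc1)C)cc(Br)cc3)[n+]1ccccc1"))
```

Encoding matters most for charged atoms: an unencoded `+` in a query string is commonly decoded as a space, which would silently change `[n+]` into an invalid atom.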

sauliusg commented 6 years ago

The dashed lines indicate that the aromaticity state cannot be determined properly as the SMILES is invalid, shown here: ...

Thanks for the answer! Good to know, I thought dashed lines were just a funny way to display aromatic bonds :)

sauliusg commented 6 years ago

Here’s another depiction service, where CDK doesn’t depict (not sure what version they are using): https://apps.ideaconsult.net/ambit2/depict?search=c1(cc(c2c3cccc(n3%5bCu%5d3(n12)P(c1ccccc1Oc1c(P3(c2ccccc2)c2ccccc2)cccc1)(c1ccccc1)c1ccccc1)c1ccccc1)C(F)(F)F)C(F)(F)F+&smarts=#

I have checked that service; its behaviour is consistent with the stock CDK 2.* behaviour (SMILES that cannot be kekulised throw an exception in SmilesParser).

Which version of CDK are you using in http://www.simolecule.com/cdkdepict/depict.html?

sauliusg commented 6 years ago

If you use e.g. OpenBabel to convert your SMILES to SMILES, you get these, which depict fine: c1(cc(c2C3CCCC(N3[Cu]3(n12)P(c1ccccc1Oc1c(P3(c2ccccc2)c2ccccc2)cccc1)(c1ccccc1)c1ccccc1)c1ccccc1)C(F)(F)F)C(F)(F)F [nH]1cccc1

This is where the problem is: obabel's SMILES are depicted but they are wrong. obabel "converts" the pyridine ring moiety to piperidine (apparently because of the perceived N-Cu bond), which the structure does not have. In the X-ray structure both N-containing rings are flat: http://www.crystallography.net/cod/4106494.html . Moreover, the SMILES you cite have an H added to the five-membered ring which is not there (should be Cu-n1cccc1, not [nH]1cccc1). Open Babel 2.3.2 on my Ubuntu-16.04 does not make this change. Which version was yours?

The proper way to encode the cod/4106494 structure in SMILES is, IMHO, the following: '[Cu]12(P(c1ccccc1)c1ccccc1)n1c(cc(c1c1[n]2c(ccc1)c1ccccc1)C(F)(F)F)C(F)(F)F'. I'm looking for a way to parse this SMILES string in CDK and to depict it :)
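Editorial aside: the pyridine-to-piperidine change described above is visible in the strings themselves, because lowercase atom symbols mark aromatic atoms in SMILES and uppercase ones do not. The following is a rough sketch, not a SMILES parser: it tokenises atoms with a regular expression (an assumption that covers the organic subset and simple bracket atoms, which is enough for the strings in this thread) and counts the lowercase ones. The function name `aromatic_atoms` is hypothetical.

```python
import re

# Bracket atom (optional isotope, element symbol, then anything up to ']'),
# or an organic-subset atom written without brackets.
ATOM = re.compile(r"\[\d*([A-Za-z][a-z]?)[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def aromatic_atoms(smiles: str) -> int:
    """Count atoms written with a lowercase (aromatic) element symbol."""
    count = 0
    for match in ATOM.finditer(smiles):
        # group(1) is set for bracket atoms; otherwise use the whole token.
        symbol = match.group(1) or match.group(0)
        if symbol[0].islower():
            count += 1
    return count


if __name__ == "__main__":
    hand_edited = ("c1(cc(c2c3cccc(n3[Cu]3(n12)P(c1ccccc1Oc1c(P3(c2ccccc2)c2ccccc2)"
                   "cccc1)(c1ccccc1)c1ccccc1)c1ccccc1)C(F)(F)F)C(F)(F)F")
    from_obabel = ("c1(cc(c2C3CCCC(N3[Cu]3(n12)P(c1ccccc1Oc1c(P3(c2ccccc2)c2ccccc2)"
                   "cccc1)(c1ccccc1)c1ccccc1)c1ccccc1)C(F)(F)F)C(F)(F)F")
    # The difference is the size of the ring that lost its aromatic notation.
    print(aromatic_atoms(hand_edited) - aromatic_atoms(from_obabel))
```

A count like this only shows how the ring was written; it says nothing about whether a toolkit can assign a valid Kekulé structure to it.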

schymane commented 6 years ago

There are often problems with aromatic Ns, can you try getting SMILES in “non-aromatic” notation? e.g. caffeine CN1C=NC2=C1C(=O)N(C(=O)N2C)C vs c1(=O)c2c(n(C)c(=O)n1C)ncn2C (both of these are depicted as they are both valid, I just want to demonstrate what I mean with the “non-aromatic” notation …). As to your other question, as far as I am aware the CDK Depict uses the latest and greatest version. At least I use it to test the latest functionality…

sauliusg commented 6 years ago

There are often problems with aromatic Ns, can you try getting SMILES in “non-aromatic” notation? e.g. caffeine CN1C=NC2=C1C(=O)N(C(=O)N2C)C vs c1(=O)c2c(n(C)c(=O)n1C)ncn2C (both of these are depicted as they are both valid, I just want to demonstrate what I mean with the “non-aromatic” notation …).

I see what you mean. Thanks for the tip! I'll try to "kekulize SMILES manually" and see what happens.

schymane commented 6 years ago

I have very little experience with organometallics, but there are two simple examples on the depict website; you may need to define the metal centre?

Cl[Pt@SP1](Cl)([NH3])[NH3] cis-platin
O=N[Co@]([NH3])([NH3])([NH3])([NH3])N(=O) trans-[Co(NH3)4(NO)2]

You may have to go to extended SMILES, but John may be more help there than me.

sauliusg commented 6 years ago

As to your other question, as far as I am aware the CDK Depict uses the latest and greatest version. At least I use it to test the latest functionality…

I tried the leading edge :) CDK (2.1 bundle compiled from https://github.com/cdk/cdk.git master affc8d4 commit), but it also throws exception when parsing invalid SMILES, whereas your Web version does not :). Could you e-mail or post me the CDK jar bundle from the server (maybe privately), for a test? I'd like to see if I get the SMILES parsing as on your server, then I know where to look for a difference...

johnmay commented 6 years ago

I've pushed the changes now, but be warned: here be dragons.

Taking a step back, what toolkit did you use to generate the SMILES? As per Noel's (@baoilleach) talk linked earlier some toolkits don't understand the rules and delocalise structures that should not be delocalised.

Valence model != reality - your structure should probably be: [Cu--]12(Oc3c(C(=[N+]2N=C(O1)c1ccc(O)cc1)C)cc(Br)cc3)[n+]1ccccc1


or, if you do not want charges, the pyridine cannot be delocalised (see Noel's talk: is it aromatic in real life = yes, can it be in SMILES = no):

[Cu]12(Oc3c(C(=[N]2N=C(O1)c1ccc(O)cc1)C)cc(Br)cc3)N1=CC=CC=C1

sauliusg commented 6 years ago

I've pushed the changes now, but be warned: here be dragons.

Great many thanks, you saved my day! Now I see the code and can reproduce the behaviour!

PS. Dragons are OK, we're working to domesticate them :)

sauliusg commented 6 years ago

I've pushed the changes now

BTW, could it be that cdk.version 2.2-SNAPSHOT is not yet on the snapshot repo? My 'mvn compile' complains: "Failure to find org.openscience.cdk:cdk-depict:jar:2.2-SNAPSHOT in https://oss.sonatype.org/content/repositories/snapshots", while it works when taking CDK 2.1-SNAPSHOT as a fall-back.

sauliusg commented 6 years ago

Taking a step back, what toolkit did you use to generate the SMILES?

For the 'c1(cc(c2C3=CCC=C(N3[Cu]3(n12)P(c1ccccc1Oc1c(P3(c2ccccc2)c2ccccc2)cccc1)(c1ccccc1)c1ccccc1)c1ccccc1)C(F)(F)F)C(F)(F)F' string from a COD CIF, I ran 'cif_molecule' from https://github.com/cod-developers/cod-tools, took the largest molecule, then used Open Babel 2.3.2 -- Dec 18 2015 -- 10:48:26 from the Ubuntu-16.04 apt repo to get SMILES (obabel -iCIF -oSMI filter).

That SMILES is IMHO wrong ('...c2C3=CCC=C(N...' instead of '...=C2C3=CC=CC(=[N]...'), so then I either

a) edited the string manually, or
b) loaded it into Avogadro, graphically edited the pyridine ring (deleting H, changing bonds to "aromatic"), saved CML and then used obabel (obabel -iCML -oSMI) to get SMILES, and removed the charges from P.

Both procedures yield identical SMILES 'c1(cc(c2c3cccc(n3[Cu]3(n12)P(c1ccccc1Oc1c(P3(c2ccccc2)c2ccccc2)cccc1)(c1ccccc1)c1ccccc1)c1ccccc1)C(F)(F)F)C(F)(F)F' for the http://www.crystallography.net/cod/4106494.html main structure.

johnmay commented 6 years ago

I've pushed 2.2-SNAPSHOT to OSSRH so it should pick it up from the snapshot repo now.

baoilleach commented 6 years ago

Sorry for butting in, but I'd recommend (and I think John would too) that you use the development version of Open Babel for better handling of aromaticity (thanks to @johnmay). Prior to a rewrite around August of last year, we were clearly getting things wrong.

On Ubuntu 16.04, you can "snap install openbabel --channel=edge" to get this. See https://baoilleach.blogspot.co.uk/2017/12/open-babel-in-snap-ii.html and https://baoilleach.blogspot.co.uk/2017/10/open-babel-in-snap.html for some background.

schymane commented 6 years ago

Any idea when those updates are going to make it into the stable release of Open Babel? How far behind is 2.4.1? Thanks!

sauliusg commented 6 years ago

I've pushed 2.2-SNAPSHOT to OSSRH so it should pick it up from the snapshot repo now.

Thanks a lot! It now works as expected (I had to run 'rm -rf ~/.m2/repository/; cd depict; rm -rf .extract/; mvn clean; mvn package; java -jar target/cdkdepict-0.3.jar', though).

sauliusg commented 6 years ago

There are often problems with aromatic Ns, can you try getting SMILES in “non-aromatic” notation? e.g. caffeine CN1C=NC2=C1C(=O)N(C(=O)N2C)C vs c1(=O)c2c(n(C)c(=O)n1C)ncn2C (both of these are depicted as they are both valid, I just want to demonstrate what I mean with the “non-aromatic” notation …).

I see what you mean. Thanks for the tip! I'll try to "kekulize SMILES manually" and see what happens.

Just in case you are interested: I have manually "kekulised" the aromatic http://www.crystallography.net/cod/4106494.html SMILES, 'c1(cc(c2c3cccc(n3[Cu]3(n12)P(c1ccccc1Oc1c(P3(c2ccccc2)c2ccccc2)cccc1)(c1ccccc1)c1ccccc1)c1ccccc1)C(F)(F)F)C(F)(F)F', as 'C1=(CC(=C2C3=CC=CC(=[N]3[Cu]3([N]12)PC=CC=C1)(C1=CC=CC=C1)C1=CC=CC=C1)C1=CC=CC=C1)C(F)(F)F)C(F)(F)F'. Now both CDK-depict and obabel display them as expected. But... on the https://apps.ideaconsult.net/ambit2/depict both PubChem and Chemical Identifier Resolver fail (empty windows), although they work on the original string (and PubChem I would say produces a reasonable depiction, the one I would expect looking at the crystal structure). Truly metal-organics are tricky :)

Correction:

'C1(=CC(=C2C3=CC=CC(=[N]3[Cu]3([N]12)PC=CC=C1)(C1=CC=CC=C1)C1=CC=CC=C1)C1=CC=CC=C1)C(F)(F)F)C(F)(F)F'

The problem was to write 'C1=(CC(=C2C3...' instead of 'C1(=CC(=C2C3...'.
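Editorial aside: the slip corrected above is easy to make when kekulising by hand. In SMILES a bond symbol that belongs to a branch goes inside the parenthesis, directly before the branch's first atom, so `C1(=CC...` is well-formed while `C1=(CC...` is not. A minimal sketch of a check for exactly this pattern follows; the helper name `misplaced_bonds` is hypothetical, and this is only a lint for one mistake, not a SMILES validator.

```python
# Bond symbols that may appear between atoms in SMILES.
BOND_SYMBOLS = set("-=#$:/\\")


def misplaced_bonds(smiles: str) -> list[int]:
    """Return indices of bond symbols written directly before '('."""
    hits = []
    in_bracket = False  # inside [...] a '-' is a charge, not a bond
    for i, ch in enumerate(smiles[:-1]):
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif not in_bracket and ch in BOND_SYMBOLS and smiles[i + 1] == "(":
            hits.append(i)
    return hits


if __name__ == "__main__":
    print(misplaced_bonds("C1=(CC(=C2C3=CC=CC"))  # first attempt from the comment above
    print(misplaced_bonds("C1(=CC(=C2C3=CC=CC"))  # corrected form
```

Running it on the two prefixes flags the `=` at index 2 in the first string and nothing in the second.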

sauliusg commented 6 years ago

Sorry for butting in, but I'd recommend (and I think John would too) that you use the development version of Open Babel for better handling of aromaticity (thanks to @johnmay). Prior to a rewrite around August of last year, we were clearly getting things wrong.

On Ubuntu 16.04, you can "snap install openbabel --channel=edge" to get this.

Thanks for the hint, @baoilleach! I'm not used to snap and do not trust it enough to run it under 'sudo', but I have a git clone compiled, for tests. It works fine, actually; for the SMILES discussed here it puts radicals either on phosphorus or on the pyrrole residue carbon C1; both seem "too radical", to my taste. The stock version (obabel 2.3.2) does not do that. Sorry if this is off-topic here...

For production, it is nice to have a referenceable version, so we usually use the latest release in the APT repos.