MobleyLab / chemper

Repository for Chemical Perception Sampling Tools
MIT License

Investigate SMARTS/SMIRKS syntax rules in OE/RDK #34

Closed bannanc closed 5 years ago

bannanc commented 5 years ago

So I discovered a difference between RDKit and OpenEye in how they handle the connectivity decorator Xn.

In OpenEye, there is no default for Xn which means X cannot be used by itself as a decorator. For example, the SMIRKS "[*;X:1]" is not valid in OpenEye.

I was testing ChemicalEnvironments, which I mostly copied from openforcefield (for now; I think I'll use it as a dependency once it supports RDKit). The tests failed because the SMIRKS "[*;X:1]" was meant to check that ChemicalEnvironments raises an error: when I originally wrote the tests, I thought it wasn't a valid SMIRKS.

In the Daylight Manual, X has a default value of 1, so X used without a number is equivalent to X1. This is another example where OpenEye deviates from Daylight, but not in a way that is significant for us, since we don't really want to use decorators without the number specified.
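Since we want every X to carry an explicit number, a rough text-level check can flag the bare form before a SMIRKS ever reaches a toolkit. This is only a sketch, not a SMARTS parser: the name `atoms_with_bare_x` is made up for illustration, and it only looks at the X decorator inside bracket atoms.

```python
import re

# Innermost bracket atoms, e.g. "[*;X:1]" -> "*;X:1".
BRACKET_ATOM = re.compile(r"\[([^\[\]]*)\]")
# An 'X' that is not followed by a digit; 'Xe' (xenon) is left alone.
BARE_X = re.compile(r"X(?![0-9e])")


def atoms_with_bare_x(smirks):
    """Return the bracket-atom bodies that use X without a number."""
    return [body for body in BRACKET_ATOM.findall(smirks)
            if BARE_X.search(body)]


for pattern in ("[*;X:1]", "[*;X4:1]-[#6X3:2]", "[Xe]"):
    print(pattern, atoms_with_bare_x(pattern))
```

Only the first pattern is flagged; the other two give an empty list.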

This issue isn't the best place for this concern, but I'd like to make sure there aren't other deviations, and then have them documented somewhere for OpenFF so other people don't have to find them the hard way.

bannanc commented 5 years ago

OK, so more oddities related to this. I tried changing my local test SMIRKS to "[Z:1]". "Z" is NOT a valid SMARTS/SMIRKS/SMILES symbol, but RDKit parses it.

In chemper I used "]X[" to test a SMIRKS that fails, and it works with both RDKit and OpenEye.

Next thought process:

  1. Maybe RDKit doesn't care what is inside the brackets and it just won't match anything? Nope: '[m:1]' isn't parseable, but '[z:1]' is.

  2. Are the symbols that parse metals or other less common elements? I wondered this because I tested the whole lowercase alphabet and "[b:1]" is valid, though that could be an aromatic boron. It wouldn't make sense for Z or z, however.

  3. Can I find anything in the documentation for RDKit's rules? OpenEye seems to follow this list consistently, including element symbols (lowercase for aromatic) as in SMILES. RDKit doesn't have an easy-to-find list, but it does actually use z and Z, according to their website:

    • Heteroatom neighbor queries:
      • the atom query z matches atoms that have the specified number of heteroatom (i.e. not C or H) neighbors. For example, z2 would match the second C in CC(=O)O.
      • the atom query Z matches atoms that have the specified number of aliphatic heteroatom (i.e. not C or H) neighbors.

Z is not in the Daylight documentation.
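The alphabet test described above can be sketched roughly as follows, assuming RDKit is installed. `Chem.MolFromSmarts` returns `None` when a pattern fails to parse; the helper names `rdkit_parses` and `survey` are made up for illustration, and the parser is passed in so the same loop could be pointed at another toolkit.

```python
import string

try:
    from rdkit import Chem  # optional: only needed by rdkit_parses
except ImportError:
    Chem = None


def rdkit_parses(smarts):
    """True if RDKit builds a query molecule, False if parsing fails."""
    if Chem is None:
        raise RuntimeError("RDKit is not available")
    return Chem.MolFromSmarts(smarts) is not None


def survey(symbols, parses):
    """Wrap each symbol as '[<symbol>:1]' and record whether it parses."""
    return {s: bool(parses("[%s:1]" % s)) for s in symbols}


if Chem is not None:
    # Try every single letter, upper and lower case, as the atom primitive.
    results = survey(string.ascii_letters, rdkit_parses)
    accepted = sorted(s for s, ok in results.items() if ok)
    print("single letters RDKit accepts:", " ".join(accepted))
```

Parsing successfully says nothing about what the primitive means, so any accepted letter that is not an element symbol still needs to be looked up in the toolkit's documentation.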

bannanc commented 5 years ago

I have left this open because of a conversation with Pat Walters at the OpenFF meeting where he implied there were differences between RDKit's and OpenEye's SMARTS implementations beyond our known R vs x issue. I think we've shown, between here and the openforcefield toolkits, that all of the decorators chemper is using are consistent in both toolkits, but it will always be good practice to make sure these are consistent when possible.
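One way to keep checking this is to run the same patterns through every available toolkit and report only the ones where the verdicts differ. The sketch below is an assumption about how that could look, not something chemper does: `find_disagreements` is a hypothetical name, and the two stand-in parsers are hard-coded from the observations in this thread rather than real toolkit calls. Real parsers (for example, thin wrappers around each toolkit's SMARTS parsing function) would be supplied by the caller.

```python
def find_disagreements(patterns, parsers):
    """Return {pattern: {toolkit: parsed?}} where the toolkits disagree."""
    disagreements = {}
    for pattern in patterns:
        verdicts = {name: bool(parse(pattern))
                    for name, parse in parsers.items()}
        # More than one distinct verdict means the toolkits differ.
        if len(set(verdicts.values())) > 1:
            disagreements[pattern] = verdicts
    return disagreements


# Stand-ins mimicking what was reported above; NOT real toolkit calls.
stand_ins = {
    "openeye": lambda s: s not in ("[*;X:1]", "]X["),
    "rdkit": lambda s: s != "]X[",
}

for pattern, verdicts in find_disagreements(
        ["[*;X:1]", "]X[", "[#6X4:1]"], stand_ins).items():
    print(pattern, verdicts)
```

With the stand-ins, only "[*;X:1]" is reported, since "]X[" fails in both and "[#6X4:1]" passes in both.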