not recognizing 1,2,4 triazoles

jw-feng commented 7 years ago

Hi David,

I couldn't get taut_enum to standardize the following three 1,2,4 triazoles into a single tautomer.

Input:
CC1=NN=C(CC)N1 a
CC1=NC(CC)=NN1 b
CC1=NNC(CC)=N1 c

Command:
exe_DEBUG$ ./taut_enum -I triazoles.smi -O output.smi --verbose --extended-enumeration --add-smirks-to-name --add-numbers-to-name

Output:
CCc1nc([nH]n1)C a STAND_36_1
CCc1nc([nH]n1)C b_1
CCc1[nH]nc(n1)C c_1

Looks like entry #3 was not recognized by SMIRKS STAND_36

Here are my three tautomers: [image: triaozle_tautomers]

DavidACosgrove commented 7 years ago

Hi JW, First, sorry for not knowing a better name for you than "JW". Thanks for your interest in taut_enum. You are the first person ever to have contacted me about the program, and it's a couple of years since I last looked at it myself. What you've identified is a severe lack of documentation rather than the program not working as intended. It took me a while to remind myself how it works well enough to establish that, which shows that I really do need to write some better instructions.

The standardise option just puts molecules into a "sensible" tautomer, and is intended to clean things up into an expected form before the enumeration SMIRKS are applied, if that is going to happen. In the case of 1,2,4 triazoles, that just means moving the hydrogen from the isolated (4) nitrogen, to one of the two that are bonded together (1 or 2). If the H is already on 1 or 2, it is not moved. I assume there is a good reason to prefer one of these 2 nitrogens in the general case, but I'm a cheminformatician not a physical chemist so I'm not sure - someone else devised the rules, I just coded them up.

There are 2 enumeration modes, --original-enumeration and --extended-enumeration. For preparing for virtual screening, you should use --original-enumeration. The --extended-enumeration option is used to try and create a set of tautomers that capture less likely tautomers so that when registering a compound into a database, if the chemist draws an unlikely but still feasible tautomer, an exact match in the database, if there is one, is more likely to be found. To use this, you need to run the output from --original-enumeration through the program again. I'm not sure why that is, and I don't think it's at all clear from the README but I have checked the code...

So, if you want to get both triazole structures from a, b and c, the command to use is:

../src/exe_DEBUG/taut_enum -I triazoles.smi -O triazoles_orig.smi --add-smirks-to-name --original

which gives

CCc1nc([nH]n1)C a STAND_36
CCc1[nH]nc(n1)C a STAND_36 ENUM_1
CCc1nc([nH]n1)C b
CCc1[nH]nc(n1)C b ENUM_1
CCc1nc([nH]n1)C c ENUM_1
CCc1[nH]nc(n1)C c

To get just one representative tautomer that's the same for a, b, and c, use the --canonical-tautomer option, which gives:

CCc1nc([nH]n1)C a STAND_36
CCc1nc([nH]n1)C b
CCc1nc([nH]n1)C c ENUM_1

There's nothing special about the canonical tautomer, it's just the first in the list of tautomers for each molecule when the SMILES are sorted into alphanumerical order. Indeed, it's conceivable that 2 different 1,2,4 triazoles might give canonical tautomers with the H on different N atoms (1 or 2) depending on how the substituents change the canonical SMILES. The important thing is that the same structure should always give the same SMILES.
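The "first in the sorted list" idea can be illustrated with a minimal Python sketch. This is not taut_enum's code: the function name `pick_representatives` is hypothetical, and plain Python string ordering is assumed as a stand-in for the tool's own sort. That ordering is not necessarily the same, so the representative chosen here may differ from the one the tool reports. The sketch only shows that a, b and c end up with the same string once each has the same set of enumerated tautomers.

```python
from collections import defaultdict

# Enumeration output as quoted in this thread: SMILES, molecule name, SMIRKS labels.
ENUMERATED = """\
CCc1nc([nH]n1)C a STAND_36
CCc1[nH]nc(n1)C a STAND_36 ENUM_1
CCc1nc([nH]n1)C b
CCc1[nH]nc(n1)C b ENUM_1
CCc1nc([nH]n1)C c ENUM_1
CCc1[nH]nc(n1)C c
"""


def pick_representatives(text):
    """Map each molecule name to the first of its tautomer SMILES in sorted order."""
    tautomers = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        smiles, name = fields[0], fields[1]
        tautomers[name].add(smiles)
    # Assumption: Python's default string ordering stands in for the tool's sort.
    return {name: min(smiles_set) for name, smiles_set in tautomers.items()}


reps = pick_representatives(ENUMERATED)
# True when every input molecule was given the same representative SMILES.
print(len(set(reps.values())) == 1)
```

Because the choice depends only on the set of tautomer strings, any input tautomer that enumerates to the same set gets the same representative, which is the property that matters for normalization.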

So what I need to do is make the documentation clearer, provide some examples of how to use it, and maybe combine the two enumeration modes so that extended enumeration includes original enumeration. I'll aim to get that done over the next week or two.
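Until the modes are combined, the two-pass workflow described above could be scripted roughly as follows. This is a sketch under assumptions: the helper name `two_pass_commands` and the output file name `triazoles_ext.smi` are made up for illustration, and the second pass is assumed to take --extended-enumeration with the first pass's output as its input. Only flags that appear in this thread are used.

```python
def two_pass_commands(exe, input_smi, first_out, second_out):
    """Build argv lists for an original pass followed by an extended pass."""
    first = [exe, "-I", input_smi, "-O", first_out,
             "--original-enumeration", "--add-smirks-to-name"]
    # Assumption: the second pass reads the file written by the first pass.
    second = [exe, "-I", first_out, "-O", second_out,
              "--extended-enumeration", "--add-smirks-to-name"]
    return first, second


first, second = two_pass_commands(
    "./taut_enum", "triazoles.smi", "triazoles_orig.smi", "triazoles_ext.smi")
print(" ".join(first))
print(" ".join(second))
```

Each list can be handed to `subprocess.run(argv, check=True)` in turn, so that the second pass only starts if the first one succeeded.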

If any of that doesn't make any sense, please get back to me, Cheers, Dave

jw-feng commented 7 years ago

Hi Dave,

First of all, I want to thank you for open sourcing this code. I go by JW, which are the initials of my first name, 健文.

I am just looking for a canonical tautomer, as I need to normalize molecules before using Jameed Hussain's method (J. Chem. Inf. Model. 2010, 50, 339–348) to find all matched molecular pairs. Jameed implemented the GSK method using RDKit (https://github.com/rdkit/rdkit/tree/master/Contrib/mmpa).

The following command worked as expected and generated identical SMILES strings:

./taut_enum -I triazoles.smi -O output.smi --original-enumeration --add-smirks-to-name --canonical-tautomer

Best,

JW

DavidACosgrove commented 7 years ago

Hi JW, I'm glad you got it working. In principle, I believe, the SMIRKS patterns can be used in RDKit as reaction SMARTS if you wanted to build them into your code directly. I had a brief chat with Greg at the UGM last year and he said that in many cases SMIRKS and reaction SMARTS are interchangeable. Trying that is vaguely on my long-term list. The patterns are in taut_enum_default_standardise_smirks.H. As a freelance consultant/developer these days, I could be persuaded in the traditional manner to raise it up the list!

As for thanks for open sourcing the code, that is really due to AstraZeneca. I did the paperwork and made the case, but it was they who agreed to do it. All the best, Dave

-- David Cosgrove Freelance computational chemistry and chemoinformatics developer http://cozchemix.co.uk

jw-feng commented 7 years ago

Hi Dave,

I did talk to some of my former colleagues at Genentech about including the SMIRKS from taut_enum_default_standardise_smirks.H in the Chemalot package that's on GitHub. At the moment, I am interested in generating a canonical tautomer. To do that, I still need to enumerate the tautomers and sort them into lexical order. I don't see how I could do this quickly without reimplementing much of your code in Java or Python.

JW

OpenEye-Contrib / TautEnum