OpenEye-Contrib / TautEnum

Program to create tautomers and ionisation states relevant to physiological pH.

Description

This project contains the means to build the program taut_enum, and two helper programs mol_diff_viewer and mol_diff_viewer2.

Taut_enum uses sets of SMIRKS to standardise input molecules and optionally perform tautomer enumeration. Initially its purpose was to prepare structures for virtual screening although like any good tool it has been stretched beyond that. Taut_enum has its origins in a program called Leatherface written by Peter Kenny(1). This used SMARTS patterns and an associated control language to do molecular editing and somewhat pre-dated the SMIRKS language. SMIRKS(2) is a language for describing reactions and is more convenient for standardisation and tautomer enumeration, not least because the documentation is already written by someone else.

The taut_enum program has 'canned' SMIRKS descriptions for standardisation, for two levels of tautomer enumeration, and for predicting likely protonation states at physiological pH. It can also be used for more general transformations using SMIRKS files provided by the user.

For preparing a file for virtual screening, it is best run on 2D structures, and the command would be:

    ./taut_enum -I chembl_20_first_10000.smi \
        -O chembl_20_first_10000_VS.smi \
        --original-enumeration \
        --strip-salts \
        --enumerate-protonation

The input file, a piece of Chembl version 20, is in the directory test_dir.

The original-enumeration mode is intended to be used for exploring tautomers likely to be found at physiological pH, so as to make structures suitable for virtual screening. There is another mode, extended-enumeration that does a much more extensive set of transformations. This is intended to be used as part of a compound equivalence identification, such as you'd use in a compound database. Before adding a compound structure to a database, you would probably want to check that it isn't already in there. To make this as reliable as possible, it's wise to assume that two chemists might not agree on an appropriate tautomer, and explore a wide range of possibilities.

Taut_enum also has an option to produce a canonical tautomer. There's nothing magical or chemically sensible about this. It merely enumerates all tautomers according to the standard or extended rules, sorts the canonical SMILES strings of the tautomers so produced into alphanumerical order, and returns the first in the list. In principle, this means two molecules containing the same functional group can end up with that group in different tautomeric forms in their canonical tautomers, if the rest of each molecule changes the sort order of the SMILES strings.
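The selection rule just described can be sketched in a few lines of Python. This is only an illustration of "sort the SMILES and take the first", not the program's own code: the function name is hypothetical, the example SMILES are hand-written rather than OEChem canonical SMILES, and Python's default string ordering is assumed to stand in for the program's alphanumerical order.

```python
def pick_canonical_tautomer(tautomer_smiles):
    """Return the first SMILES after sorting, as in the rule described above."""
    if not tautomer_smiles:
        raise ValueError("no tautomers supplied")
    # De-duplicate, then sort. Python orders strings by code point, which is
    # an assumption standing in for the program's "alphanumerical" order.
    return sorted(set(tautomer_smiles))[0]


# Hand-written SMILES for two 1,2,3-triazole tautomers, for illustration only.
tautomers = ["c1cn[nH]n1", "c1c[nH]nn1"]
chosen = pick_canonical_tautomer(tautomers)
```

Because the choice depends only on string order, a substituent elsewhere in the molecule can change which tautomer sorts first, which is the caveat noted above.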

In the test_dir directory there's a script run_taut_enum.sh which shows the different enumeration modes being run on triazoles.smi.

Running the program with no command-line arguments or --help will produce a full list of options with some associated explanation.

As well as the program taut_enum itself, there are 2 helper programs, mol_diff_viewer and mol_diff_viewer2. These were thrown together to help with development of taut_enum, and are used to identify differences between two SMILES files. They are included here because why not?

Program mol_diff_viewer takes 2 input files which are assumed to be 2 sets of molecules in the same order with the same names. It scans down the two lists and if it finds that the same name is associated with different SMILES strings then it shows that pair side by side. It's so you can see where two standardisation approaches differ, for example.

Try

    ./mol_diff_viewer chembl_20_first_10000.smi \
        chembl_20_first_10000_std.smi
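The comparison that mol_diff_viewer makes can be approximated without any graphics. The following Python sketch is an assumption about the logic, not the program's implementation: it compares raw SMILES strings rather than molecules, assumes each line is "SMILES name", and all function names are hypothetical.

```python
def parse_records(lines):
    """Turn 'SMILES name' lines into (smiles, name) tuples, skipping blank lines."""
    records = []
    for line in lines:
        parts = line.strip().split(None, 1)
        if not parts:
            continue
        name = parts[1] if len(parts) > 1 else ""
        records.append((parts[0], name))
    return records


def paired_differences(lines_a, lines_b):
    """Return (name, smiles_a, smiles_b) where same-named records differ."""
    diffs = []
    # Both inputs are assumed to list the same molecules in the same order.
    for (smi_a, name_a), (smi_b, name_b) in zip(parse_records(lines_a),
                                                 parse_records(lines_b)):
        # Raw string comparison; a real tool would compare structures.
        if name_a == name_b and smi_a != smi_b:
            diffs.append((name_a, smi_a, smi_b))
    return diffs


original = ["CC(=O)[O-] mol1", "c1ccccc1 mol2"]
standardised = ["CC(=O)O mol1", "c1ccccc1 mol2"]
differences = paired_differences(original, standardised)
```

Here only mol1 would be reported, since its SMILES string changed between the two inputs.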

Program mol_diff_viewer2 is similar, but the 2 input files can have different numbers of SMILES strings for each molecule name. It shows side by side those molecules where the SMILES strings are different either in structure or number. It's for examining different tautomer programs or rules, for example. If one tautomer enumerator gives 2 tautomers and the other 4, the two sets of structures and SMILES strings will be shown alongside each other.

Try

    ./mol_diff_viewer2 chembl_20_first_10000.smi \
        chembl_20_first_10000_orig.smi
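The grouped comparison done by mol_diff_viewer2 can be sketched in the same spirit. This Python example is an assumption about the logic only: it groups raw SMILES strings by molecule name, treats each group as a set (so repeated identical strings count once), and also reports names present in just one input. The function names are hypothetical.

```python
from collections import defaultdict


def group_by_name(lines):
    """Map each molecule name to the set of SMILES strings listed for it."""
    groups = defaultdict(set)
    for line in lines:
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        name = parts[1].strip() if len(parts) > 1 else ""
        groups[name].add(parts[0])
    return groups


def grouped_differences(lines_a, lines_b):
    """Return {name: (smiles_a, smiles_b)} for names whose SMILES sets differ."""
    groups_a = group_by_name(lines_a)
    groups_b = group_by_name(lines_b)
    diffs = {}
    for name in sorted(set(groups_a) | set(groups_b)):
        set_a = groups_a.get(name, set())
        set_b = groups_b.get(name, set())
        # Sets differ if either the strings or their number differ.
        if set_a != set_b:
            diffs[name] = (sorted(set_a), sorted(set_b))
    return diffs


enum_a = ["c1c[nH]nn1 t1", "c1cn[nH]n1 t1", "CCO eth"]
enum_b = ["c1c[nH]nn1 t1", "CCO eth"]
diffs = grouped_differences(enum_a, enum_b)
```

In this example t1 is reported because one enumerator gave two tautomers and the other gave one, while eth is identical in both and is skipped.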

Building the programs

Requires: a recent version of OEChem, a relatively recent version of Boost (1.55 and 1.60 are known to work). For mol_diff_viewer you will also need Qt (version >5.2) and Cairo. The qmake from an appropriate Qt must be in your path.

To build it, use the CMakeLists.txt file in the src directory. It requires the following environment variable to point to a relevant place:

OE_DIR - the top level of an OEChem distribution

Then cd to src and do something like:

    mkdir dev-build
    cd dev-build
    cmake -DCMAKE_BUILD_TYPE=DEBUG ..
    make

If all goes to plan, this will make a directory src/../exe_DEBUG with the executables in it. These will have debugging information in them.

For a release version:

    mkdir prod-build
    cd prod-build
    cmake -DCMAKE_BUILD_TYPE=RELEASE ..
    make

and you'll get stuff in src/../exe_RELEASE which should have full compiler optimisation applied.

By default, only taut_enum is built, so as not to inconvenience people who don't have Qt and/or Cairo installed. In order to build mol_diff_viewer, do

cmake -DCMAKE_BUILD_TYPE=DEBUG -DBUILD_GRAPHICS_PROGRAMS=ON ..

If you don't want to use the system-supplied Boost distribution in /usr/include, set BOOST_ROOT to point to the location of a recent (>1.48) build of the Boost libraries. On my CentOS 6.5 machine, the system Boost is 1.41, which isn't good enough. You will also probably need to use '-DBoost_NO_BOOST_CMAKE=TRUE' when running cmake:

cmake -DCMAKE_BUILD_TYPE=RELEASE -DBoost_NO_BOOST_CMAKE=TRUE ..

These instructions have only been tested on CentOS 6 and Ubuntu 14.04 Linux systems. I have no experience of using them on Windows or OSX, and no means of doing so.

Some notes on the program's output

The standardisation routine is not intended to produce a canonical or representative tautomer. It merely puts the input molecules into a tautomer that the enumeration routines can deal with. Thus, in the triazoles.smi case, the 4-H triazole is converted to the 2-H tautomer, as is the 1-H.

When enumerating the tautomers, the program can do either the original enumeration or the extended one, but not both in the same run. If you want both, you need to feed the output of one run into the input of the other. This can be done either with an intermediate file or with a pipe, taking advantage of the OEChem convention that a filename consisting of just an extension (such as .smi) means reading from stdin or writing to stdout, as shown in run_taut_enum.sh.
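For anyone driving the two runs from a script rather than the shell, the piping idea might look roughly like the Python sketch below. It is not taken from run_taut_enum.sh. The -I, -O and --original-enumeration options appear earlier in this document, but the spelling --extended-enumeration is an assumption based on the mode's name and should be checked against the --help output. The function names are hypothetical.

```python
import subprocess


def build_two_stage_commands(infile, outfile, exe="./taut_enum"):
    """Return argv lists for original enumeration piped into extended enumeration."""
    # ".smi" stands for stdout in the first stage and stdin in the second,
    # following the OEChem extension-only filename convention.
    first = [exe, "-I", infile, "-O", ".smi", "--original-enumeration"]
    # "--extended-enumeration" is an assumed flag spelling; confirm with --help.
    second = [exe, "-I", ".smi", "-O", outfile, "--extended-enumeration"]
    return first, second


def run_two_stage(infile, outfile, exe="./taut_enum"):
    """Run both stages connected by a pipe and return their exit codes."""
    first, second = build_two_stage_commands(infile, outfile, exe)
    stage1 = subprocess.Popen(first, stdout=subprocess.PIPE)
    stage2 = subprocess.Popen(second, stdin=stage1.stdout)
    # Close our copy so stage1 is notified if stage2 exits early.
    stage1.stdout.close()
    stage2.wait()
    stage1.wait()
    return stage1.returncode, stage2.returncode


first_cmd, second_cmd = build_two_stage_commands("triazoles.smi", "both.smi")
```

Building the argument lists separately from running them makes it easy to print and inspect the commands before anything is executed.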

References

(1) Kenny PW, Sadowski J (2005). Structure modification in chemical databases. In: Oprea T (ed) Chemoinformatics in drug discovery. Methods and principles in medicinal chemistry 23:271-285.

(2) SMIRKS Theory Manual: http://www.daylight.com/dayhtml/doc/theory/theory.smirks.html

David Cosgrove AstraZeneca 26th January 2016

davidacosgroveaz@gmail.com