ggasoftware / indigo

Indigo: a cheminformatics toolkit. Bingo: RDBMS data cartridge for Oracle, MS SQL Server, and PostgreSQL
https://lifescience.opensource.epam.com/indigo

Postgres cartridge - confusing similarity search results #7

Open zero323 opened 10 years ago

foo=# SELECT bingo.getversion() ;
   getversion    
-----------------
 1.7.9.0 linux64
(1 row)

foo=# SELECT version();
                                           version                                            
----------------------------------------------------------------------------------------------
 PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.8.1-6) 4.8.1, 64-bit
(1 row)

Input data:

We get the same InChI string using the SMILES from both sources:

foo=# SELECT bingo.inchi('CN1C=NC2=C1C(=O)N(C)C(=O)N2C', '') = bingo.inchi('Cn1cnc2c1c(=O)n(C)c(=O)n2C', '');
 ?column? 
----------
 t
(1 row)

Exact search with the 'MAS' option also treats both representations as identical:

foo=# SELECT 'Cn1cnc2c1c(=O)n(C)c(=O)n2C' @  ('CN1C=NC2=C1C(=O)N(C)C(=O)N2C', 'MAS') :: bingo.exact;
 ?column? 
----------
 t
(1 row)

But a similarity search on the same pair returns an extremely low Tanimoto coefficient:

foo=# SELECT bingo.getsimilarity('Cn1cnc2c1c(=O)n(C)c(=O)n2C', 'CN1C=NC2=C1C(=O)N(C)C(=O)N2C', 'tanimoto');
 getsimilarity 
---------------
       0.21875
(1 row)
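For reference, the Tanimoto coefficient of two bit fingerprints is the number of bits set in both divided by the number of bits set in either, so identical fingerprints score 1.0. Below is a minimal sketch in plain Python. The `tanimoto` helper and the two fingerprints are hypothetical and chosen only to illustrate the arithmetic; they are not the fingerprints Bingo computes for caffeine.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of set-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention assumed here: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical fingerprints (illustration only, not Bingo's actual bits).
fp_a = set(range(0, 20))    # bits 0..19
fp_b = set(range(13, 32))   # bits 13..31

# 7 shared bits (13..19) out of 32 distinct bits overall.
print(tanimoto(fp_a, fp_b))  # 0.21875
```

A value as low as 0.21875 therefore means the two inputs share only a small fraction of their fingerprint bits, which would be unexpected for two representations of the same structure.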

I assume it is due to the way aromaticity is handled:

from indigo import Indigo
from indigo_renderer import IndigoRenderer

# Set up Indigo and a renderer producing 200x250 PNG images on a white background
indigo = Indigo()
renderer = IndigoRenderer(indigo)
indigo.setOption("render-output-format", "png")
indigo.setOption("render-image-size", 200, 250)
indigo.setOption("render-background-color", 1.0, 1.0, 1.0)

# Kekulé (non-aromatic) SMILES for caffeine
m1 = indigo.loadMolecule('CN1C=NC2=C1C(=O)N(C)C(=O)N2C')
renderer.renderToFile(m1, "caffeine_m1.png")

[image: caffeine_m1.png, rendering of m1]

# Aromatic SMILES for the same molecule
m2 = indigo.loadMolecule('Cn1cnc2c1c(=O)n(C)c(=O)n2C')
renderer.renderToFile(m2, "caffeine_m2.png")

[image: caffeine_m2.png, rendering of m2]