epam / Indigo

Universal cheminformatics toolkit, utilities and database search tools
http://lifescience.opensource.epam.com
Apache License 2.0

BingoNoSQL produces different similarity scores for the same molecule #257

Open twall opened 3 years ago

twall commented 3 years ago

I have an input SMILES string for which Bingo produces different similarity results depending on whether the indexed molecule is provided as InChI or as SMILES. I need the similarity results to be consistent regardless of the initial encoding of the indexed molecule.

Input:

SMILES (a): CC(C)(C)OC(=O)N(CC1=CN=C(C=C1)OC)C2=NC(=C(C=C2)C=O)F

Indexed molecule (as SMILES)

SMILES (b): CC1=C(C=NC=C1C2=C(C(=C3C=NC(=CC3=C2)NC(=O)OC4CCOCC4)N)F)N

Indexed molecule (as InChI)

InChI=1S/C21H22FN5O3/c1-11-15(8-25-10-17(11)23)14-6-12-7-18(26-9-16(12)20(24)19(14)22)27-21(28)30-13-2-4-29-5-3-13/h6-10,13H,2-5,23-24H2,1H3,(H,26,27,28)
SMILES (c), from mol.smiles() on the above molecule: Cc1c(N)c[n]cc1-c1c(F)c(N)c2c(cc(N=C(OC3CCOCC3)O)[n]c2)c1 |w:20,r|

Note that Indigo reports the two molecules as equal if the objects are created from SMILES (b) and SMILES (c).

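# Import paths assumed for the epam.indigo package; adjust to your installed version.
from indigo import Indigo
from indigo.bingo import Bingo

indigo = Indigo()  # see options used below
smiles_a = 'CC(C)(C)OC(=O)N(CC1=CN=C(C=C1)OC)C2=NC(=C(C=C2)C=O)F'
smiles_b = 'CC1=C(C=NC=C1C2=C(C(=C3C=NC(=CC3=C2)NC(=O)OC4CCOCC4)N)F)N'
inchi = 'InChI=1S/C21H22FN5O3/c1-11-15(8-25-10-17(11)23)14-6-12-7-18(26-9-16(12)20(24)19(14)22)27-21(28)30-13-2-4-29-5-3-13/h6-10,13H,2-5,23-24H2,1H3,(H,26,27,28)'

# Load and standardize the query (a) and both encodings of the indexed molecule.
mol = indigo.loadMolecule(smiles_a)
mol.standardize()
mol2 = indigo.loadMolecule(smiles_b)
mol2.standardize()
mol3 = indigo.loadMolecule(inchi)
mol3.standardize()

# Index the same compound twice: id 2 from SMILES, id 3 from InChI.
db = Bingo.createDatabaseFile(indigo, 'bingodb', 'molecule', '')
db.insert(mol2, 2)
db.insert(mol3, 3)

# Tanimoto search over the full 0..1 range so that both entries are returned.
matcher = db.searchSim(mol, 0, 1, 'tanimoto')
results = []
while matcher.next():
    results.append((matcher.getCurrentId(), matcher.getCurrentSimilarityValue()))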

The similarity for the SMILES-based molecule (id 2) is about 0.56. The similarity for the InChI-based molecule (id 3) is about 0.38.
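For reference, the Tanimoto coefficient requested from searchSim is |A ∩ B| / |A ∪ B| over the fingerprint bits that are set. The sketch below uses made-up bit sets, not real Indigo fingerprints, to show how a handful of differing bits lowers the score even when two inputs describe the same compound. SMILES (c) writes the carbamate as N=C(O...)O where SMILES (b) has NC(=O)O..., so the two loaded structures may not set identical bits; whether that is the actual cause here is an assumption.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Hypothetical bit sets, for illustration only (not Indigo fingerprints).
query = {1, 4, 7, 9, 12, 15}
indexed_from_smiles = {1, 4, 7, 9, 20, 21}
indexed_from_inchi = {1, 4, 7, 22, 23, 24, 25}

print(tanimoto(query, indexed_from_smiles))  # 4 shared bits out of 8 in the union
print(tanimoto(query, indexed_from_inchi))   # 3 shared bits out of 10 in the union
```

Comparing the actual fingerprints of the two indexed molecules would show whether something like this is happening here.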

I have several other molecules for which the result differs significantly depending on whether the molecule is created from InChI or from SMILES.

Indigo options:

"ignore-stereochemistry-errors": True,
"standardize-charges": True,
"standardize-keep-largest": True,
"ignore-closing-bond-direction-mismatch": True,
"ignore-bad-valence": True,
"standardize-stereo": True,
"standardize-neutralize-zwitterions": True,
"standardize-clear-unusual-valences": True,
"similarity-type": "sim",
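For anyone reproducing this, the options can be applied in one pass with setOption. This is a minimal sketch: apply_options is a hypothetical helper, and the last entry is written as "similarity-type", which is assumed to be the option the final line of the list refers to.

```python
def apply_options(session, options):
    """Call session.setOption(name, value) for every entry, in order."""
    for name, value in options.items():
        session.setOption(name, value)


OPTIONS = {
    "ignore-stereochemistry-errors": True,
    "standardize-charges": True,
    "standardize-keep-largest": True,
    "ignore-closing-bond-direction-mismatch": True,
    "ignore-bad-valence": True,
    "standardize-stereo": True,
    "standardize-neutralize-zwitterions": True,
    "standardize-clear-unusual-valences": True,
    "similarity-type": "sim",  # assumed spelling of the last option
}

# Usage with a real session (requires the Indigo package):
#   from indigo import Indigo
#   session = Indigo()
#   apply_options(session, OPTIONS)
```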

twall commented 2 years ago

There may also be additional discrepancies depending on whether the starting molecule was created via SMILES or InChI:

source SMILES: CC[C@@]1(C[C@H](C1)[C@](O)(c1ccc(Cl)[n]c1Cl)c1cc([n]c(N[C@@H](C)C(F)(F)F)[n]1)C(F)(F)F)NS

source InChI: InChI=1S/C20H21Cl2F6N5OS/c1-3-17(33-35)7-10(8-17)18(34,11-4-5-14(21)32-15(11)22)12-6-13(20(26,27)28)31-16(30-12)29-9(2)19(23,24)25/h4-6,9-10,33-35H,3,7-8H2,1-2H3,(H,29,30,31)/t9-,10-,17+,18-/m0/s1

target SMILES: C[C@@H](Nc1nc(N[C@H](C)C(F)(F)F)nc(-c2cccc(Cl)n2)n1)C(F)(F)F

target InChI: InChI=1S/C14H13ClF6N6/c1-6(13(16,17)18)22-11-25-10(8-4-3-5-9(15)24-8)26-12(27-11)23-7(2)14(19,20)21/h3-7H,1-2H3,(H2,22,23,25,26,27)/t6-,7-/m1/s1

score input/bingodb encoding
0.30 inchi/inchi
0.27 smiles/smiles
0.14 smiles/inchi
0.12 inchi/smiles
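A grid like this can be produced by looping over every input/index encoding pair. The sketch below assumes a caller-supplied score(query_text, indexed_text) function, a hypothetical wrapper around the load, insert and searchSim calls from the first comment. The stand-in scorer at the bottom only demonstrates the loop and does not produce real similarities.

```python
from itertools import product


def encoding_grid(query_forms, target_forms, score):
    """Score every (input encoding, indexed encoding) pair, highest score first."""
    rows = [
        (score(query_forms[q], target_forms[t]), f"{q}/{t}")
        for q, t in product(query_forms, target_forms)
    ]
    return sorted(rows, reverse=True)


def print_grid(rows):
    print("score input/bingodb encoding")
    for value, label in rows:
        print(f"{value:.2f} {label}")


# Stand-in scorer, for illustration only: replace with a function that indexes
# indexed_text in Bingo and returns the similarity reported for query_text.
def fake_score(query_text, indexed_text):
    return 1.0 if query_text == indexed_text else 0.5


print_grid(encoding_grid(
    {"smiles": "<source SMILES>", "inchi": "<source InChI>"},
    {"smiles": "<target SMILES>", "inchi": "<target InChI>"},
    fake_score,
))
```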

And when the target molecule C[C@@H](Nc1nc(N[C@H](C)C(F)(F)F)nc(-c2cccc(Cl)n2)n1)C(F)(F)F is compared against itself:

1.00 smiles/smiles
1.00 inchi/inchi
0.33 inchi/smiles
0.33 smiles/inchi
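The size of the discrepancy can be expressed as the spread between the highest and lowest score across encodings, which would be close to zero if scoring were independent of encoding. A small sketch using the numbers reported above; the 0.01 tolerance is an arbitrary assumption.

```python
def score_spread(scores):
    """Difference between the highest and lowest score across encoding pairs."""
    values = list(scores.values())
    return max(values) - min(values)


def is_encoding_consistent(scores, tolerance=0.01):
    """True when all encoding pairs agree to within the (assumed) tolerance."""
    return score_spread(scores) <= tolerance


# Scores as reported in this comment, keyed by input/bingodb encoding.
source_vs_target = {
    "inchi/inchi": 0.30,
    "smiles/smiles": 0.27,
    "smiles/inchi": 0.14,
    "inchi/smiles": 0.12,
}
target_vs_itself = {
    "smiles/smiles": 1.00,
    "inchi/inchi": 1.00,
    "inchi/smiles": 0.33,
    "smiles/inchi": 0.33,
}

for name, scores in [("source vs target", source_vs_target),
                     ("target vs itself", target_vs_itself)]:
    print(name, round(score_spread(scores), 2), is_encoding_consistent(scores))
```

A check like this could serve as a regression test once the scoring no longer depends on the input encoding.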