epam / Indigo

Universal cheminformatics toolkit, utilities and database search tools
http://lifescience.opensource.epam.com
Apache License 2.0

Bingo NoSQL incorrectly reports a similarity score of "1" #204

Open twall opened 3 years ago

twall commented 3 years ago

Using python epam.indigo 1.4.0b0

I've built a Bingo NoSQL DB that returns a similarity score of "1" (the max) when comparing the following two different molecules (the first InChI is given as the search input; both have been inserted into the DB).

InChI=1S/C17H25N3O2/c18-9-14-2-1-3-20(14)15(21)10-19-16-5-12-4-13(6-16)8-17(22,7-12)11-16/h12-14,19,22H,1-8,10-11H2/t12?,13?,14-,16?,17?/m0/s1

Vildagliptin

Matching molecule:

InChI=1S/C18H24N4O2/c19-8-14-1-2-15(9-20)22(14)16(23)10-21-17-4-12-3-13(5-17)7-18(24,6-12)11-17/h12-15,21,24H,1-7,10-11H2/t12?,13?,14-,15+,17?,18?

CHEMBL207912

Bingo responds with a similarity score of 1.0, which is obviously not correct (assuming "1" means an exact match). I would expect that extra triple-bonded nitrogen to have some downward impact on the score.

Here is the code which constructs the DB:

db = bingo.Bingo.createDatabaseFile(indigo, dbfile, 'molecule', '')

mol = indigo.loadMolecule(inchi)
try:
    mol.standardize()
except Exception:
    pass  # standardization is best-effort; insert the molecule as loaded
db.insert(mol, index_key)

Here is the code which does the similarity search:

simhits = []
indigo = get_indigo()
bb = bingo.Bingo.loadDatabaseFile(indigo, db_path)
try:
    m = indigo.loadMolecule(inchi)
    matcher = bb.searchSim(m, tanimoto_min, tanimoto_max, 'tanimoto')
    while matcher.next():
        simhits.append((matcher.getCurrentId(), matcher.getCurrentSimilarityValue()))
    matcher.close()
    bb.close()
except IndigoException as e:
    logger.error(f"Can't calculate similarities on molecule '{inchi}' ({e})")
return simhits

mkviatkovskii commented 3 years ago

Dear @twall, thank you for the bug report. I have reproduced the problem and plan to investigate it soon.

As a workaround, you can now use non-default fingerprint types by setting the option:

indigo.setOption("similarity-type", sim_type)
bingo = Bingo.createDatabaseFile(indigo, dbPath, 'molecule', '')

Where sim_type is one of:

Unfortunately, you have to rebuild the database for the new fingerprints to take effect.

twall commented 3 years ago

@mkviatkovskii thank you. Is there documentation available for the sim_type options?

twall commented 3 years ago

@mkviatkovskii I've regenerated the bingo db using the "chem" similarity type, and still get false "1.0" matches.

In this case,

Aspirin: InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)

Aspirin

o-Formylphenoxyacetic acid InChI=1S/C9H8O4/c10-5-7-3-1-2-4-8(7)13-6-9(11)12/h1-5H,6H2,(H,11,12)

O-Acetyl-p-hydroxybenzoic acid InChI=1S/C9H8O4/c1-6(10)13-8-4-2-7(3-5-8)9(11)12/h2-5H,1H3,(H,11,12)

While the last example differs only in where the groups are attached to the aromatic ring, I still wouldn't consider that an exact match. If such differences are ignored by design, the Bingo documentation should make that clear, either in the description of the similarity methods or in the fingerprint descriptions.
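One way to separate true identities from look-alike hits is to compare identifiers after the search. The following is only a minimal sketch, not part of Bingo: `split_exact_hits`, the `inchi_by_id` mapping and the record ids 1 to 3 are hypothetical names, and it assumes the InChI of every inserted record was kept alongside its id and that all InChI strings were generated the same way.

```python
def split_exact_hits(query_inchi, hits, inchi_by_id, top_score=1.0):
    """Split top-scoring hits into true identities and look-alikes.

    hits: iterable of (record_id, score) pairs collected from the search.
    inchi_by_id: mapping from record_id to the InChI stored for that record
                 (hypothetical side table, not something Bingo provides here).
    """
    exact, lookalike = [], []
    for record_id, score in hits:
        if score < top_score:
            continue  # only hits at the maximum score are of interest
        if inchi_by_id.get(record_id) == query_inchi:
            exact.append(record_id)
        else:
            lookalike.append(record_id)
    return exact, lookalike


# The three C9H8O4 structures from this comment, under hypothetical ids.
aspirin = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
inchi_by_id = {
    1: aspirin,
    2: "InChI=1S/C9H8O4/c10-5-7-3-1-2-4-8(7)13-6-9(11)12/h1-5H,6H2,(H,11,12)",
    3: "InChI=1S/C9H8O4/c1-6(10)13-8-4-2-7(3-5-8)9(11)12/h2-5H,1H3,(H,11,12)",
}
hits = [(1, 1.0), (2, 1.0), (3, 1.0)]

exact, lookalike = split_exact_hits(aspirin, hits, inchi_by_id)
print(exact, lookalike)  # [1] [2, 3]
```

String comparison is the simplest possible check; it only works when the stored and query identifiers come from the same generator with the same options.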

mkviatkovskii commented 2 years ago

A similarity score of 1.0 does not necessarily mean an exact match; it only means that all hashes of all sub-chains collided. So it's a case where we should consider improving the fingerprint calculation, but it is not a bug.
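To illustrate the point, here is a toy sketch of a hashed, presence-only fingerprint with a Tanimoto score. It is not Indigo's actual fingerprint code: the fragment strings, the 2048-bit width and the helper names `fragment_bit`, `fingerprint` and `tanimoto` are all assumptions made for the example. Two inputs that contain the same kinds of sub-chains, in different numbers, end up with identical bit sets and therefore score 1.0.

```python
import hashlib

N_BITS = 2048  # toy fingerprint width (assumption, not Indigo's real size)


def fragment_bit(fragment: str, n_bits: int = N_BITS) -> int:
    """Map a fragment string to one bit position with a stable hash."""
    digest = hashlib.sha256(fragment.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def fingerprint(fragments, n_bits: int = N_BITS) -> frozenset:
    """Return the set of bit positions switched on by the fragments.

    Only presence is recorded: how often a fragment occurs, and where it
    sits in the molecule, is lost.
    """
    return frozenset(fragment_bit(f, n_bits) for f in fragments)


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either."""
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Hypothetical fragment lists for two different molecules that contain the
# same kinds of sub-chains, in different numbers.
mol_a = ["C-C", "C-O", "C=O", "c:c", "c:c-O"]
mol_b = ["C-C", "C-O", "C-O", "C=O", "C=O", "c:c", "c:c-O"]

score = tanimoto(fingerprint(mol_a), fingerprint(mol_b))
print(score)  # 1.0, although mol_a and mol_b are different fragment lists
```

In this picture a score of 1.0 says "the same bits are set", which is a weaker statement than "the same structure".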