epam / Indigo

Universal cheminformatics toolkit, utilities and database search tools
http://lifescience.opensource.epam.com
Apache License 2.0

Add documentation on using external fingerprints in indigo/bingo #408

Open twall opened 3 years ago

twall commented 3 years ago

Indigo.loadFingerprintFromDescriptors() and Indigo.loadFingerprintFromBuffer() exist, as does Bingo.insertWithExtFP(), but the documentation on how to use these to implement a bingo DB with external fingerprints is lacking.

I have some vectors of measurements that I'd like to convert into fingerprints in order to perform similarity lookup using bingo, but I can't determine exactly how (even after perusing the source for a while).

descriptors = ?
fp = indigo.loadFingerprintFromDescriptors(descriptors, ?, ?)
bingodb.insertWithExtFP(?, fp)

What should be where the question marks are? What's the reasoning/process in converting an array of (normalized) floats into a vector of bits?

twall commented 3 years ago

Attempting to create a db like this (the fingerprint length is about 1600):

inchi = ...
ext_fp = [...]
db = bingo.Bingo.createDatabaseFile("foo")
indigo = Indigo()
mol = indigo.loadMolecule(inchi)
sim_fp = indigo.loadFingerprintFromBuffer(ext_fp)
db.insertWithExtFP(mol, sim_fp)

Results in the following error:

indigo.bingo.BingoException: 'BaseSimilarityMatcher: external fingerprint is incompatible with current database'

IuriiPuzanov commented 3 years ago

Hi Timothy, please take a look at a few tests of the external fingerprint functionality (attached). I hope they help you use such fingerprints for your tasks.

fp-from-descriptors.py.txt bingo_settings.py.txt ext_fp.py.txt

Please make sure that the fingerprint settings are the same for the indigo and bingo instances.

Best Regards! Iurii

twall commented 3 years ago

Hi @IuriiPuzanov ,

Thanks for the information. I have a few questions.

IuriiPuzanov commented 3 years ago

Hi Timothy,

In the test, the usual fingerprint size is used; in this case the only requirement is that the fingerprint size be the same for all molecules in the database. loadFingerprintFromBuffer takes a buffer containing the fingerprint itself, i.e. an array of bits packed into bytes. It is also better to normalize the descriptors to the range 0.0 to 1.0 for more predictable results; the test uses values outside this range only to check the robustness of the algorithm in special cases.

Best Regards! Iurii
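The two preparation steps mentioned in this reply (scaling descriptors into the 0.0 to 1.0 range, and packing bits into bytes for a raw fingerprint buffer) can be sketched in plain Python. This is only an illustration: the helper names `normalize` and `pack_bits` are hypothetical, min-max scaling is one possible normalization, and the most-significant-bit-first order is an assumption, since the thread does not say which bit order loadFingerprintFromBuffer expects.

```python
def normalize(values):
    """Min-max scale a list of numbers into [0.0, 1.0].

    A constant input has no spread, so it is mapped to all zeros.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pack_bits(bits):
    """Pack a sequence of 0/1 values into bytes.

    ASSUMPTION: the first bit goes into the most significant bit of the
    first byte; the last byte is zero-padded. Verify the bit order
    against the library before relying on it.
    """
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)
```

Whatever packing is used, the resulting buffer length still has to match the fingerprint size configured for the database, as noted above.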

twall commented 3 years ago

@IuriiPuzanov

Thank you for your response.

IuriiPuzanov commented 3 years ago

Hi Timothy,

Actually, fp_density is used only for the actual density values output for the generated fingerprints. How to use this parameter with descriptors depends on the actual descriptor values and the desired sensitivity. In most cases a value of 0.5 is acceptable. As for the relation between sim_qwords and fingerprint size, you are absolutely right, and the fingerprint size has no direct dependency on the length of the descriptors array. In any case, all descriptors will be packed or scattered across the available fingerprint bits, but the fingerprint size should be large enough for the desired sensitivity.

Best Regards! Iurii
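As a rough aid for reasoning about sizes, the arithmetic can be written out. This sketch assumes that the similarity fingerprint is sized in 64-bit qwords, as the name sim_qwords suggests; the thread confirms that such a relation exists but does not spell it out. The function names and the value 8 in the usage line are illustrative only.

```python
QWORD_BYTES = 8  # a quadword is 64 bits, i.e. 8 bytes


def sim_fp_size_bytes(sim_qwords):
    """Byte size of a fingerprint made of `sim_qwords` 64-bit words.

    ASSUMPTION: the fingerprint size is exactly sim_qwords qwords.
    """
    if sim_qwords <= 0:
        raise ValueError("sim_qwords must be positive")
    return sim_qwords * QWORD_BYTES


def sim_fp_size_bits(sim_qwords):
    """Number of bits available for descriptors to be scattered across."""
    return sim_fp_size_bytes(sim_qwords) * 8


# Illustrative value only: 8 qwords would give 64 bytes (512 bits).
size = sim_fp_size_bytes(8)
```

Under this assumption the number of available bits is fixed by the setting, not by how many descriptors are supplied, which matches the point that the descriptor array length does not determine the fingerprint size.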

twall commented 2 months ago

Finally getting back to this, it seems the ability to successfully invoke db.insertWithExtFP() depends on the fingerprint size in bytes; a value of 64 works, but other values produce the error indigo.bingo.bingo_exception.BingoException: insert fail: external fingerprint is incompatible with current database. What are the constraints on the fingerprint size in bytes? Does it simply specify how big a structure to use within the database, independent of the length of the fingerprint descriptors?

twall commented 2 months ago

I'm using rdkit to generate MACCS fingerprints like the following:

from rdkit import Chem
from rdkit.Chem import MACCSkeys
rmol = Chem.MolFromSmiles(mol.smiles())
maccs_fp = MACCSkeys.GenMACCSKeys(rmol)
bs = list(int(s) for s in maccs_fp.ToBitString())
fp = indigo.loadFingerprintFromDescriptors(bs, 64, 0.5)

maccs_fp.ToBitString() generates an array of zeroes and ones, so while it seems to work, it seems to be less information than loadFingerprintFromDescriptors expects (array of floats between zero and one).

Is this the correct way to load an external fingerprint, or am I missing something?

IuriiPuzanov commented 2 months ago

Hi Timothy,

It looks like you just need to convert the ints into floats and pass that array to loadFingerprintFromDescriptors.

Best Regards! Iurii
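A minimal sketch of that conversion, written without rdkit or indigo so it runs on its own. The helper name `bitstring_to_floats` is hypothetical; its input stands in for the string returned by maccs_fp.ToBitString() in the earlier snippet, and the commented call repeats the arguments used there.

```python
def bitstring_to_floats(bit_string):
    """Turn a string of '0'/'1' characters into a list of 0.0/1.0 floats."""
    values = []
    for ch in bit_string:
        if ch not in "01":
            raise ValueError(f"unexpected character in bit string: {ch!r}")
        values.append(float(ch))
    return values


# Stand-in for maccs_fp.ToBitString(); a real MACCS bit string is longer.
descriptors = bitstring_to_floats("0110")

# With an Indigo instance available, the call from the thread would then be:
# fp = indigo.loadFingerprintFromDescriptors(descriptors, 64, 0.5)
```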