epam / Indigo

Universal cheminformatics toolkit, utilities and database search tools
http://lifescience.opensource.epam.com
Apache License 2.0
314 stars 102 forks source link

PostgreSQL:Efficient way to find unique product from reaction. #534

Closed sitanshubhunia closed 2 years ago

sitanshubhunia commented 2 years ago

I have successfully installed cartridge in PostgteSQL ( Bingo 1.9.1 in Windows).

For testing I have load sdf file as well as SMILES file into a table and create index on that table. Try substructure search it's working fine. Also similarity search working fine. Previously I have faced some issue ( ref #527 )that resolved .

Now I have some questions before implemented it in our project.

Q1. After creation of index using bingo_idx on molecule column I have seen two new tables are created. There I have seen a column named "xyz". What is the purpose of this? If this column is necessary, then how to use this?

Q2. Before smiliraty search is it required to generate Fingerprint? If yes then how to use please give a example or pseudo code.

Q3. In our project we have more than 1 Million reactions in reaction CML format. Some of which are duplicate reactions. How I efficiently searched duplicate reactions ? And how I segregated the unique products that we have presently ?

Thanks in advance

AlexanderSavelyev commented 2 years ago

Hi @sitanshubhunia Q1: additional tables are created to perform quick exact, mw and gross search operations. Please do not touch or update the tables, it is removed automatically after bingo index is dropped. User should not use the tables, but the tables are used by index operations (e.g. while running @ ('$query', '$parameters')::bingo.exact)

AlexanderSavelyev commented 2 years ago

Q2. if you want to use similarity search, just create and index for a column create index $index on $table using bingo_idx ($column bingo.molecule) select * from $table where $column @ ($bottom, $top, '$query', '$metric')::bingo.sim

One can run explain command to ensure that index is used e.g. explain select a from btest where a @ (0, 0.5, 'Cn1c(=O)c(=O)c2cc(C(S)=N)ccc12', '')::bingo.sim Index Scan using btest_idx on btest (cost=0.00..5.01 rows=1 width=39) Index Cond: (a @ '(0,0.5,"Cn1c(=O)c(=O)c2cc(C(S)=N)ccc12","")'::bingo.sim)

AlexanderSavelyev commented 2 years ago

Q3. it depends on your duplicates definition. There can be different cases for comparing two reactions to verify identity, e.g. mapping, stereo, charges, tautomers, etc One can use different identifier representation for a reaction, e.g. canonical smiles, inchi, hash (fingerprints) strings, etc and then take identifier string as a duplicate check and/or aggregation (one need to create a default text index then) - it is one of possible ways

sitanshubhunia commented 2 years ago

Hi @AlexanderSavelyev Thank you for clarifying all my queries.

Answer related to Q1, I understand what you wants to say. But what is "xyz" column [ which is created by cartridge itself for search]

What is the purpose of this column "xyz"? If this column is necessary, then how to use this?

index