IBM / Hestia-OOD

Independent evaluation set construction for trustworthy ML models in biochemistry
https://ibm.github.io/Hestia-OOD/
MIT License

Calculate similarity issue when query and target dataframes are not the same #29

Closed rahulnair23 closed 1 month ago

rahulnair23 commented 1 month ago

**Describe the bug**
When the query and target dataframes are not the same, the indices in the result are inconsistent with the input elements.

**To Reproduce**
Steps to reproduce the behavior:

```python
import pandas as pd
from hestia.similarity import calculate_similarity

smiles = [
    '[H][C]1=[N][C]2=[C]([O][C]([H])([H])[C]3([H])[C]([H])([H])[C]([H])([H])[C]([H])([H])[C]([H])([H])[C]3([H])[H])[N]=[C]([N]([H])[C]3=[C]([H])[C]([H])=[C]([H])[C]([Br])=[C]3[H])[N]=[C]2[N]1[H]',
    '[H][C]1=[N][C]2=[C]([O][C]([H])([H])[C]3([H])[C]([H])([H])[C]([H])([H])[C]([H])([H])[C]([H])([H])[C]3([H])[H])[N]=[C]([N]([H])[C]3=[C]([H])[C]([H])=[C]([H])[C]([H])=[C]3[H])[N]=[C]2[N]1[H]',
    '[H]c1c(c(c(c(c1[H])Cl)[H])N([H])c2nc3c(c(n2)OC([H])([H])C4(C(C(C(C(C4([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[H])N=C(N3[H])[H])[H]',
]

# The query holds all three molecules; the target holds only the first two.
query_df = pd.DataFrame({'smiles': smiles})
target_df = pd.DataFrame({'smiles': smiles[0:2]})
sim_df = calculate_similarity(query_df, target_df, data_type='small_molecule', similarity_metric='fingerprint', field_name='smiles')
# The target index should never exceed len(target_df) - 1, which is 1 here.
print(f"Max index: Query: {sim_df['query'].max()}, Target: {sim_df.target.max()}. ")
```
This returns:

```
Max index: Query: 2, Target: 2.
```


**Expected behavior**
It should return:

```
Max index: Query: 2, Target: 1.
```
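
The expectation above amounts to a bounds check: every `query` index should fall in `[0, len(query_df))` and every `target` index in `[0, len(target_df))`. A minimal sketch of such a check follows. It is not part of Hestia; the helper names `out_of_range` and `check_similarity_indices` are hypothetical, and the index lists are made-up values that mimic the reported output rather than real results from `calculate_similarity`.

```python
def out_of_range(indices, size):
    """Return the sorted, de-duplicated indices that fall outside [0, size)."""
    return sorted({int(i) for i in indices if not 0 <= int(i) < size})


def check_similarity_indices(query_idx, target_idx, n_query, n_target):
    """Collect out-of-range indices per column; empty lists mean consistent."""
    return {
        "query": out_of_range(query_idx, n_query),
        "target": out_of_range(target_idx, n_target),
    }


# Hypothetical index columns mimicking the report: 3 queries and 2 targets,
# yet a target index of 2 shows up.
report = check_similarity_indices([0, 1, 2], [0, 1, 2], n_query=3, n_target=2)
print(report)  # {'query': [], 'target': [2]}
```

With the reproduction above, the same check could presumably be run as `check_similarity_indices(sim_df['query'], sim_df['target'], len(query_df), len(target_df))`, assuming those columns hold positional integer indices.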



**Desktop (please complete the following information):**
 - OS: macOS