coleygroup / molpal

active learning for accelerated high-throughput virtual screening

[BUG]: #39

Open coparks2012 opened 2 years ago

coparks2012 commented 2 years ago

Describe the bug: I am currently unable to reproduce the top-k SMILES metric results for the MPN model from the MolPAL paper using the code in this GitHub repo.

Tables S1-S3 list the following percentages: 59.6 +/- 2.3 (10k library) and 68.8 +/- 1.0 (50k library). My current runs repeatedly yield worse metrics: ~44 +/- 3.7 (10k library) and 64.8 +/- 1.7 (50k library). I haven't been able to run the HTS library yet due to excessive compute time. Regardless, these metrics differ noticeably from the manuscript.

I have confirmed that my molpal clone/conda environment reproduces the reported RF and NN results. This makes me suspect that something in the current molpal MPN code is driving the issue.

Example(s): I have created shell scripts to run molpal with the MPN model over five runs, as performed in the paper. An example is shown below for the 10k library; an analogous script launches the runs for the 50k library.

rm -r ./molpal_10k/
rm -r ./our_runs/
mkdir our_runs

python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
mv molpal_10k/ our_runs/run1/

rm -r ./molpal_10k/
python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
mv molpal_10k/ our_runs/run2/

rm -r ./molpal_10k/
python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
mv molpal_10k/ our_runs/run3/

rm -r ./molpal_10k/
python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
mv molpal_10k/ our_runs/run4/

rm -r ./molpal_10k/
python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
mv molpal_10k/ our_runs/run5/
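
For completeness, the same five repeats could also be driven by a short Python loop rather than a copy-pasted shell script. This is just a sketch, assuming run.py is invoked from the repo root and that the config writes its output to molpal_10k/ as above:

import shutil
import subprocess
from pathlib import Path

CMD = [
    "python", "run.py",
    "--config", "examples/config/Enamine10k_retrain.ini",
    "--metric", "greedy",
    "--init-size", "0.01",
    "--batch-sizes", "0.01",
    "--model", "mpn",
]

Path("our_runs").mkdir(exist_ok=True)
for i in range(1, 6):
    shutil.rmtree("molpal_10k", ignore_errors=True)  # clear any stale output
    subprocess.run(CMD, check=True)                  # one full MolPAL run
    shutil.move("molpal_10k", f"our_runs/run{i}")    # archive this repeat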

I am evaluating the runs with the fraction of top-k SMILES identified metric. Below is the script I am using to parse the results.

import pandas as pd
import numpy as np

def compare(df1, df2):
    s1, s2 = list(df1.smiles), list(df2.smiles)[:100]
    dups = list( set(s1) & set(s2) )
    print('number of top scoring compounds found:', len(dups) )
    print('total number of ligands explored:', len(s1), len(set(list(s1) )) )
    print('number of unique ligands recorded in top:', len( list(set(s2))))
    print(' ')
    print(' ')
    return len(dups)

dfm = pd.read_csv('./data/Enamine10k_scores.csv.gz',index_col=None, compression='gzip')

files = ['./our_runs/run1/data/top_636_explored_iter_5.csv',\
         './our_runs/run2/data/top_636_explored_iter_5.csv', \
         './our_runs/run3/data/top_636_explored_iter_5.csv', \
         './our_runs/run4/data/top_636_explored_iter_5.csv',\
         './our_runs/run5/data/top_636_explored_iter_5.csv']

num=[]
for ff in files:
    df1 = pd.read_csv(ff,index_col=None)
    num+= [compare(df1, dfm)/100.]

num = np.array(num)
print("statistics:", np.mean(num), np.std(num) )

Here are some example outputs I have obtained:

number of top scoring compounds found: 39
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 45
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 43
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 50
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 45
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

statistics: 0.44400000000000006 0.035552777669262355

I see a similar drop in performance for the 50k library, although the results are closer to the stats reported in the paper.

number of top scoring compounds found: 334
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500

number of top scoring compounds found: 323
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500

number of top scoring compounds found: 325
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500

number of top scoring compounds found: 309
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500

number of top scoring compounds found: 328
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500

statistics: 0.6476 0.016560193235587575

As mentioned, the code currently reproduces the RF and NN results. For example, below is the output for the 50k library when using the NN model.

number of top scoring compounds found: 347
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500

number of top scoring compounds found: 341
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500

number of top scoring compounds found: 346
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500

number of top scoring compounds found: 361
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500

number of top scoring compounds found: 352
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500

statistics: 0.6988 0.01354104870384859

This matches the 70% reported in the paper almost exactly, whereas the MPN model is off by a statistically significant amount.

Expected behavior: My understanding of the MolPAL library is that the MPN runs should reproduce the statistics reported in Tables S1-S3 of the paper.


Environment

Checklist

_____________________________________________________________________________________________________________ ERROR collecting tests/test_acquirer.py ______________________________________________________________________________________________________________
ImportError while importing test module '/EBS/MOLPAL_TESTS/molpal_ffn_50k/tests/test_acquirer.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/anaconda3/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_acquirer.py:8: in <module>
    from molpal.acquirer import Acquirer
molpal/__init__.py:1: in <module>
    from .explorer import Explorer
molpal/explorer.py:13: in <module>
    from molpal import acquirer, featurizer, models, objectives, pools
molpal/featurizer.py:10: in <module>
    import rdkit.Chem.rdMolDescriptors as rdmd
E   ModuleNotFoundError: No module named 'rdkit'
____________________________________________________________________________________________________________ ERROR collecting tests/test_featurizer.py _____________________________________________________________________________________________________________
ImportError while importing test module '/EBS/MOLPAL_TESTS/molpal_ffn_50k/tests/test_featurizer.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/anaconda3/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_featurizer.py:6: in <module>
    from rdkit import Chem
E   ModuleNotFoundError: No module named 'rdkit'
______________________________________________________________________________________________________________ ERROR collecting tests/test_lookup.py _______________________________________________________________________________________________________________
ImportError while importing test module '/EBS/MOLPAL_TESTS/molpal_ffn_50k/tests/test_lookup.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/anaconda3/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_lookup.py:10: in <module>
    from molpal.objectives.lookup import LookupObjective
molpal/__init__.py:1: in <module>
    from .explorer import Explorer
molpal/explorer.py:13: in <module>
    from molpal import acquirer, featurizer, models, objectives, pools
molpal/featurizer.py:10: in <module>
    import rdkit.Chem.rdMolDescriptors as rdmd
E   ModuleNotFoundError: No module named 'rdkit'
_______________________________________________________________________________________________________________ ERROR collecting tests/test_pool.py ________________________________________________________________________________________________________________
ImportError while importing test module '/EBS/MOLPAL_TESTS/molpal_ffn_50k/tests/test_pool.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/anaconda3/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_pool.py:4: in <module>
    from molpal.pools.base import MoleculePool
molpal/__init__.py:1: in <module>
    from .explorer import Explorer
molpal/explorer.py:13: in <module>
    from molpal import acquirer, featurizer, models, objectives, pools
molpal/featurizer.py:10: in <module>
    import rdkit.Chem.rdMolDescriptors as rdmd
E   ModuleNotFoundError: No module named 'rdkit'


davidegraff commented 2 years ago

Thanks for the detailed problem description! I believe your issue is in your compare function. Specifically, you only take the first 100 SMILES of the true dataframe, but that CSV of the true scores isn’t sorted so those first 100 SMILES aren’t actually the top-100. Can you confirm this to be the case?
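
For concreteness, a minimal sketch of pulling the true top-100 this way, assuming the scores CSV has smiles and score columns and that more negative docking scores are better:

import pandas as pd

dfm = pd.read_csv("./data/Enamine10k_scores.csv.gz", compression="gzip")

# more negative docking scores are better, so sort ascending before slicing
true_top_100 = set(dfm.sort_values("score").smiles.head(100))

The compare function above would then intersect the explored SMILES with true_top_100 rather than with the first 100 rows of the unsorted file.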

coparks2012 commented 2 years ago

Thank you for the fast response, David. Unfortunately, sorting did not fix the issue. Interestingly, the library isn't fully ranked, as you mentioned, but it is pretty close. Since my RF and NN results were pretty close to the paper despite my not having sorted the library, I checked the Kendall tau between the original ordering and the score-sorted ordering.

import pandas as pd
import os
import numpy as np
from scipy.stats import kendalltau

dfm = pd.read_csv('./data/Enamine10k_scores.csv.gz',index_col=None, compression='gzip')
dfm['index1'] = dfm.index

#====sort library by score. scores are negative, so ascending is OK.                                                                                                                                                                                      
dfm.sort_values(by=['score'],inplace=True, ignore_index=True)
dfm['index2'] = dfm.index
scores1 = np.array(list(dfm.index1))
scores2 = np.array(list(dfm.index2))

print('how well is the original library ranked:', kendalltau(scores1,scores2))

how well is the original library ranked: KendalltauResult(correlation=0.9525646115743105, pvalue=0.0)

With a Kendall tau of 0.95, the ranking is pretty close!

I have been doing some digging into the mpnmodel.py code and have noticed two things of potential relevance: (1) the training dataloader is not being shuffled, and (2) model checkpointing is not being used. I think (2) could be highly relevant for the smaller 10k/50k datasets, where the initial training set sizes are very small. I suspect training could be highly volatile: the best epoch could easily occur early in training, but the model currently uses the epoch-50 weights regardless of whether the loss has shot back up. I believe the MolPAL paper reports having used model checkpointing to produce its results, but this isn't implemented in the current code.

I tested the MPN molpal implementation as follows.

1) As is. I cloned the GitHub repo, installed the conda environment, and ran the following:

rm -r ./molpal_10k/
rm -r ./our_runs/
mkdir our_runs

python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
mv molpal_10k/ our_runs/run1/

rm -r ./molpal_10k/
python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
mv molpal_10k/ our_runs/run2/

rm -r ./molpal_10k/
python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
mv molpal_10k/ our_runs/run3/

rm -r ./molpal_10k/
python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
mv molpal_10k/ our_runs/run4/

rm -r ./molpal_10k/
python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
mv molpal_10k/ our_runs/run5/

I then post-processed the results with the following code (note that this implements the sort you mentioned).

import pandas as pd
import os
import numpy as np

def compare(df1, df2):
    s1, s2 = list(df1.smiles), list(df2.smiles)[:100]
    dups = list( set(s1) & set(s2) )
    print('number of top scoring compounds found:', len(dups) )
    print('total number of ligands explored:', len(s1), len(set(list(s1) )) )
    print('number of unique ligands recorded in top:', len( list(set(s2))))
    print(' ')
    print(' ')
    return len(dups)

dfm = pd.read_csv('./data/Enamine10k_scores.csv.gz',index_col=None, compression='gzip')
dfm.sort_values(by=['score'],inplace=True)

files = ['./our_runs/run1/data/top_636_explored_iter_5.csv',\
         './our_runs/run2/data/top_636_explored_iter_5.csv', \
         './our_runs/run3/data/top_636_explored_iter_5.csv', \
         './our_runs/run4/data/top_636_explored_iter_5.csv',\
         './our_runs/run5/data/top_636_explored_iter_5.csv']

num=[]
for ff in files:
    df1 = pd.read_csv(ff,index_col=None)
    num+= [compare(df1, dfm)/100.]

num = np.array(num)
print("statistics:", np.mean(num), np.std(num) )

This yields the following results, which are significantly below the 59% reported in the paper.

number of top scoring compounds found: 51
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 40
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 45
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 41
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 50
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

statistics: 0.454 0.04498888751680797
2) With shuffling and checkpointing implemented. To test whether shuffling and model checkpointing could be affecting the results, I needed to modify code in a couple of different places to get the checkpoint folder name to the correct class, but the most important lines are shown below; they go in the train function of the MPNN class in mpnmodel.py.

        # modification 1: shuffle the training data each epoch
        train_dataloader = MoleculeDataLoader(
            train_data, self.batch_size, self.ncpu, shuffle=True, pin_memory=False
        )

        # modification 2: checkpoint the weights with the best validation loss
        callbacks = [
            EarlyStopping("val_loss", patience=10, mode="min"),
            mpnn.ptl.EpochAndStepProgressBar(),
            ModelCheckpoint(monitor="val_loss", dirpath=self.p2_chkpt, filename="best_model"),
        ]

        trainer = PlTrainer(
            logger=False,
            max_epochs=self.epochs,
            callbacks=callbacks,
            gpus=1 if self.use_gpu else 0,
            precision=self.precision,
            enable_model_summary=False,
            enable_checkpointing=True,  # also part of modification 2
        )
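
One caveat, which may already be handled by the other modifications mentioned above: in PyTorch Lightning, ModelCheckpoint only writes the best weights to disk, so unless the best checkpoint is reloaded after trainer.fit (or ckpt_path="best" is passed to a later predict/test call), prediction still uses the last-epoch weights. A minimal sketch of that reload step, assuming the ModelCheckpoint instance is kept in its own variable (here called checkpoint) and lit_model is the LightningModule passed to trainer.fit; both names are illustrative, not molpal's:

import torch

# after trainer.fit(lit_model, train_dataloader, val_dataloader):
if checkpoint.best_model_path:
    # a Lightning checkpoint is a dict whose "state_dict" entry holds the weights
    best_state = torch.load(checkpoint.best_model_path)["state_dict"]
    lit_model.load_state_dict(best_state)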

I then reran the original experiment (running molpal five times and gathering stats). This produced the following results.

number of top scoring compounds found: 56
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 53
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 59
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 52
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 48
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

statistics: 0.536 0.03720215047547655

To be doubly sure, I reran the experiment, this time running molpal with shuffling/checkpointing ten times. This produced the following results.

number of top scoring compounds found: 54
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 57
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 55
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 58
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 48
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 55
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 50
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 56
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

number of top scoring compounds found: 52
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100

statistics: 0.5388888888888889 0.031071403231354582

These results appear to improve upon the current version of molpal. However, they are still a bit below the numbers reported in the paper for the 10k screen. There could be other issues lurking as well.

David, would you be up for cloning the current molpal repo, running the mpn model as is against the 10k library, and letting me know if you get the same stats as I reported here (~0.45)?

davidegraff commented 2 years ago

What are the results when you use the "scores" metric? There could be a bug, but one of the instabilities of the SMILES metric is that it's sensitive to data sorting. The 82nd through 114th scores in the true 10k dataset are all -9.5, meaning that there are nCr(33, 18) ways to get 100% performance on this metric. Of course, at analysis time you can only evaluate one of these sets, and you might very well pick a set of 100 SMILES strings that doesn't line up with what you've acquired in a given run. I can't comment on how the pandas sorting behaves relative to the original analysis scripts, but your analysis pipeline could very well create a scenario that makes these numbers an apples-to-oranges comparison. That's why we went with the somewhat less intuitive "scores" metric for our results: there are still nCr(33, 18) ways to get 100% performance, but the analysis is invariant to which specific set MolPAL identifies at runtime.

To summarize, there could very well be a bug in the code, but:

  1. see if the same performance degradation holds using the "scores" metric
  2. perform your analysis using the original analysis scripts as directed in the publication tag
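
To illustrate the difference, one way a tie-tolerant "scores"-style comparison can be computed is as a multiset intersection of score values rather than an intersection of SMILES strings. This is only a sketch of the idea, not necessarily the exact computation used in scripts/experiment.py, and it assumes the explored-compound CSV also carries a score column:

from collections import Counter

import pandas as pd

k = 100
true = pd.read_csv("./data/Enamine10k_scores.csv.gz", compression="gzip")
explored = pd.read_csv("./our_runs/run1/data/top_636_explored_iter_5.csv")  # illustrative path

# best (most negative) k docking scores in the full library vs. the explored set
true_top = Counter(true.score.nsmallest(k))
explored_top = Counter(explored.score.nsmallest(k))

# multiset intersection: each tied value counts once per occurrence, but it no
# longer matters which of the tied SMILES were actually acquired
n_recovered = sum((true_top & explored_top).values())
print("scores metric:", n_recovered / k)
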
coparks2012 commented 2 years ago

Thank you for your response, David. I hadn't considered the variations in the sort.

I have confirmed that this does not seem to be driving the issue. I did fresh runs over the weekend using the MPN model, a fresh clone of the GitHub repo, and the 10k, 50k, and HTS datasets. The smaller datasets still seem to have issues, while the largest (HTS) set agrees well with the paper. This is the same trend I observed previously.

I put together the following script to parse the results using the paper's method, as you suggested. Analogous scripts were used for the 50k and HTS datasets.

from pathlib import Path
import sys

from matplotlib import pyplot as plt, ticker
from matplotlib.axes import Axes
import numpy as np
from scripts.experiment import Experiment
from scripts.utils import build_true_dict, extract_smis

SIZE = 2.1e6
k = 100

d_smi_score = build_true_dict("./data/Enamine10k_scores.csv.gz")
smis = extract_smis("./libraries/Enamine10k.csv.gz")
y_true = np.array([d_smi_score.get(smi) for smi in smis], float)
hts_top_k = sorted(d_smi_score.items(), key=lambda kv: kv[1], reverse=True)[:k]

HTS_SB_DIR = Path( "/EBS/MOLPAL_repro/molpal_mpn_10k/10K_AL_ROOT/0.01")
rs = []
for e in (HTS_SB_DIR).iterdir():
    e = Experiment(e)
    r = [e.calculate_reward(i, hts_top_k)[2] for i in range(e.num_iters)][-1]
    rs.append(r)

R = np.array(rs)
print("SCORES METRIC:", np.mean(R), np.std(R) )

HTS_SB_DIR = Path( "/EBS/MOLPAL_repro/molpal_mpn_10k/10K_AL_ROOT/0.01")
rs = []
for e in (HTS_SB_DIR).iterdir():
    e = Experiment(e)
    r = [e.calculate_reward(i, hts_top_k)[1] for i in range(e.num_iters)][-1]
    rs.append(r)

R = np.array(rs)
print("SMILES METRIC:", np.mean(R), np.std(R) )

Here are the results for each dataset, using the current GitHub repo code as-is:

10K dataset
SCORES METRIC: 0.542 0.0406939798987516
SMILES METRIC: 0.45599999999999996 0.026532998322843202

50K dataset
SCORES METRIC: 0.7023999999999999 0.005276362383309172
SMILES METRIC: 0.6612 0.004995998398718723

HTS dataset (averaged over two runs - other runs in progress)
SCORES METRIC: 0.9744999999999999 0.0005000000000000004
SMILES METRIC: 0.947 0.0

In general, the statistics match the results in the paper more closely as the dataset size increases. For the 10k/50k datasets, the metrics consistently underperform.

I will report back with the RF and NN analysis. My prior runs suggest those numbers reproduce the paper well.