coleygroup / molpal

active learning for accelerated high-throughput virtual screening
MIT License

test on in-house data, result is bad #22

Closed likun1212 closed 2 years ago

likun1212 commented 2 years ago

hi, thanks for your work.

I used some of our in-house docking data to test molpal. Our data contains ~200k molecules and their corresponding affinities.

The result is rather bad: compared to the "real" docking data, only 168 molecules appear in the top 2k, i.e. a recovery rate of 8.4%.
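For readers following along, here is a minimal sketch of how a top-k recovery rate like the one quoted above could be computed. The function name `top_k_recovery` and the toy data are hypothetical and are not part of MolPAL; the sketch assumes lower scores are better, matching `minimize = True` in the config.

```python
def top_k_recovery(true_scores, acquired, k, minimize=True):
    """Fraction of the true top-k molecules that appear in `acquired`.

    true_scores: dict mapping a molecule identifier (e.g. SMILES) to its score
    acquired:    iterable of identifiers selected by the screening run
    """
    # Rank all molecules by their true score, best first
    ranked = sorted(true_scores, key=true_scores.get, reverse=not minimize)
    true_top_k = set(ranked[:k])
    return len(true_top_k & set(acquired)) / k


# Toy example (made-up identifiers and scores, for illustration only)
toy_scores = {"A": -9.1, "B": -8.4, "C": -7.7, "D": -6.0, "E": 3.2}
toy_acquired = {"A", "C", "E"}

# The true top 2 are A and B; only A was acquired
recovery = top_k_recovery(toy_scores, toy_acquired, k=2)
print(recovery)  # 0.5

# The figure reported in this thread: 168 hits out of a top 2000
reported = 168 / 2000
print(f"{reported:.1%}")  # 8.4%
```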

below is the test config file:

#############################################
path = results/greedy_top001_win10_delta01
window-size = 10
delta = 0.1
max-iters = 10
budget = 1.0
write-final = True
write-intermediate = True
retrain-from-scratch = True
ncpu = 1
fingerprint = pair
radius = 2
length = 2048
pool = eager
libraries = [/library/viva_196370_smiles.csv]
delimiter = ,
fps = library/viva_196370_smiles.h5
invalid-idxs = []
metric = greedy
init-size = 0.01
batch-sizes = [0.01]
objective = lookup
minimize = True
objective-config = /objective/viva20w_lookup.ini
model = rf
n-estimators = 100
max-depth = 8
min-samples-leaf = 1
precision = 32
top-k = 0.01
######################################################

Where did I go wrong? Any feedback would be appreciated!

davidegraff commented 2 years ago

The config file looks reasonable to me. What does your objective config file look like? There’s also no theoretical guarantee of performance with MolPAL. It’s possible that your optimization surface is too rough for an RF to optimize over effectively.

likun1212 commented 2 years ago

thanks

objective config:
############################
path = data/viva_196370_smiles_affi-vina_4idv.csv
smiles-col = 0
score-col = 1
###################################

In addition, my ground-truth docking data contains a lot of molecules with large positive affinity values. Will this affect the result? [Figure_1]

likun1212 commented 2 years ago

i did a test.

I removed all positive affinities from the data, which drops the number of data points from 196370 to 155877. The recovery rate for the top 2k changed to 34% (was 8.4%).
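The filtering step described above could look roughly like the following sketch. It assumes a comma-delimited file with SMILES in column 0 and the score in column 1, as in the objective config posted earlier. The function name `drop_positive_scores` and the toy CSV are hypothetical, and keeping non-numeric rows (such as a header) is an assumption of this sketch.

```python
import csv
import io


def drop_positive_scores(rows, score_col=1):
    """Keep rows whose score is <= 0; rows with a non-numeric score
    (e.g. a header line) are passed through unchanged."""
    kept = []
    for row in rows:
        try:
            score = float(row[score_col])
        except ValueError:
            kept.append(row)  # assumed to be a header row
            continue
        if score <= 0:
            kept.append(row)
    return kept


# Toy stand-in for the lookup CSV (made-up values, for illustration only)
toy_csv = """smiles,score
CCO,-4.2
c1ccccc1,-5.1
CCN,12.7
CC(=O)O,-3.9
CCCC,250.0
"""

filtered = drop_positive_scores(csv.reader(io.StringIO(toy_csv)))
for row in filtered:
    print(",".join(row))

# In the thread, this kind of filter removed 196370 - 155877 = 40493 rows
removed = 196370 - 155877
print(removed, f"{removed / 196370:.1%}")
```

Note that dropping molecules from the lookup file also means they should be dropped from the library file, otherwise the lookup objective has no score for them; how missing entries are handled is not covered by this sketch.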

how so?

davidegraff commented 2 years ago

Are those large positive scores true output from Vina, or are they just placeholder values? I’ve never seen docking scores like those in all the millions of calculations I’ve run. In any case, think about how those scores are affecting the training of your surrogate model.
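One way to see the effect hinted at here is to check how much of the target variance comes from the large positive values. The sketch below uses made-up scores, not the poster's data, and assumes the surrogate is fit by minimizing squared error, as is typical for random forest regressors. Under that assumption, a few extreme targets can dominate the loss, so the model spends its capacity separating outliers from everything else instead of ranking the good binders.

```python
from statistics import pvariance

# Synthetic scores for illustration only
typical = [-9.5, -8.7, -8.1, -7.6, -7.2, -6.9, -6.3, -5.8]
placeholders = [50.0, 120.0]  # hypothetical large positive values

var_typical = pvariance(typical)
var_all = pvariance(typical + placeholders)

print(f"variance without large positives: {var_typical:.2f}")
print(f"variance with large positives:    {var_all:.2f}")
print(f"ratio: {var_all / var_typical:.0f}x")
```

With these toy numbers, two outliers out of ten points inflate the variance by several orders of magnitude, which is the kind of distortion that filtering or clipping the scores would avoid.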

likun1212 commented 2 years ago

Thanks. At least I now know what to do next.

feel free to close this issue