coleygroup / molpal

active learning for accelerated high-throughput virtual screening
MIT License

Y_pred.npy file #16

Closed albertma-evotec closed 2 years ago

albertma-evotec commented 2 years ago

Hi, I am using molpal for a retrospective docking study. The objective is configured to look up already-known docking scores. [image] I am trying to understand the output files. I found that the Y_pred.npy file is a numpy array of floating-point numbers, and its size is the same as my molecular library. Do these numbers reflect how 'good' the corresponding compounds are, so that molpal will select them for exploration in the next iteration? Or are they simply the docking scores predicted by the RF regression model? And does the order of these numbers follow the order of the compounds in the library file?

Below is my config file: [image]

Many thanks

davidegraff commented 2 years ago

These are the predicted means (and uncertainties, if needed) of your surrogate model. They’re used for checkpointing and retrospective analysis. The array is parallel to the molecules in your library.
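As a rough illustration of what "parallel" means here, the sketch below pairs each prediction with the molecule at the same index and lists the highest-predicted entries. It is not taken from molpal itself: the helper name `top_k_by_prediction`, the toy SMILES, and the toy values are all hypothetical, and it assumes the array holds a single predicted mean per molecule and that a higher value means "more promising" (which depends on how the objective was configured). In practice the array would presumably come from `np.load("Y_pred.npy")` and the SMILES from the library file, read in the same order.

```python
import numpy as np


def top_k_by_prediction(smiles, y_pred, k=3):
    """Pair each molecule with the prediction at the same index and
    return the k entries with the highest predicted mean.

    Assumption: y_pred is a 1-D array of predicted means, one per molecule.
    """
    y_pred = np.asarray(y_pred, dtype=float)
    if len(smiles) != y_pred.shape[0]:
        raise ValueError("library and prediction array differ in length")
    # Sort indices by descending predicted mean and keep the first k.
    order = np.argsort(-y_pred)[:k]
    return [(smiles[i], float(y_pred[i])) for i in order]


# Toy stand-ins (hypothetical values), in library order.
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
y_pred = np.array([7.1, 9.3, 5.0, 8.2])

top = top_k_by_prediction(smiles, y_pred, k=2)
print(top)
```

The length check is there because the index-to-molecule mapping only holds if the library is read in the same order and with the same number of entries that were used during the run.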

albertma-evotec commented 2 years ago

These are the predicted means (and uncertainties, if needed) of your surrogate model. They’re used for checkpointing and retrospective analysis. The array is parallel to the molecules in your library.

Can I say that the higher the number is, the more likely the corresponding compound is to have a better (more negative) docking score? Or do these 'predicted means' not necessarily correlate with the docking scores?

I tried plotting these numbers (from the final iteration) against the true docking scores, and I did not see much correlation.

davidegraff commented 2 years ago

Surrogate model accuracy is related to, but not the same as, optimization performance. You can make whatever claims or analyses you want; we’re just giving you the information. In a greedy optimization, the most important model-based metric is rank correlation, because new points are prioritized solely on their predicted mean.
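For anyone wanting to check rank correlation on their own run, here is a minimal sketch of a Spearman coefficient computed with numpy alone. The function names and the toy numbers are hypothetical, and the sign handling is an assumption: the example supposes the surrogate was trained on negated docking scores (so that higher means better), which may or may not match your configuration. `scipy.stats.spearmanr` should give the same quantity if SciPy is available.

```python
import numpy as np


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    # Replace the ranks within each group of tied values by their average.
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


# Toy stand-ins (hypothetical values), in library order.
true_docking = np.array([-9.5, -7.0, -8.2, -6.1, -10.3])  # more negative = better
y_pred = np.array([9.0, 7.5, 7.0, 6.0, 11.0])             # assumed: higher = better

# Assumption: predictions are on the negated-score scale, so compare
# against the negated true scores.
rho = spearman(y_pred, -true_docking)
print(f"Spearman rho: {rho:.3f}")
```

A scatter plot can look uncorrelated overall while the ordering near the top is still reasonable, so it may also be worth computing the same statistic on only the best-scoring part of the library.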