coleygroup / molpal

active learning for accelerated high-throughput virtual screening
MIT License

Suggest_next_molecule #1

Closed lcollia closed 3 years ago

lcollia commented 3 years ago

Dear authors, great work. Do you have a function that suggests the next molecule to test based on the surrogate model that has been trained? I don't mean scoring an existing list of candidate molecules (your "library"), but generating the fingerprint of the best next molecule to test according to the acquisition function.

For example, see the "suggest_next_locations" function in GPyOpt, a similar library for Bayesian optimization.

thanks, Lionel

davidegraff commented 3 years ago

Hey Lionel,

thanks for the interest in our work here!

you raise a good point and one that we thought of while developing MolPAL given the software's similarity to existing Bayesian optimization libraries. However, we decided against it for two reasons:

  1. Strictly speaking, we are doing pool-based optimization, so generating new molecules was out of scope for the project.
  2. The nature of the problem, optimization in chemical space, is a weird one. There's no widely agreed-upon method for representing molecules. They're discrete, but weirdly so. Suggesting molecules based on our trained model would have required us to wade into that research, which, again, is wholly out of scope for this project. Moreover, we predict molecular properties using two broad classes of models: RF and NN models that learn from a fixed molecular representation calculated from the graph (a fingerprint), and a message-passing model that learns a task-specific fingerprint. In the latter case, we would also need an auxiliary module that learns how to construct molecules from this task-specific fingerprint (a more difficult task than the "simple" reconstruction of molecules from their fixed fingerprint).

Previous projects have done something similar to what you're suggesting. See ACS Cent. Sci. 2018, 4, 2, 268–276 for a good example.

I'm happy to talk more about this, and it's certainly a possible future project, but we have no plans to implement something like that in MolPAL.

best, david

lcollia commented 3 years ago

Hi David, thank you for your detailed answer, I appreciate it. A molecular fingerprint is clearly not the best space for optimization, since decoding a fingerprint back to a molecule is not straightforward. But one could imagine that it would be possible with another descriptor space (like the latent space of an autoencoder). Best, Lionel

davidegraff commented 3 years ago

To be clear, we're not optimizing in fingerprint space. Rather, we're optimizing directly in structure space: the virtual library (the MoleculePool) is a fully enumerated, discrete optimization domain. Treating the problem this way means we have to predict the objective function value of every single point in the domain, which we know to be "inefficient," but it also lets us sidestep the research question of "how do you accurately represent molecules?" That's an ongoing challenge in the field and not something we were interested in addressing in this work. In principle, if you could devise an accurate descriptor that is unique and fully invertible for every molecule, then you could perform molecular optimization using standard Bayesian optimization libraries. Works like the VAE that I mentioned above have their own challenges (synthesizability being a big one), so that was another reason why we stuck to this problem formulation.
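
To make the pool-based formulation concrete, here is a minimal sketch of one acquisition step. It is not MolPAL's actual API: `select_batch`, `ucb`, and `toy_predict` are hypothetical names, the surrogate is a stand-in that returns a made-up mean and uncertainty per SMILES string, and upper confidence bound is just one possible acquisition function. The point is only that every unexplored molecule in the enumerated pool gets scored, and the next molecules are chosen from the pool rather than generated.

```python
from typing import Callable, Iterable, List, Set, Tuple


def ucb(mean: float, std: float, beta: float = 2.0) -> float:
    """Upper confidence bound: predicted mean plus beta times uncertainty."""
    return mean + beta * std


def select_batch(
    pool: Iterable[str],
    predict: Callable[[str], Tuple[float, float]],
    explored: Set[str],
    batch_size: int,
    beta: float = 2.0,
) -> List[str]:
    """Score every unexplored molecule in the pool and return the top batch.

    The whole pool is evaluated on each call, which is the "inefficient"
    part of the pool-based formulation, but no molecule ever has to be
    decoded from a descriptor.
    """
    scored = []
    for smiles in pool:
        if smiles in explored:
            continue  # already evaluated with the real objective
        mean, std = predict(smiles)
        scored.append((ucb(mean, std, beta), smiles))
    # Highest acquisition score first; ties keep their pool order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [smiles for _, smiles in scored[:batch_size]]


def toy_predict(smiles: str) -> Tuple[float, float]:
    """Hypothetical surrogate, for illustration only (not a real model)."""
    mean = float(smiles.count("C"))  # made-up "predicted score"
    std = 0.1 * len(smiles)          # made-up "uncertainty"
    return mean, std


pool = ["CCO", "CCCC", "c1ccccc1", "CC(=O)O", "CCN"]
next_molecules = select_batch(pool, toy_predict, explored={"CCCC"}, batch_size=1)
print(next_molecules)
```

In a full loop, the selected molecules would be evaluated with the real objective (e.g. docking), added to the explored set, and the surrogate retrained before the next call.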