AIML-K / GNN_Survey

updating papers related to GNN

Graph neural bayesian optimization for virtual screening #3

Open 2nazero opened 1 week ago

2nazero commented 1 week ago

Graph neural bayesian optimization for virtual screening

@inproceedings{wang2023graph,
  title={Graph neural bayesian optimization for virtual screening},
  author={Wang-Henderson, Miles and Soyuer, Bartu and Kassraie, Parnian and Krause, Andreas and Bogunovic, Ilija},
  booktitle={NeurIPS 2023 Workshop on Adaptive Experimental Design and Active Learning in the Real World},
  year={2023}
}
wonsutak-ysz commented 2 days ago

Graph Neural Bayesian Optimization for Virtual Screening

Introduction

This paper addresses the challenge of virtual screening in drug and materials discovery. Virtual screening is a computational technique used to identify promising candidate molecules that may possess desired properties, such as binding to a drug target or exhibiting specific quantum mechanical properties.

Two main challenges exist:

- The vast size of chemical libraries: with the number of accessible molecules reaching billions or even trillions, a brute-force search is computationally infeasible.
- The high cost of evaluating properties: the computational simulations or experiments needed to determine these properties can be prohibitively expensive.

The proposed solution is Bayesian optimization, using GNNs to represent molecules.

Problem Setting

The goal of this approach is to solve optimization problems related to drug or material design. More specifically, the aim is to optimize an unknown objective function (for instance, maximizing binding affinity to a drug target) while querying a molecule's property evaluation function as few times as possible.

Bayesian Optimization with Graphs

A molecule is represented as a graph, and the task is to maximize some unknown objective function $f(G)$ that measures the "goodness" of a molecule (where $G$ is a graph). This could involve evaluating chemical properties like binding affinity, toxicity, or drug-likeness scores. Each time a molecule (a graph $G_t$) is selected, we receive a noisy evaluation $$y_t = f(G_t) + \epsilon_t$$

Sampling Efficiency

The algorithm reduces computational complexity by randomly sub-sampling from the set of all molecules. This random sampling shrinks the number of molecules evaluated at each step while still letting the algorithm explore a diverse set of candidates, thereby balancing exploration (finding new, promising molecules) and exploitation (refining the search around already observed good candidates).
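A minimal sketch of that sub-sampling step (the names `library` and `m` are illustrative, not from the paper):

```python
import random

def sample_candidates(library, m, seed=0):
    """Draw a random subset of m molecules to score this round,
    rather than scoring the entire library."""
    rng = random.Random(seed)
    return rng.sample(library, k=min(m, len(library)))
```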

wonsutak-ysz commented 2 days ago

Proposed Method: GNN-SS

The key contribution of the paper is GNN-SS, a Graph Neural Network-powered Bayesian Optimization algorithm that incorporates random sub-sampling (SS) to make the optimization problem tractable.

Key Features of GNN-SS:

- Random Sub-Sampling (SS): instead of evaluating every molecule in the large chemical space, the algorithm selects a smaller random subset of molecules at each optimization step, thereby saving computational resources. This also encourages exploration of diverse molecules early on.
- Graph Neural Networks (GNNs): the algorithm uses GNNs to model the relationship between molecular structures and properties. GNNs are particularly well suited to molecular data, since molecules can naturally be represented as graphs.
- Uncertainty Estimation: the model needs to estimate uncertainty in its predictions to decide whether to explore new areas of the molecular space. The authors introduce Johnson-Lindenstrauss (JL) random feature mappings, which allow the model to handle uncertainty efficiently without costly matrix inversions.
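As a rough sketch of the random-feature idea (hypothetical, not the paper's exact construction): project GNN embeddings through a fixed JL-style random matrix and fit a Bayesian linear model in the projected space, so the posterior update only ever inverts a small $k \times k$ matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 256, 32                                # embedding dim -> projected dim (illustrative)
P = rng.standard_normal((k, d)) / np.sqrt(k)  # JL-style random projection

def posterior(Z, y, noise=0.1):
    """Bayesian linear regression on projected features z = P @ embedding.
    Only a k x k matrix is inverted, regardless of how many molecules
    have been evaluated so far."""
    A = Z.T @ Z / noise**2 + np.eye(k)        # k x k posterior precision
    cov = np.linalg.inv(A)
    mean = cov @ (Z.T @ y) / noise**2
    return mean, cov
```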

Acquisition Function & Optimization

The model uses an acquisition function, specifically the Upper Confidence Bound (UCB), which balances the trade-off between exploration (trying new molecules) and exploitation (refining search among already promising candidates). This function evaluates the usefulness of each molecule based on the model's prediction and uncertainty.
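Concretely, the UCB score is the model's predicted value plus an uncertainty bonus; a one-line sketch (the exploration weight `beta` is illustrative):

```python
def ucb(mean, std, beta=2.0):
    """Upper Confidence Bound: predicted value plus an exploration
    bonus proportional to the model's uncertainty."""
    return mean + beta * std
```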

At each iteration, the algorithm:

1. Samples a subset of molecules.
2. Evaluates the acquisition function on this subset.
3. Selects the molecules with the highest scores.
4. Updates the GNN model based on the new evaluations.
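Putting the pieces together, a minimal sketch of that loop, reusing `sample_candidates` and `ucb` from the earlier sketches (`oracle` and `model` are placeholders for the property evaluator and the GNN surrogate, not the paper's code):

```python
import numpy as np

def gnn_ss_loop(library, oracle, model, m=512, rounds=100):
    """One possible shape of the GNN-SS loop: sub-sample, score by UCB,
    query the most promising molecule, then retrain the surrogate."""
    history = []
    for t in range(rounds):
        batch = sample_candidates(library, m)         # 1. sub-sample
        mean, std = model.predict(batch)              # 2. acquisition inputs
        best = batch[int(np.argmax(ucb(mean, std)))]  # 3. highest score
        history.append((best, oracle(best)))          # noisy y_t = f(G_t) + eps
        model.fit(history)                            # 4. update the GNN
    return max(history, key=lambda pair: pair[1])
```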

wonsutak-ysz commented 2 days ago

Experimental Results

The paper evaluates the performance of GNN-SS on the ZINC dataset, which contains 250,000 small organic molecules and is commonly used for benchmarking molecular optimization algorithms. The benchmark covers multiple molecular objectives, such as quantitative estimate of drug-likeness (QED) scores, binding affinities, and other chemical properties.

Benchmarking Against Other Algorithms

The proposed algorithm is compared against 29 other methods on the Practical Molecular Optimization (PMO) benchmark. GNN-SS performs well, achieving state-of-the-art performance among non-generative screening methods and even outperforming some generative models. It ranks 4th overall and 1st among screening methods, demonstrating both sample efficiency and the ability to find high-reward molecules.

The experiments show that random sub-sampling not only saves computational costs but also helps the model explore a diverse range of molecules early on. This diversity is critical for finding optimal molecules in large chemical spaces.

Conclusion

The paper concludes by highlighting the success of GNN-SS in addressing the challenges of virtual screening for large molecular libraries. The approach efficiently handles vast search spaces by employing a combination of GNNs, random sub-sampling, and efficient uncertainty estimates.

While GNN-SS achieves competitive performance on datasets with up to 1 million molecules, the authors acknowledge that scaling the approach to billion-scale molecular databases or applying it to de novo design (generating new molecules) remains an open challenge.

2nazero commented 1 day ago

I have several questions.

  1. How are molecules represented as a graph? Like the process?
  2. How do you measure the "goodness" of a molecule?
  3. How are the challenges of the vast size of chemical libraries and high cost of evaluating properties solved by these methods?
  4. Can you explain the difference between GraphSage and GNN-SS? GraphSage also samples the neighbor nodes partially and I think GNN-SS is similar. And also why do you think they used GNN-SS instead of GraphSage in this paper?
  5. Can you explain how Johnson-Lindenstrauss (JL) random feature mappings allow the model to efficiently handle uncertainty?
wonsutak-ysz commented 1 day ago

I have several questions.

  1. How are molecules represented as a graph? Like the process?
  2. How do you measure the "goodness" of a molecule?
  3. How are the challenges of the vast size of chemical libraries and high cost of evaluating properties solved by these methods?
  4. Can you explain the difference between GraphSage and GNN-SS? GraphSage also samples the neighbor nodes partially and I think GNN-SS is similar. And also why do you think they used GNN-SS instead of GraphSage in this paper?
  5. Can you explain how Johnson-Lindenstrauss (JL) random feature mappings allow the model to efficiently handle uncertainty?

Of course,

1. A simple example: atoms are represented as nodes, and chemical bonds as edges. Libraries like RDKit, NetworkX, and PyTorch Geometric can be used to convert molecules into graph representations that a GNN can consume.
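As a concrete illustration with RDKit (the SMILES string is just ethanol, chosen for brevity):

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")  # ethanol

# Nodes: atom indices with element symbols
nodes = [(a.GetIdx(), a.GetSymbol()) for a in mol.GetAtoms()]

# Edges: bonds between atom indices, annotated with bond type
edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx(), str(b.GetBondType()))
         for b in mol.GetBonds()]

print(nodes)  # [(0, 'C'), (1, 'C'), (2, 'O')]
print(edges)  # [(0, 1, 'SINGLE'), (1, 2, 'SINGLE')]
```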

2. The standard is the PMO benchmark, which includes 23 objectives such as binding affinity, quantum mechanical energy, and QED. Rankings like top-1 and top-10 are used to measure the "goodness" of molecules.

3. Using Bayesian Optimization (BO) combined with GNN-based sub-sampling improves efficiency compared to previous global-search or brute-force methods.

4. GraphSAGE samples neighbor nodes within a single graph to aggregate local information, for tasks like node classification in large networks (e.g., social networks). GNN-SS, by contrast, sub-samples whole molecules (entire graphs) from the library to solve a global optimization problem over large-scale data, which is the bottleneck in virtual screening that neighbor sampling alone would not address.

5. The JL technique reduces the dimensionality of high-dimensional feature data while approximately preserving distances, which helps avoid the computational cost of the matrix inversions that arise in standard BO posterior updates.
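A quick numeric illustration of the JL property (the dimensions here are purely illustrative): pairwise distances approximately survive a random projection to a much lower dimension, so the downstream linear algebra operates on small matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1024))               # 100 points in 1024 dims
P = rng.standard_normal((64, 1024)) / np.sqrt(64)  # JL projection to 64 dims
Z = X @ P.T

print(np.linalg.norm(X[0] - X[1]))  # original distance
print(np.linalg.norm(Z[0] - Z[1]))  # roughly the same after projection
```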

If you need any further explanation or details, please let me know.

d-h-lee commented 22 hours ago

Nice work @wonsutak-ysz .

Curious point raised here: "GNN sub-sampling method improves efficiency" <-- how can subsampling improve (which) efficiency?

Also, following up on the previous comment, I suggest opening up independent issues for each of the following:

wonsutak-ysz commented 18 hours ago

Curious point raised here: "GNN sub-sampling method improves efficiency" <-- how can subsampling improve (which) efficiency?

Since the size of the chemical library far exceeds the model's evaluation capacity, the GNN-SS algorithm randomly selects a small subset of molecules from the library at each step as the candidate set. This significantly reduces the per-iteration computational cost, since the entire chemical library need not be evaluated at every time step. The sub-sampling process also promotes exploration and diversity, helping to prevent the model from getting stuck in local optima.

I think it is a good suggestion, and I will open independent issues for the following two topics.

2nazero commented 18 hours ago

I think it is a good suggestion, and I will open independent issues for the following two topics.

Can you open the issues in this repository? I would like to see them too. Thank you.