choderalab / pinot

Probabilistic Inference for NOvel Therapeutics
MIT License

[incomplete] Updated README #21

Open dnguyen1196 opened 4 years ago

dnguyen1196 commented 4 years ago

I think an updated README will help show how the different components are connected, even if the structure of the project might change. I'll volunteer to do this since it will help me get a better sense of the project anyway. This issue is for discussions.

Goals of the project

PINOT will be part of a larger reinforcement learning pipeline aimed at automated drug discovery. Its primary goal is to perform uncertainty-calibrated prediction of molecular properties, which is extremely useful in drug discovery. In addition to frequentist point estimates (think maximum likelihood or MAP) of a compound's properties, a good estimate of uncertainty allows the reinforcement learning agent to make better decisions. The RL agent can also better balance the exploration-exploitation tradeoff, leading to real savings in cost and time.

PINOT is a marriage of two frameworks: Bayesian learning and deep learning. Bayesian learning offers a principled framework for making predictions under uncertainty, while deep learning provides flexible toolkits that can learn very complex patterns from chemical and molecular data.

Overall project structure

In the folder pinot/ you will see 5 main modules and two top-level files:

  1. app
  2. inference
  3. regression
  4. representation
  5. tests

The two top-level files are net.py and graph.py.

net.py

This file contains the implementation of the class Net. Intuitively, one can think of Net as the Bayesian neural network (BNN) that is the goal of PINOT. Constructing an instance of a BNN requires two main components. First, we need a neural network architecture that can learn representations of chemical compounds; this is the job of the representation module. Second, we need a parameterization component to transform the latent representation into a distribution over the output (say, a distribution over the predicted toxicity of a compound). For example, the parameterization can take in a latent representation and output the mean and variance of the prediction.
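To make the composition concrete, here is a minimal, hypothetical sketch of how Net might wire the two components together. The class and method names (`Net`, `condition`) and the toy stand-ins are illustrative only; the actual implementation in net.py may differ.

```python
# Hypothetical sketch of how Net composes representation and parameterization.
class Net:
    def __init__(self, representation, parameterization):
        self.representation = representation      # molecule -> latent vector
        self.parameterization = parameterization  # latent -> (mean, variance)

    def condition(self, molecule):
        """Return the predictive distribution (mean, variance) for a molecule."""
        latent = self.representation(molecule)
        return self.parameterization(latent)

# Toy stand-ins for illustration only; not the real modules.
def toy_representation(molecule):
    return [float(len(molecule))]  # e.g. latent = [length of SMILES string]

def toy_parameterization(latent):
    mean = sum(latent)
    variance = 1.0  # toy homoscedastic noise
    return mean, variance

net = Net(toy_representation, toy_parameterization)
mean, var = net.condition("CCO")  # SMILES for ethanol
```

The point is the separation of concerns: the representation knows about chemistry, while the parameterization only turns a latent vector into a distribution.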

graph.py

This file contains helper functions and tools for creating graph representations of chemical molecules from their atom and bond information, building on the RDKit open-source project (https://www.rdkit.org/docs/index.html).
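To illustrate the idea without the RDKit dependency, here is a hypothetical, dependency-free sketch of turning atom and bond lists into an adjacency-list graph. The function name and data layout are assumptions for illustration; the real code parses molecules with RDKit first.

```python
# Illustrative sketch: atoms + bonds -> adjacency-list graph.
def molecule_to_graph(atoms, bonds):
    """atoms: list of element symbols; bonds: list of (i, j) index pairs."""
    graph = {i: {"element": el, "neighbors": []} for i, el in enumerate(atoms)}
    for i, j in bonds:
        # molecular bonds are undirected, so record both directions
        graph[i]["neighbors"].append(j)
        graph[j]["neighbors"].append(i)
    return graph

# Ethanol as a toy example: C-C-O
g = molecule_to_graph(["C", "C", "O"], [(0, 1), (1, 2)])
```

In practice the node entries would carry featurized atom attributes (element, charge, hybridization, etc.) rather than bare symbols.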

app

This module contains tools to perform standardized, rapid, and large-scale tests of our BNNs, as well as to visualize training and testing progress. The purpose of this module is to give us a standardized testing pipeline, so that we know exactly which performance metrics are being measured and can quickly run a battery of diagnostic tests.

TODO: We would like to have a more flexible and/or more comprehensive testing/training tool that keeps track of a number of metrics (and users can choose which metrics to track from a list of implemented metrics).
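One possible shape for the configurable tracker this TODO describes is sketched below. All names here (`Tracker`, `rmse`, `mae`, `METRICS`) are hypothetical, not part of the current API.

```python
import math

# Hypothetical metric registry; users pick metrics by name from it.
def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

METRICS = {"rmse": rmse, "mae": mae}

class Tracker:
    def __init__(self, metric_names):
        self.metrics = {name: METRICS[name] for name in metric_names}
        self.history = {name: [] for name in metric_names}

    def record(self, pred, true):
        # evaluate every requested metric on this batch and log it
        for name, fn in self.metrics.items():
            self.history[name].append(fn(pred, true))

tracker = Tracker(["rmse", "mae"])
tracker.record([1.0, 2.0], [1.0, 4.0])
```

A registry keyed by name keeps the training loop agnostic to which metrics the user asked for.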

inference

This module contains implementations of different algorithms for performing inference in Bayesian neural networks (learning a distribution over the weights and parameters of the networks). As of this writing, we have 4 inference algorithms: Stochastic Gradient Langevin Dynamics, Adaptive Langevin Dynamics, Stein Variational Gradient Descent, and Bayes by Backprop.
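As a flavor of what these samplers do, here is a minimal sketch of one Stochastic Gradient Langevin Dynamics (SGLD) update on a single 1-D parameter: a half-step of gradient ascent on the log posterior plus Gaussian noise scaled by the step size. The in-repo samplers operate on full network weight tensors, so this is a toy, not the project's implementation.

```python
import math
import random

def sgld_step(theta, grad_log_posterior, step_size, rng):
    # theta <- theta + (eps/2) * grad log p(theta | data) + N(0, eps)
    noise = rng.gauss(0.0, math.sqrt(step_size))
    return theta + 0.5 * step_size * grad_log_posterior(theta) + noise

# Toy target: standard normal posterior, so grad log p(theta) = -theta.
rng = random.Random(0)
theta = 3.0
samples = []
for _ in range(5000):
    theta = sgld_step(theta, lambda t: -t, 0.1, rng)
    samples.append(theta)

sample_mean = sum(samples) / len(samples)
```

With a standard normal target, the chain should wander away from its initial value of 3.0 and concentrate around 0 with roughly unit variance.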

regression

TODO

representation

This module contains implementations of a general graph neural network, as well as functionality to construct a sequence of graph neural network layers. In recent years, graph neural networks have become the state-of-the-art tools in molecular deep learning.

TODO: We would like to support both differentiable representations (where the neural network mapping from molecular information to representation is trainable) and fixed representations (pre-trained graph neural networks for molecular fingerprints).
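The core operation of such a layer is message passing: each node aggregates features from its neighbors and combines them with its own. Below is a framework-free sketch under assumed names (`gnn_layer`, a single scalar `weight`); the real module builds trainable layers with a deep-learning framework.

```python
# Illustrative message-passing layer with sum aggregation.
def gnn_layer(node_features, edges, weight):
    """node_features: list of floats; edges: (i, j) pairs; weight: scalar."""
    messages = [0.0] * len(node_features)
    for i, j in edges:
        # undirected edges: each endpoint receives the other's feature
        messages[i] += node_features[j]
        messages[j] += node_features[i]
    # update: combine self feature with aggregated neighbor messages
    return [h + weight * m for h, m in zip(node_features, messages)]

# Stacking two layers gives a small "sequence" of GNN layers, as in the module.
h = [1.0, 2.0, 3.0]
edges = [(0, 1), (1, 2)]
h = gnn_layer(h, edges, weight=0.5)
h = gnn_layer(h, edges, weight=0.5)
```

Each additional layer lets information propagate one hop further across the molecular graph, which is why layers are stacked in sequence.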

How the different components interact

Examples

Example scripts can be found in the scripts/ folder.

Related work

This project lies at the intersection of graph neural networks and Bayesian inference algorithms based on stochastic gradient MCMC.

Graph Neural Networks in biochemistry

Neural Message Passing for Quantum Chemistry

Convolutional Graph Neural Networks for molecular fingerprint learning

Bayesian inference algorithms based on stochastic gradient MCMC

Bayesian learning via Stochastic Gradient Langevin Dynamics

Stochastic Gradient Hamiltonian Monte Carlo

Bayesian Sampling using Stochastic Gradient Thermostat

maxentile commented 4 years ago

Nice -- thank you for starting this! I was just mentioning to @yuanqing-wang that I would benefit a lot from having a statement of the scope of this project. The orientation you've provided so far here is very very helpful!

A section on Related Work (with subsections for "RL in drug discovery", "Bayesian RL", etc.) may also be good to include early.

yuanqing-wang commented 4 years ago

Thanks!

I'll share a statement of problem with y'all within days.

yuanqing-wang commented 4 years ago

sketched some initial ideas here https://github.com/choderalab/pinot/tree/infrastructure/drafts/problem-statement and will share with you through Overleaf.

karalets commented 4 years ago

I'd prefer to keep the RL agent separate since that doc talks about graph-nets and mentions Bayesian sth., but not any RL.

Also we had an earlier doc about RL that I started sketching, but in my opinion we are still very far from RL in this project at this stage, which is why I paused on discussing it until we can reasonably attack that topic.

The initial paper I have in mind as 'feasible' that I mentioned to discuss on Friday is about experimental design/active learning, not RL, as we "may" have a chance to do that sooner.

My suggestion is you share the paper by the Cambridge group and we focus around that and building a similar workflow and iterate on the graph-nets, unsupervised- and semi-supervised learning, and inference and leave the RL for when it's time for RL.

yuanqing-wang commented 4 years ago

the paper @karalets mentioned is this one https://doi.org/10.1039/C9SC00616H

karalets commented 4 years ago

Also please have a look at issue #3 as the intro to it gives some context here.