choderalab / pinot

Probabilistic Inference for NOvel Therapeutics

check any discrepancies/errors in AdLaLa #6

Open yuanqing-wang opened 4 years ago

yuanqing-wang commented 4 years ago

I re-implemented the AdLaLa integrator by Tiffany and Ben (https://github.com/TiffanyVlaar/ThermodynamicParameterizationOfNNs/blob/master/AdLaLa.ipynb) here:

https://github.com/choderalab/pinot/blob/master/pinot/inference/adlala.py
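For reviewers who haven't seen the method, here is a rough conceptual sketch of the adaptive-Langevin piece, condensed to a single parameter tensor and a single crude step. The actual class in adlala.py follows the layer-partitioned, split scheme from the notebook; the function name, default values, and one-step form below are just illustrative, not the real implementation.

```python
import torch

def adaptive_langevin_step(theta, p, xi, grad, h, tau=1e-4, eps=0.05, sigma=0.01):
    """One crude adaptive-Langevin update (conceptual sketch only).

    theta, p : parameter and momentum tensors (same shape)
    xi       : scalar thermostat ("adaptive friction") variable
    grad     : dL/dtheta evaluated at theta
    h        : step size; tau = temperature, eps = thermostat coupling,
               sigma = additive noise amplitude (illustrative defaults)
    """
    n = theta.numel()
    p = p - h * grad                                # kick momenta with the loss gradient
    theta = theta + h * p                           # drift parameters with momenta
    xi = xi + h * eps * (p.pow(2).sum() - n * tau)  # Nose-Hoover-like thermostat feedback
    p = p - h * xi * p + sigma * (h ** 0.5) * torch.randn_like(p)  # adaptive friction + noise
    return theta, p, xi
```

In the full AdLaLa scheme, some layers get this adaptive-Langevin update while the remaining layers get a plain Langevin update, which is what the layer partitioning in adlala.py is meant to reproduce.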

jchodera commented 4 years ago

Might be good to bring this into master via a PR to get line-by-line review?

karalets commented 4 years ago

Thanks! Very interesting, I obviously do not know this work.

I think it would be great if we generally do PRs for new features to get line-by-line reviews; that also helps focus discussion inside the PR.

jchodera commented 4 years ago

@karalets : Check out the paper from Ben Leimkuhler on using adaptive Langevin in layers.

We've been having trouble training graph nets and have tracked it down to the optimizer, so we're trying out some MCMC-inspired schemes from our collaborators at Edinburgh!

karalets commented 4 years ago

Ok, I have the following comments to make ahead of even reading this paper: evaluating training performance on graph nets should be done on a dataset for which there are published results for specific methods, along with published code.

Then one can compare inference methods, etc. This method seems to conflate optimization with Bayesian inference, and AFAIK we don't even have an HMC baseline for the networks, which would have been easy to do once the models are semi-solid in their definition.

For instance: I am still confused about the noise model needing an epsilon; I have never seen that needed before. It points to a potential bug somewhere, or indeed to phenomena during training that we are ignoring.

Can we find a training dataset with published optimization-based results that we can try to reproduce exactly? Jumping to a fairly complex Bayesian inference method is interesting, but may not help find errors/bugs that are upstream of inference.

maxentile commented 4 years ago

I'm not sure if the primary goal here is "approximate sampling of a posterior distribution, possibly faster than HMC" or if the goal is "find a more reliable way to minimize the training loss," but that changes what experiments make sense to do.

I'm not sure about the other tasks within this project, but the observation so far on the Slack was that Adam may be poorly suited to energy regression tasks that contain a "molecular-mechanics model as a layer", since Adam fails to make the training loss small even with very small step sizes on very small datasets (even when the graph-net is removed from the model so that each molecular mechanics parameter is adjustable independently). To broadly check whether this observation was optimizer-specific vs. due to some model implementation / self-consistency issue, @yuanqing-wang kept the model and task the same but replaced Adam with L-BFGS. L-BFGS solved the task in a few steps. I think the take-away was that we may need to look at other optimizers than Adam when the output of the graph-net is being fed into a molecular mechanics model.
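(For concreteness, the swap amounts to something like the following; the model and data here are placeholders rather than the actual pinot code, and the main practical difference is that PyTorch's L-BFGS needs a closure that re-evaluates the loss.)

```python
import torch

model = torch.nn.Linear(10, 1)                 # placeholder for the MM-parameter model
x, y = torch.randn(64, 10), torch.randn(64, 1)
loss_fn = torch.nn.MSELoss()

# Adam: plain step loop
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(1000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

# L-BFGS: needs a closure that re-computes the loss and gradients
opt = torch.optim.LBFGS(model.parameters(), lr=1.0, max_iter=20)

def closure():
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    return loss

for _ in range(10):
    opt.step(closure)
```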

We're broadly curious about applying molecular-dynamics-flavored sampling and optimization methods. We're specifically interested in the recently proposed AdLaLa method, but don't have much numerical experience with this on any graph-net problems yet, with or without a molecular-mechanics model as a "layer." One suggestion in the paper is that AdLaLa may be more robust than Adam in some circumstances, even when assessed as an optimizer. I think @yuanqing-wang and @jchodera are interested in checking whether graph-nets and/or MM layers provide such circumstances. To assess this method's utility as a drop-in replacement for standard optimizers, an appropriate comparison is to Adam, rather than to HMC. To assess this method's utility as an approximate sampler, an appropriate comparison is to HMC.

@yuanqing-wang : Regarding checking that this implementation is an accurate clone of the paper implementation, I think it would be helpful if the class included comments or docstrings pointing to the corresponding update steps in the paper and the reference notebook.

It is helpful that there are already some line comments noting the intent of some code blocks, which helped with self-consistency review in the colab notebook last week. What I and other possible reviewers would ideally want to do is a side-by-side comparison of the authors' definitions and the current port, and anything that makes such a side-by-side comparison easier will help build confidence in the implementation. Aside from comparing implementation details with documented intent, another way to build confidence in a complicated implementation like this is to confirm that it behaves similarly to a reliable reference implementation, for instance by checking whether the toy-task loss trajectories look similar to the ones in e.g. Fig 9, whether weight histograms after running on the toy task look similar to those in Figs 6-7, and, most expensively, whether toy-task performance as a function of a couple of hyperparameters looks broadly similar to Fig 15.
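Concretely, a cheap version of the trajectory/histogram check could look something like the sketch below. This is only a sketch: the import path, the class name `AdLaLa`, its constructor arguments, and the `zero_grad()`/`step()` interface are assumptions about what adlala.py exposes, and the toy task is a stand-in rather than the paper's.

```python
import torch
import matplotlib.pyplot as plt

# hypothetical import path / class name; adjust to whatever adlala.py actually exports
from pinot.inference.adlala import AdLaLa

# toy regression task standing in for the paper's toy examples
torch.manual_seed(0)
x = torch.linspace(-1, 1, 200).unsqueeze(-1)
y = torch.sin(3 * x) + 0.1 * torch.randn_like(x)
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

opt = AdLaLa(net.parameters())   # assumed PyTorch-optimizer-like constructor
losses = []
for _ in range(2000):
    opt.zero_grad()
    loss = ((net(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
    losses.append(loss.item())

# eyeball these against the loss trajectories / weight histograms in the reference figures
plt.figure(); plt.plot(losses); plt.yscale("log"); plt.savefig("loss_trajectory.png")
weights = torch.cat([p.detach().flatten() for p in net.parameters()])
plt.figure(); plt.hist(weights.numpy(), bins=50); plt.savefig("weight_histogram.png")
```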

yuanqing-wang commented 4 years ago

Sorry, I already committed the code to master.

I've created another two branches for code review. I'll follow the guidelines closely from now on.

https://github.com/choderalab/pinot/pull/7