karalets opened this issue 4 years ago (status: Open)
@karalets
Sorry for being out of the loop for a few days.
I left a few experiment results here:
https://github.com/choderalab/pinot/blob/master/pinot/app/2020-04-05-173959786741/report.md
It seems that the model did not converge.
I did some sanity checks and it seems there are no issues when it comes to minimizing MSE alone.
Also, for this formulation, what would prevent the model from predicting a sigma close to zero and thereby yielding a very high likelihood, regardless of the mean point estimate?
Hey,
Thanks for adding this. Can we unpack these results and contrast them with MSE? Also, if you ignore the variance from the neural network output and use the log-likelihood with a fixed variance of 1.0, you should exactly recover the MSE loss.
i.e. dist = Normal(loc=theta_mean, scale=torch.ones(...))
Could you please run this as a sanity check quickly?
And to clarify why this is the case: look at the functional form of log(Gaussian(mu, 1)). With sigma fixed to 1, the sigma terms disappear and what remains, up to an additive constant, is exactly the MSE term that sits in the exponent of the Gaussian.
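For concreteness, a minimal sketch (my own, not from the thread) checking with torch.distributions that the fixed-variance Gaussian log-likelihood and the squared error differ only by a constant:

```python
import math

import torch
from torch.distributions import Normal

y_pred = torch.randn(5)   # predicted means
y_true = torch.randn(5)   # observed values

# log N(y | mu, sigma=1) = -0.5 * (y - mu)**2 - 0.5 * log(2 * pi)
log_prob = Normal(loc=y_pred, scale=torch.ones_like(y_pred)).log_prob(y_true)
by_hand = -0.5 * (y_true - y_pred) ** 2 - 0.5 * math.log(2 * math.pi)

assert torch.allclose(log_prob, by_hand)
# maximizing this log-likelihood is minimizing MSE, up to an additive constant
```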
https://github.com/choderalab/pinot/blob/master/pinot/app/2020-04-10-151826363179/report.md
Yes, fixing sigma to 1 makes it train just fine.
Great, that means we have numerical instabilities with learning sigma.
One thing immediately comes to mind: sigma has to be positive, so typically we learn a transformed version of it with NNs that can only map to positive values.
i.e. bla = nn(x); sigma = exp(bla) -> N(mu, sigma) is well-defined, as sigma is ensured to be positive.
It turns out that bla = log(sigma) implicitly in that parametrization (but one never writes that).
Is that how you ensure positivity of sigma? Can you point me to the code?
Now I've constrained it to be positive by adding a parameter transformation function.
https://github.com/choderalab/pinot/blob/2f2996311fd418f24b3f1a9c94aa654f39605d95/pinot/net.py#L23
But numerical stability aside, I still don't understand why the model wouldn't predict a super small sigma and thus get a huge likelihood value, at least for some of the data points.
You misunderstand the purpose of the noise model.
It can only do this if there is no measurement error whatsoever and the mean prediction can always be exact. If, however, there is any error, it needs to learn a noise shape.
This entire discussion is about the shape of the observation noise. A model would never learn an infinitely small sigma unless its outputs are noiseless and it can predict them perfectly, in which case an infinitely small sigma is perfectly appropriate.
In words: the sigma here represents the model's belief about the noisiness of the outputs. This is not model uncertainty; it is irreducible uncertainty that comes from the noise itself.
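To make this concrete (a sketch of my own, not from the thread): for a single residual r = y - mu, the per-point negative log-likelihood is log(sigma) + r^2 / (2 sigma^2) up to a constant, which blows up as sigma -> 0 whenever r != 0 and is minimized at sigma = |r|:

```python
import torch

residual = 0.5  # |y - mu| for one data point with an imperfect prediction
sigmas = torch.tensor([1e-3, 0.1, 0.5, 1.0, 5.0])

# per-point Gaussian negative log-likelihood, dropping the 0.5*log(2*pi) constant
nll = torch.log(sigmas) + residual ** 2 / (2 * sigmas ** 2)
print(nll)  # enormous at sigma=1e-3; smallest at sigma = |residual| = 0.5
```

So shrinking sigma is only rewarded where the mean prediction is actually good.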
I think I'm going to try other means to constrain sigma to be positive, namely exp rather than abs.
Again, please point me to the code now; there are best practices here that I would like to encourage. This is almost certainly the source of the behavior you are encountering.
Also if you look into my example above in this issue, I showed how to do this exactly in code and I never used an abs.
reminder:

```python
import torch
from torch.distributions import Normal

def condition(self, X):
    theta = self.forward(X)      # network output: [mean, log-sigma]
    mu = theta[0]
    sigma = torch.exp(theta[1])  # exp guarantees a strictly positive sigma
    dist = Normal(mu, sigma)
    return dist
```
Yes. The only difference is abs rather than exp. I'll change it to exp now.
https://github.com/choderalab/pinot/blob/master/pinot/net.py#L23
Yeah abs is no good. Use Exp (you could also do softplus).
Abs has problems with gradients.
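A sketch of the options mentioned above (my own illustration; the variable names are made up): softplus(x) = log(1 + exp(x)) is smooth everywhere and strictly positive, whereas abs has a kink at zero and can return exactly zero.

```python
import torch
import torch.nn.functional as F

raw = torch.randn(4)            # unconstrained network output (illustrative)

sigma_exp = torch.exp(raw)      # exp: raw is implicitly log(sigma), always > 0
sigma_sp = F.softplus(raw)      # softplus: log(1 + exp(raw)), smooth and > 0
sigma_abs = torch.abs(raw)      # abs: kink at 0, can return exactly 0
```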
Please retry training after this and compare to the likelihoods and errors you get when setting sigma to 1.
Both are converging though, judging from the loss trajectories:
sigma=1: https://github.com/choderalab/pinot/blob/master/pinot/app/2020-04-10-151826363179/loss.png
free sigma: https://github.com/choderalab/pinot/blob/master/pinot/app/2020-04-10-165531617559/loss.png
Let me just run this for a few more epochs.
That makes no sense... maybe play with hyperparameters and plot some loss curves to inspect what is actually happening?
MSE is clearly the wrong model; it can't be that a modestly better model is so much worse, especially given that the mean is still the same and all this does is rescale the error residuals.
Seems that for the Gaussian model the results are not consistent across multiple experiments. Looking into that.
Ah, adding a small epsilon to sigma seems to fix it.
OK, then there is a numerical issue somewhere; that object should not even need the epsilon, especially since the epsilon changes the noise model to be 'at least epsilon'.
You could also try lower bounding the noise, i.e. by clipping in pytorch to be at least some reasonable value.
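For instance (a sketch, not the actual pinot code; names are illustrative):

```python
import torch
from torch.distributions import Normal

log_sigma = torch.randn(4)                     # unconstrained head output (illustrative)
sigma = torch.exp(log_sigma).clamp(min=1e-3)   # noise is now 'at least 1e-3'

dist = Normal(loc=torch.zeros(4), scale=sigma)
```

Note that where the clamp is active the gradient with respect to log_sigma is zero, so this is a blunt instrument, but it is enough to separate numerical issues from modeling issues.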
Question: how are we going to preserve this sigma once we move on to Bayesian methods, namely sampling using Monte Carlo?
Going back to this issue a bit. Seems that the behavior is not consistent across multiple experiments.
@karalets @maxentile
The scripts for this experiment are here: https://colab.research.google.com/drive/1B_-M0Dn9TekYfEFkJD4QppkpJmx4h4iE
Sorry for slow response!
A couple of questions:
- Do you have a prior defined on sigma (or log_sigma)?
- Could you comment on what exactly is different between the multiple experiments plotted above? (different random seed only, or different data subsample, or ...)
- For the experiments where you plot RMSE as a function of iteration, could you also plot the log_probability and the sigma (or log_sigma) parameter? These may be diagnostic.

All this should not be a problem once the pipeline is cleaned up enough that we can quickly iterate on some things.
Also: we had a manual chat on this last week with YQ. I would advocate first trying simple things like using a globally learned sigma, and, in both the homoscedastic and heteroscedastic sigma cases, clamping sigma to 'simulate' a constraint. If it trains better with some clamping to min-max values, that will help.
We will put priors on everything in a future iteration, and priors are effectively a softer version of such clamping, but just identifying the issue is easily done with simple things like constraints.
Once this is clarified, we won't need the additive epsilon that YQ introduced.
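For reference, a minimal sketch of the two variants (class names are mine, not pinot's):

```python
import torch
import torch.nn as nn

class HomoscedasticHead(nn.Module):
    """One globally learned sigma, shared across all inputs."""
    def __init__(self, regressor):
        super().__init__()
        self.regressor = regressor                     # maps features -> mean
        self.log_sigma = nn.Parameter(torch.zeros(1))  # single global parameter

    def forward(self, x):
        mu = self.regressor(x)
        return mu, torch.exp(self.log_sigma).expand_as(mu)

class HeteroscedasticHead(nn.Module):
    """A per-input sigma predicted by a second head."""
    def __init__(self, regressor, sigma_head):
        super().__init__()
        self.regressor = regressor    # maps features -> mean
        self.sigma_head = sigma_head  # maps features -> log sigma

    def forward(self, x):
        return self.regressor(x), torch.exp(self.sigma_head(x))
```

The homoscedastic version adds only a single extra scalar to learn, which makes it much easier to see whether learning sigma itself is the problem.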
Tried all the options here: https://colab.research.google.com/drive/1B_-M0Dn9TekYfEFkJD4QppkpJmx4h4iE?usp=sharing
pinot.Net
Also tagging @dnguyen1196 here to be aware of these discussions.
Although I'm not sure whether this is representative, the heteroscedastic model in this case indeed maximizes the likelihood of the training data, but it does not seem to generalize well.
These are trained on the entire ESOL dataset for 2000 epochs.
(Pardon the typo in the last section: sigma is trainable for the third experiment.)
https://github.com/choderalab/pinot/blob/e1d326043f2157f8572518a609c981cb0f444941/pinot/app/gaussian_variational.py#L16
Here, you define a loss by handwriting some math. This is doable, but I have some problems with it.
First, the loss is basically the negative log-likelihood of the output noise model. As such, it is a model property and should live inside the model.
This leads me to the second point: we should probably have models be classes here, i.e.
`class Molecule_graph_nn(nn.Module): ...`
Third, while I highly appreciate you writing the math directly here, I also do recommend using some things that may be less error-prone.
In particular, my team wrote the distributions library for pytorch: https://pytorch.org/docs/stable/distributions.html
Here, you can do the following quite nicely (see https://pytorch.org/docs/stable/distributions.html#normal)
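Roughly along these lines (a sketch rather than the exact snippet from the comment; the function name is mine, and it assumes the network outputs a mean and a log-sigma):

```python
import torch
from torch.distributions import Normal

def gaussian_nll(mu, log_sigma, y):
    # negative log-likelihood of measurements y under the predicted noise model
    dist = Normal(loc=mu, scale=torch.exp(log_sigma))
    return -dist.log_prob(y).mean()
```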
Even better would be having this fully modularized, i.e.:
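A sketch of what that could look like (class and attribute names are my own, following the condition example earlier in the thread, not pinot's actual API):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianRegressor(nn.Module):
    def __init__(self, representation):
        super().__init__()
        # representation: any module mapping inputs to (mu, log_sigma)
        self.representation = representation

    def condition(self, x):
        mu, log_sigma = self.representation(x)
        return Normal(loc=mu, scale=torch.exp(log_sigma))

    def loss(self, x, y):
        # negative log-likelihood of the measurements under the noise model
        return -self.condition(x).log_prob(y).mean()
```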
Then you can access loss from your model:
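With a class like the hypothetical GaussianRegressor sketched above, that would be something like:

```python
loss = model.loss(x_batch, y_batch)  # model: a GaussianRegressor; names are illustrative
loss.backward()
```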
Edit: I just noticed I omitted explicitly accounting for the measurements m in here during the loss etc. Let me know if I should edit to clarify or whether it is self-explanatory.