karalets opened this issue 4 years ago (status: Open)
@karalets
Sorry for being out of the loop for a few days.
I left a few experiment results here:
https://github.com/choderalab/pinot/blob/master/pinot/app/2020-04-05-173959786741/report.md
It seems that the model did not converge.
I did some sanity checks and it seems there are no issues when it comes to minimizing MSE alone.
Also, for this formulation, what would prevent the model from predicting a sigma close to zero and thereby yielding a very high likelihood, regardless of the mean point estimate?
Hey,
Thanks for adding this. Can we unpack these results and contrast them with MSE? Also, if you ignore the variance from the neural network output and use the log-likelihood with a fixed variance of 1.0, you should exactly recover the MSE loss.
i.e. dist = Normal(loc=theta_mean, scale=torch.ones(...))
Could you please run this as a sanity check quickly?
And to clarify why this is the case: look at the functional form of log(Gaussian(mu, 1)). With sigma fixed to 1, the sigma terms disappear and what remains, up to an additive constant, is exactly the MSE term that sits in the exponent of the Gaussian.
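For concreteness, a minimal sketch (my own, not from the thread) checking with torch.distributions that the fixed-variance Gaussian log-likelihood and the squared error differ only by a constant:

```python
import math

import torch
from torch.distributions import Normal

y_pred = torch.randn(5)   # predicted means
y_true = torch.randn(5)   # observed values

# log N(y | mu, sigma=1) = -0.5 * (y - mu)**2 - 0.5 * log(2 * pi)
log_prob = Normal(loc=y_pred, scale=torch.ones_like(y_pred)).log_prob(y_true)
by_hand = -0.5 * (y_true - y_pred) ** 2 - 0.5 * math.log(2 * math.pi)

assert torch.allclose(log_prob, by_hand)
# maximizing this log-likelihood is minimizing MSE, up to an additive constant
```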
https://github.com/choderalab/pinot/blob/master/pinot/app/2020-04-10-151826363179/report.md
Yes, fixing sigma to 1 makes it train just fine.
Great, that means we have numerical instabilities with learning sigma.
One thing immediately comes to mind: sigma has to be positive, so typically we learn a transformed version of it with NNs that can only map to positive values.
i.e. bla = nn(x); sigma = exp(bla) -> N(mu, sigma) is well-defined, as sigma is ensured to be positive.
It turns out that bla = log(sigma) implicitly in that parametrization (but one never writes that).
Is that how you ensure positivity of sigma? Can you point me to the code?
Now I've constrained it to be positive by adding a parameter transformation function.
https://github.com/choderalab/pinot/blob/2f2996311fd418f24b3f1a9c94aa654f39605d95/pinot/net.py#L23
But numerical stability aside, I still don't understand why the model wouldn't predict a super small sigma and thus get a huge likelihood value, at least for some of the data points.
You misunderstand the purpose of the noise model.
It can only do this if there is no measurement error whatsoever and the mean prediction can always be exact. If, however, there is any error, it needs to learn a noise shape.
This entire discussion is about the shape of the observation noise. A model would never learn an infinitely small sigma unless its outputs are noiseless and it can predict them perfectly, in which case an infinitely small sigma is perfectly appropriate.
In words: the sigma here represents the model's belief about the noisiness of the outputs. This is not model uncertainty; it is irreducible uncertainty that comes from the noise itself.
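To make this concrete (a sketch of my own, not from the thread): for a single residual r = y - mu, the per-point negative log-likelihood is log(sigma) + r^2 / (2 sigma^2) up to a constant, which blows up as sigma -> 0 whenever r != 0 and is minimized at sigma = |r|:

```python
import torch

residual = 0.5  # |y - mu| for one data point with an imperfect prediction
sigmas = torch.tensor([1e-3, 0.1, 0.5, 1.0, 5.0])

# per-point Gaussian negative log-likelihood, dropping the 0.5*log(2*pi) constant
nll = torch.log(sigmas) + residual ** 2 / (2 * sigmas ** 2)
print(nll)  # enormous at sigma=1e-3; smallest at sigma = |residual| = 0.5
```

So shrinking sigma is only rewarded where the mean prediction is actually good.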
I think I'm going to try other means to constrain sigma to be positive, namely exp rather than abs.
Again, please point me to the code now; there are best practices here that I would like to encourage. This is almost certainly the source of the behavior you are encountering.
Also if you look into my example above in this issue, I showed how to do this exactly in code and I never used an abs.
reminder:

```python
import torch
from torch.distributions import Normal

def condition(self, X):
    theta = self.forward(X)      # network output: [mean, log-sigma]
    mu = theta[0]
    sigma = torch.exp(theta[1])  # exp guarantees a strictly positive sigma
    dist = Normal(mu, sigma)
    return dist
```
Yes. The only difference is abs rather than exp. I'll change it to exp now.
https://github.com/choderalab/pinot/blob/master/pinot/net.py#L23
Yeah abs is no good. Use Exp (you could also do softplus).
Abs has problems with gradients.
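A sketch of the options mentioned above (my own illustration; the variable names are made up): softplus(x) = log(1 + exp(x)) is smooth everywhere and strictly positive, whereas abs has a kink at zero and can return exactly zero.

```python
import torch
import torch.nn.functional as F

raw = torch.randn(4)            # unconstrained network output (illustrative)

sigma_exp = torch.exp(raw)      # exp: raw is implicitly log(sigma), always > 0
sigma_sp = F.softplus(raw)      # softplus: log(1 + exp(raw)), smooth and > 0
sigma_abs = torch.abs(raw)      # abs: kink at 0, can return exactly 0
```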
Please retry training after this and compare to the likelihoods and errors you get when setting sigma to 1.
Both are converging though, judging from the loss trajectories:
sigma=1: https://github.com/choderalab/pinot/blob/master/pinot/app/2020-04-10-151826363179/loss.png
free sigma: https://github.com/choderalab/pinot/blob/master/pinot/app/2020-04-10-165531617559/loss.png
Let me just run this for a few more epochs.
That makes no sense... maybe play with hyperparameters and plot some loss curves to inspect what is actually happening?
MSE is clearly the wrong model; it can't be that a modestly better model is so much worse, especially given that the mean is still the same and all this does is rescale the error residuals.
Seems that for the Gaussian model the results are not consistent across multiple experiments. Looking into that.
Ah, adding a small epsilon to sigma seems to fix it.
OK, then there is a numerical issue somewhere; that object should not even need the epsilon, especially since the epsilon changes the noise model to be 'at least epsilon'.
You could also try lower bounding the noise, i.e. by clipping in pytorch to be at least some reasonable value.
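For instance (a sketch, not the actual pinot code; names are illustrative):

```python
import torch
from torch.distributions import Normal

log_sigma = torch.randn(4)                     # unconstrained head output (illustrative)
sigma = torch.exp(log_sigma).clamp(min=1e-3)   # noise is now 'at least 1e-3'

dist = Normal(loc=torch.zeros(4), scale=sigma)
```

Note that where the clamp is active the gradient with respect to log_sigma is zero, so this is a blunt instrument, but it is enough to separate numerical issues from modeling issues.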
Question: how are we going to preserve this sigma once we move on to Bayesian methods, namely sampling using Monte Carlo?
Going back to this issue a bit. Seems that the behavior is not consistent across multiple experiments.
@karalets @maxentile
The scripts for this experiment are here: https://colab.research.google.com/drive/1B_-M0Dn9TekYfEFkJD4QppkpJmx4h4iE
Sorry for slow response!
A couple of questions:
- Do you have a prior defined on sigma (or log_sigma)?
- Could you comment on what exactly is different between the multiple experiments plotted above? (different random seed only, or different data subsample, or ...)
- For the experiments where you plot RMSE as a function of iteration, could you also plot the log_probability and the sigma (or log_sigma) parameter? These may be diagnostic.

All this should not be a problem once the pipeline is cleaned up enough that we can quickly iterate on some things.
Also: we had a manual chat on this last week with YQ. I would advocate first trying simple things like using a globally learned sigma, and, in both the homoscedastic and heteroscedastic sigma cases, clamping sigma to 'simulate' a constraint. If it trains better with some clamping to min-max values, that will help.
We will put priors on everything in a future iteration, and priors are effectively a softer version of such clamping, but just identifying the issue is easily done with simple things like constraints.
Once this is clarified, we won't need the additive epsilon that YQ introduced.
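For reference, a minimal sketch of the two variants (class names are mine, not pinot's):

```python
import torch
import torch.nn as nn

class HomoscedasticHead(nn.Module):
    """One globally learned sigma, shared across all inputs."""
    def __init__(self, regressor):
        super().__init__()
        self.regressor = regressor                     # maps features -> mean
        self.log_sigma = nn.Parameter(torch.zeros(1))  # single global parameter

    def forward(self, x):
        mu = self.regressor(x)
        return mu, torch.exp(self.log_sigma).expand_as(mu)

class HeteroscedasticHead(nn.Module):
    """A per-input sigma predicted by a second head."""
    def __init__(self, regressor, sigma_head):
        super().__init__()
        self.regressor = regressor    # maps features -> mean
        self.sigma_head = sigma_head  # maps features -> log sigma

    def forward(self, x):
        return self.regressor(x), torch.exp(self.sigma_head(x))
```

The homoscedastic version adds only a single extra scalar to learn, which makes it much easier to see whether learning sigma itself is the problem.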
Tried all the options here: https://colab.research.google.com/drive/1B_-M0Dn9TekYfEFkJD4QppkpJmx4h4iE?usp=sharing
pinot.Net
Also tagging @dnguyen1196 here to be aware of these discussions.
Although I'm not sure whether this is representative, the heteroscedastic model in this case indeed maximizes the likelihood of the training data, but it does not seem to generalize well.
These are trained on the entire ESOL dataset for 2000 epochs.
(Pardon the typo in the last section: sigma is trainable for the third experiment.)
https://github.com/choderalab/pinot/blob/e1d326043f2157f8572518a609c981cb0f444941/pinot/app/gaussian_variational.py#L16
Here, you define a loss by handwriting some math. This is doable, but I have some problems with it.
First, the loss is basically the negative log-likelihood of the output noise model. As such, it is a model property and should live inside the model.
This leads me to the second point: we should probably have models be classes here, i.e.
`class Molecule_graph_nn(nn.Module): ...`
Third, while I highly appreciate you writing the math directly here, I also do recommend using some things that may be less error-prone.
In particular, my team wrote the distributions library for pytorch: https://pytorch.org/docs/stable/distributions.html
Here, you can do the following quite nicely (see https://pytorch.org/docs/stable/distributions.html#normal)
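Roughly along these lines (a sketch rather than the exact snippet from the comment; the function name is mine, and it assumes the network outputs a mean and a log-sigma):

```python
import torch
from torch.distributions import Normal

def gaussian_nll(mu, log_sigma, y):
    # negative log-likelihood of measurements y under the predicted noise model
    dist = Normal(loc=mu, scale=torch.exp(log_sigma))
    return -dist.log_prob(y).mean()
```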
Even better would be having this fully modularized, i.e.:
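A sketch of what that could look like (class and attribute names are my own, following the condition example earlier in the thread, not pinot's actual API):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianRegressor(nn.Module):
    def __init__(self, representation):
        super().__init__()
        # representation: any module mapping inputs to (mu, log_sigma)
        self.representation = representation

    def condition(self, x):
        mu, log_sigma = self.representation(x)
        return Normal(loc=mu, scale=torch.exp(log_sigma))

    def loss(self, x, y):
        # negative log-likelihood of the measurements under the noise model
        return -self.condition(x).log_prob(y).mean()
```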
Then you can access loss from your model:
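With a class like the hypothetical GaussianRegressor sketched above, that would be something like:

```python
loss = model.loss(x_batch, y_batch)  # model: a GaussianRegressor; names are illustrative
loss.backward()
```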
Edit: I just noticed I omitted explicitly accounting for the measurements m in here during the loss etc. Let me know if I should edit to clarify or whether it is self-explanatory.