cornellius-gp / gpytorch

A highly efficient implementation of Gaussian Processes in PyTorch
MIT License

Replicating results presented in Doubly Stochastic Variational Inference for Deep Gaussian Processes #939

Closed JanSochman closed 4 years ago

JanSochman commented 5 years ago

Hi, has anybody succeeded in replicating the results of the paper Doubly Stochastic Variational Inference for Deep Gaussian Processes by Salimbeni and Deisenroth in GPyTorch? There is an example DeepGP notebook referring to the paper, but when I run it on the datasets used in the paper I often observe divergence in the test log-likelihood (the plot below shows an example from training on the kin8nm dataset).

[Plot: test log-likelihood while training on the kin8nm dataset]

The divergence does not occur every run, but I am not sure what causes it and I see no way to control it...

I am attaching my modified notebook, which reads the datasets and uses a model without residual connections, with the batch size and layer dimensions from the paper. Any idea what is happening here?
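
Roughly, the model I am using looks like this (a simplified sketch of the layer/model classes, not the exact code from my notebook; the real layer widths, batch size and number of inducing points follow the paper):

```python
import torch
from gpytorch.models.deep_gps import DeepGP, DeepGPLayer
from gpytorch.means import ConstantMean, LinearMean
from gpytorch.kernels import RBFKernel, ScaleKernel
from gpytorch.variational import CholeskyVariationalDistribution, VariationalStrategy
from gpytorch.distributions import MultivariateNormal
from gpytorch.likelihoods import GaussianLikelihood


class DGPLayer(DeepGPLayer):
    """A single approximate GP layer; no skip/residual connection is added in __call__."""

    def __init__(self, input_dims, output_dims, num_inducing=100, mean_type="linear"):
        if output_dims is None:
            # Final layer: a single (non-batched) output GP
            inducing_points = torch.randn(num_inducing, input_dims)
            batch_shape = torch.Size([])
        else:
            # Hidden layer: one GP per output dimension
            inducing_points = torch.randn(output_dims, num_inducing, input_dims)
            batch_shape = torch.Size([output_dims])

        variational_distribution = CholeskyVariationalDistribution(
            num_inducing_points=num_inducing, batch_shape=batch_shape
        )
        variational_strategy = VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy, input_dims, output_dims)

        self.mean_module = LinearMean(input_dims) if mean_type == "linear" else ConstantMean(batch_shape=batch_shape)
        self.covar_module = ScaleKernel(
            RBFKernel(batch_shape=batch_shape, ard_num_dims=input_dims), batch_shape=batch_shape
        )

    def forward(self, x):
        return MultivariateNormal(self.mean_module(x), self.covar_module(x))


class TwoLayerDGP(DeepGP):
    """DGP 2: input -> hidden GP layer -> output GP layer, without residual connections."""

    def __init__(self, input_dims, hidden_dims):
        super().__init__()
        self.hidden_layer = DGPLayer(input_dims, hidden_dims, mean_type="linear")
        self.last_layer = DGPLayer(hidden_dims, None, mean_type="constant")
        self.likelihood = GaussianLikelihood()

    def forward(self, inputs):
        # Only the hidden representation is fed forward -- no concatenation with the inputs.
        return self.last_layer(self.hidden_layer(inputs))
```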

salimbeni_replication_issue.zip

Thanks, Jan

jacobrgardner commented 5 years ago

I'll take a look. The DGP implementation is very new and VI was just significantly refactored, so there's a decent chance the issue is on our end here.

JanSochman commented 5 years ago

I have just noticed that I am using PyTorch 1.3, but you are assuming <=1.2. Maybe that adds to the problem?

jacobrgardner commented 5 years ago

Everything is definitely compatible with 1.3, so that's unlikely.

You mention that the test LL doesn't always diverge: does it happen frequently or only occasionally? If it's relatively infrequent, I wonder how stable the original result is.

Still, I'll take a look at your example notebook and see if anything is going obviously wrong.

JanSochman commented 5 years ago

I do not have exact statistics, but with a smaller batch size (1024) I see it quite regularly within 300 epochs. With the recommended batch size of 10k it happens less often, but it still happens from time to time (the plot above is from the large batch size).
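
For concreteness, the training loop is just the standard minibatch setup (a sketch, not my exact code; `model`, `train_x` and `train_y` are assumed to be the DGP and dataset tensors from my notebook):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from gpytorch.mlls import DeepApproximateMLL, VariationalELBO

# batch_size=1024 diverges quite regularly for me; batch_size=10000 less often.
train_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=1024, shuffle=True)

mll = DeepApproximateMLL(VariationalELBO(model.likelihood, model, num_data=train_y.size(0)))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # learning rate is a placeholder

model.train()
for epoch in range(300):
    for x_batch, y_batch in train_loader:
        optimizer.zero_grad()
        loss = -mll(model(x_batch), y_batch)  # negative ELBO estimate for the minibatch
        loss.backward()
        optimizer.step()
```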

gpleiss commented 5 years ago

@jacobrgardner - I'm going to chuck this issue in the todo for the 1.0 release - just to make sure that DeepGPs are all ready to go when we cut the release.

JanSochman commented 5 years ago

Hi @jacobrgardner, have you managed to replicate the issue?

jacobrgardner commented 5 years ago

@JanSochman Sorry, have been pretty busy finishing up my reviews for AISTATS. I'll try to look at this soon

jacobrgardner commented 4 years ago

@JanSochman I had a chance to look into this. I'm not totally convinced this isn't just overfitting after an extreme amount of training. Evidence for this:

  1. The training loss, training RMSE, and test RMSE all continue to go down over the course of training. This suggests optimization is working and the training itself is not diverging.
  2. If I look at the test LL at the point in training where the RMSE roughly matches what the paper reports, that LL roughly matches the paper as well. This makes me wonder if they maybe did some sort of early stopping, which is pretty common.
  3. Roughly around the time the test LL starts to decrease, the KL divergence term in the ELBO starts increasing while the expected log likelihood term keeps improving. This suggests we are really fitting the training data an awful lot (a rough way to track the two terms separately is sketched below).
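
Something like this is enough to watch the two terms (a sketch, not my exact code; it assumes the default `beta=1` scaling in the ELBO, that `model` is a DeepGP built from `DeepGPLayer` modules, and that `mll` is the usual `DeepApproximateMLL(VariationalELBO(...))`):

```python
from gpytorch.models.deep_gps import DeepGPLayer


def elbo_terms(model, mll, x_batch, y_batch, num_data):
    """Split a minibatch ELBO estimate into its data-fit and KL pieces.

    The (per-data-point) objective is roughly  E_q[log p(y | f)] - KL(q || p) / num_data,
    so a growing KL together with a still-improving ELBO means the data-fit term is doing
    all the work, i.e. the model keeps fitting the training data harder and harder.
    """
    elbo = mll(model(x_batch), y_batch)  # expected log lik minus the (scaled) total KL
    kl = sum(  # total KL over every layer's variational distribution
        layer.variational_strategy.kl_divergence().sum()
        for layer in model.modules()
        if isinstance(layer, DeepGPLayer)
    )
    expected_log_lik = elbo + kl / num_data
    return expected_log_lik.item(), kl.item()
```
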
JanSochman commented 4 years ago

Have you really been able to match the LL score from the paper?! In my case, for example on the kin8nm dataset, I am getting a max LL of about 0.89, whereas the paper shows an LL slightly above 1.31 for DGP 2. Actually, 0.89 would be one of the worst results of all the tested methods (see Salimbeni, NIPS 2017, Fig. 1). I haven't been able to get closer to that number...

I am also running the notebook provided by the original authors for comparison; there does not seem to be any early stopping in it, and its values get much closer to the published ones...

JanSochman commented 4 years ago

Any idea how to reduce this overfitting? I have tried several tricks I found online, and I also went through the code and compared it to Salimbeni's original code. There were some differences, but none of them, when implemented as in the original code, helped with the overfitting... Not sure what else to try... :(

gpleiss commented 4 years ago

@JanSochman - are you using a fixed release of GPyTorch or the master branch? I would try updating to the master branch if you haven't already (a new release is coming out soon).

JanSochman commented 4 years ago

I was using the official 0.3.6 release (through pip install). I upgraded to the master branch yesterday, but the behavior is the same...

JanSochman commented 4 years ago

Good news! I upgraded to GPyTorch 1.0, started from scratch from the updated Deep GP notebook and it does not diverge anymore!

However, the test LLs are quite different from the original paper. Depending on the dataset, they can be either significantly higher or significantly lower... Probably a difference in the model? Or the optimizer? Or does the LL computation depend on the experimental setup somehow? Or is GPyTorch just better at reaching high LLs than GPflow?

Anyway, thanks for this update!

gpleiss commented 4 years ago

Hi @JanSochman - there is a model difference. Our deep GP notebook uses some skip connections (similar to a resnet). I imagine this is what contributes to the different LLs.

Try removing the skip connection (see the __call__ method in the deep GP example) and see what happens.
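
Roughly, the pattern in the notebook looks like this (a simplified sketch, not the exact notebook code):

```python
import torch
import gpytorch
from gpytorch.models.deep_gps import DeepGPLayer


class HiddenLayerWithSkip(DeepGPLayer):
    # __init__ and forward as in the example notebook; only __call__ is shown here.

    def __call__(self, x, *other_inputs, **kwargs):
        # The "skip connection": any extra tensors (e.g. the original inputs) get
        # concatenated onto the samples coming out of the previous layer.
        if len(other_inputs):
            if isinstance(x, gpytorch.distributions.MultitaskMultivariateNormal):
                x = x.rsample()
            processed = [
                inp.unsqueeze(0).expand(
                    gpytorch.settings.num_likelihood_samples.value(), *inp.shape
                )
                for inp in other_inputs
            ]
            x = torch.cat([x] + processed, dim=-1)
        return super().__call__(x, are_samples=bool(len(other_inputs)))


# With the skip connection, the model's forward passes the inputs back in:
#     output = self.last_layer(self.hidden_layer(inputs), inputs)
# Without it, later layers only see the previous layer's output:
#     output = self.last_layer(self.hidden_layer(inputs))
```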

JanSochman commented 4 years ago

Sorry, I forgot to mention that I had already removed the skip connections... I also set the number of inducing points to 100. Any other differences that could cause this?

gpleiss commented 4 years ago

The first thing that comes to mind is the number of samples drawn to make predictions (and the number of samples drawn at training time). I imagine this could cause the difference in LL.
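
Both are controlled via the `gpytorch.settings.num_likelihood_samples` context manager, e.g. (a sketch; `model`, `mll`, `x_batch`, `y_batch` and `test_x` are assumed from the training setup discussed above):

```python
import torch
import gpytorch

# Samples propagated through the deep GP when computing the training loss
with gpytorch.settings.num_likelihood_samples(10):
    loss = -mll(model(x_batch), y_batch)

# More samples at prediction time gives a lower-variance estimate of the predictive density
model.eval()
with torch.no_grad(), gpytorch.settings.num_likelihood_samples(100):
    preds = model.likelihood(model(test_x))
```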

JanSochman commented 4 years ago

I found it! The difference is here:

(page 6) "... we use 20-fold cross validation with a 10% randomly selected held out test set and scale the inputs and outputs to zero mean and unit standard deviation within the training set (we restore the output scaling for evaluation)."

After restoring the output scaling I am getting results similar to those in Salimbeni's paper.
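
In other words, because the outputs were standardized as y_tilde = (y - y_mean) / y_std, every test log-density needs the change-of-variables correction -log(y_std). Roughly what I ended up computing (a sketch, not my exact code; `sample_means` and `sample_vars` are the per-sample Gaussian predictive means/variances in the standardized space, and `y_mean` / `y_std` are the training-set statistics):

```python
import math
import torch


def test_log_likelihood(sample_means, sample_vars, y_test, y_mean, y_std):
    """Average per-point test log-likelihood in the ORIGINAL output scale.

    sample_means / sample_vars: tensors of shape (num_samples, num_test) from the DGP;
    y_test: test targets in the original scale; y_mean / y_std: floats used to
    standardize the outputs.  Since y = y_std * y_tilde + y_mean, each log-density
    picks up a -log(y_std) Jacobian term when mapped back to the original scale.
    """
    y_tilde = (y_test - y_mean) / y_std  # map test targets into the standardized space
    log_probs = -0.5 * (
        math.log(2 * math.pi) + sample_vars.log() + (y_tilde - sample_means) ** 2 / sample_vars
    )
    # Mixture over the DGP samples (log-mean-exp), then the scaling correction
    mixture = torch.logsumexp(log_probs, dim=0) - math.log(sample_means.size(0))
    return (mixture - math.log(y_std)).mean().item()
```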

Thanks for your help!

gpleiss commented 4 years ago

Awesome! Glad you figured it out :)