I'll take a look. The DGP implementation is very new and VI was just significantly refactored, so there's a decent chance the issue is on our end here.
I have just noticed that I am using PyTorch 1.3, but you are assuming <=1.2. Maybe that adds to the problem?
Everything is definitely compatible with 1.3, so that's unlikely.
You mention that the test LL doesn't always diverge: does it happen frequently or only occasionally? If it's relatively infrequent, I wonder how stable the original result is.
Still, I'll take a look at your example notebook and see if anything is going obviously wrong.
I do not have exact statistics, but with a smaller batch size (1024) I observe it quite regularly within 300 epochs. With the recommended batch size of 10k it happens less often, but still from time to time (the plot above is from the larger batch size).
@jacobrgardner - I'm going to chuck this issue in the todo for the 1.0 release - just to make sure that DeepGPs are all ready to go when we cut the release.
Hi, @jacobrgardner have you managed to replicate the issue?
@JanSochman Sorry, have been pretty busy finishing up my reviews for AISTATS. I'll try to look at this soon
@JanSochman I had a chance to look into this. I'm not totally convinced this isn't just overfitting after an extreme amount of training. Evidence for this:
Have you really been able to match the LL score from the paper?! In my case, e.g. for the kin8nm dataset, I am getting a max LL of about 0.89, whereas the paper shows an LL slightly higher than 1.31 for DGP 2. Actually, 0.89 would be one of the worst results of all the tested methods... (see Salimbeni, NIPS 2017, Fig. 1). I haven't been able to get closer to this number...
I am also running the notebook provided by the original authors for comparison; there does not seem to be any early stopping, and the values get much closer to the published ones...
Any idea how to reduce this overfitting? I have tried several tricks I found online, and I also went through the code and compared it to the original code by Salimbeni. There were some differences; however, none of them, when implemented as in the original code, helped with the overfitting... Not sure what else to try... :(
@JanSochman - are you using a fixed release of GPyTorch or the master branch? I would try updating to the master branch if you haven't already (a new release is coming out soon).
I was using the official 0.3.6 release (through pip install). I upgraded to the master branch yesterday, but the behavior is the same...
Good news! I upgraded to GPyTorch 1.0, started from scratch from the updated Deep GP notebook and it does not diverge anymore!
However, the test LLs differ widely from the original paper. Depending on the dataset, they can be either significantly higher or lower... Probably a difference in the model? Or the optimizer? Or does the LL computation depend on the experimental setup somehow? Or is GPyTorch just better at reaching higher LLs than GPflow?
Anyway, thanks for this update!
Hi @JanSochman - there is a model difference. Our deep GP notebook uses some skip connections (similar to a ResNet). I imagine this is what contributes to the different LLs.
Try removing the skip connection (see the __call__ method in the deep GP example) and see what happens.
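A minimal sketch of what that change might look like, based on the structure of the example notebook's hidden layer (the class name is mine, and the mean/kernel choices are just one reasonable configuration, not the exact notebook code):

```python
import torch
from gpytorch.models.deep_gps import DeepGPLayer
from gpytorch.means import LinearMean
from gpytorch.kernels import ScaleKernel, RBFKernel
from gpytorch.variational import CholeskyVariationalDistribution, VariationalStrategy
from gpytorch.distributions import MultivariateNormal


class DGPHiddenLayerNoSkip(DeepGPLayer):
    """Hidden layer with no skip connection (illustrative sketch)."""

    def __init__(self, input_dims, output_dims, num_inducing=100):
        batch_shape = torch.Size([output_dims])
        inducing_points = torch.randn(output_dims, num_inducing, input_dims)
        variational_distribution = CholeskyVariationalDistribution(
            num_inducing_points=num_inducing, batch_shape=batch_shape
        )
        variational_strategy = VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy, input_dims, output_dims)
        self.mean_module = LinearMean(input_dims, batch_shape=batch_shape)
        self.covar_module = ScaleKernel(
            RBFKernel(batch_shape=batch_shape, ard_num_dims=input_dims),
            batch_shape=batch_shape,
        )

    def forward(self, x):
        return MultivariateNormal(self.mean_module(x), self.covar_module(x))

    def __call__(self, x, *other_inputs, **kwargs):
        # The example notebook concatenates other_inputs onto x here before
        # calling the GP (the skip connection); ignoring other_inputs instead
        # removes the skip connection.
        return super().__call__(x, **kwargs)
```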
Sorry, I forgot to mention that I've already removed the skip connections... I also set the number of inducing points to 100. Any other differences that could cause this?
The thing that comes to mind is the number of samples drawn to make predictions (and the number of samples drawn at training time). I imagine this could cause the difference in LL.
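For reference, the evaluation-time sample count can be controlled with GPyTorch's num_likelihood_samples setting; the sketch below assumes a trained DeepGP called model and a held-out input tensor test_x (both placeholders):

```python
import torch
import gpytorch

# Placeholders: `model` is a trained DeepGP, `test_x` the held-out inputs.
model.eval()
with torch.no_grad(), gpytorch.settings.num_likelihood_samples(10):
    # The deep GP draws num_likelihood_samples samples per forward pass, so
    # this setting directly affects the Monte Carlo estimate of the test LL.
    preds = model.likelihood(model(test_x))
```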
I found it! The difference is here:
(page 6) "... we use 20-fold cross validation with a 10% randomly selected held out test set and scale the inputs and outputs to zero mean and unit standard deviation within the training set (we restore the output scaling for evaluation)."
After restoring the scaling, I am getting results similar to those in Salimbeni's paper.
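The adjustment amounts to a change of variables; a minimal sketch, assuming the targets were standardized with training-set statistics (variable names here are illustrative):

```python
import math

# y_std: training-set standard deviation of the targets.
# mean_test_ll_scaled: average per-point test log-likelihood computed on the
# standardized targets.
#
# If y_scaled = (y - y_mean) / y_std, the density transforms as
# p(y) = p(y_scaled) / y_std, so log p(y) = log p(y_scaled) - log(y_std).
mean_test_ll_original = mean_test_ll_scaled - math.log(y_std)
```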
Thanks for your help!
Awesome! Glad you figured it out :)
Hi, has anybody succeeded in replicating the results of the paper Doubly Stochastic Variational Inference for Deep Gaussian Processes by Salimbeni and Deisenroth in GPyTorch? There is an example DeepGP notebook referring to the paper, but when I run it on the datasets used by the paper, I often observe divergence in the test log-likelihood (the example here is from training on the kin8nm dataset).
The divergence does not occur every time, but I am not sure what its cause is, and I see no way to control it...
I am attaching my modified notebook with dataset loading, a model without residual connections, and the batch size and layer dimensions from the paper. Any idea what is happening here?
salimbeni_replication_issue.zip
Thanks, Jan