Facing the same issues with 1. and 2.
For 2., I'm using la.optimize_prior_precision(method='marglik').
For 3., the author mentions that they don't think it makes sense (though I think it could be useful...): https://github.com/AlexImmer/Laplace/pull/58#issue-1067104633
Thanks @FrederikWarburg for raising this issue and @Phoveran for commenting, and thanks to both for your interest in the subnetwork Laplace feature, this is very much appreciated (also sorry for the late response due to the ICML deadline)!
I just opened a PR #87 that addresses your comments. Feel free to take a look at the PR and try out the corresponding branch; let me know if you encounter any further issues (either here or directly in the PR).
Detailed comments:
Regarding 4): BackPACK only fully supports torch.nn.Sequential models; see this page of their documentation for details. I'm not sure what exactly the warning means, but I wouldn't expect it to be critical. Feel free to raise an issue on the BackPACK repo if you run into problems with your models or have questions about their model support. We also support the ASDL library as a second backend, which covers a different set of models and might work in some cases where BackPACK doesn't (and vice versa). Feel free to play around with different backends if things don't work. You can do so by passing the backend argument to Laplace(); possible values include BackPackGGN, BackPackEF, AsdlGGN, and AsdlEF (all in laplace.curvature).
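For example, switching to the ASDL GGN backend looks roughly like this (a sketch only; model and train_loader are assumed to be defined, and the other arguments are placeholders):

```python
from laplace import Laplace
from laplace.curvature import AsdlGGN

# Same construction as usual, but with the ASDL GGN backend instead of
# the default BackPACK one.
la = Laplace(model, 'classification',
             subset_of_weights='all',
             hessian_structure='kron',
             backend=AsdlGGN)
la.fit(train_loader)
```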
Thanks for your work! For 3., I wonder whether the lottery ticket hypothesis from pruning also holds in the Laplace setting. However, the subnetwork I found may be too big to use a full Hessian. I don't know if it makes sense, but I think it's worth a try.
Could you elaborate on what exactly you mean when referring to the LTH in this context? That the diagonal Hessian might perform as well as the full Hessian on certain subnetworks?
Yes. That's my guess.
I see. It's an interesting thought, but I think there is some empirical evidence that capturing correlations is generally favourable over a diagonal approximation. E.g. in our subnetwork inference paper, we showed that estimating a full Hessian over just a small subnetwork can outperform diagonal Laplace over the full model (see e.g. Fig. 4). But there might also be cases / subnetworks where a diagonal posterior is as good as a full posterior, not sure.
I see, thanks for your information!
I'll close this issue for now (as PR #87 should address the concerns raised once merged in) -- feel free to re-open (or open another issue) if anything else arises, and thanks again for your interest in our library @FrederikWarburg and @Phoveran!
Hi!
Thanks for the cool new subnetwork feature! I have some comments and questions.
1) Bug in check for subnetwork indices
Running on GPU, this fails: there is a check for torch.LongTensor, but on GPU the indices are of type torch.cuda.LongTensor, so the check rejects them and the program fails. A possible fix in utils/subnetwork, line 94, would be to change the check to something like the sketch below:
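A minimal sketch of the idea (the function name and error message are placeholders, and I'm assuming the existing check is an isinstance test against torch.LongTensor):

```python
import torch

def check_subnetwork_indices(subnetwork_indices):
    # Device-agnostic check: instead of isinstance(subnetwork_indices, torch.LongTensor),
    # which is False for torch.cuda.LongTensor, test the dtype, which is
    # torch.long (int64) on both CPU and GPU.
    if not (isinstance(subnetwork_indices, torch.Tensor)
            and subnetwork_indices.dtype == torch.long
            and subnetwork_indices.dim() == 1
            and len(subnetwork_indices) > 0):
        raise ValueError('Subnetwork indices must be a non-empty 1D tensor of dtype torch.long.')
```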
2) Hyper optimizer
In your standard example, you first have trainer.fit() followed by hyperparameter optimisation like this:
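Roughly like this (a sketch along the lines of the library's README example; model, train_loader, the learning rate, and the number of steps are placeholders):

```python
import torch
from laplace import Laplace

# Post-hoc Laplace approximation over all weights after regular training.
la = Laplace(model, 'regression', subset_of_weights='all', hessian_structure='full')
la.fit(train_loader)

# Tune prior precision and observation noise by maximising the marginal likelihood.
log_prior, log_sigma = torch.ones(1, requires_grad=True), torch.ones(1, requires_grad=True)
hyper_optimizer = torch.optim.Adam([log_prior, log_sigma], lr=1e-1)
for _ in range(100):
    hyper_optimizer.zero_grad()
    neg_marglik = -la.log_marginal_likelihood(log_prior.exp(), log_sigma.exp())
    neg_marglik.backward()
    hyper_optimizer.step()
```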
Can you provide an example of how you would do something similar with the subnetwork Laplace? When I try naively (roughly as in the sketch below), I get a dimension error.
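The naive attempt looks roughly like this (a sketch; the subnetwork indices are placeholders for the ones produced by my own selection step):

```python
import torch
from laplace import Laplace

# Placeholder indices into the flattened parameter vector.
subnetwork_indices = torch.LongTensor([0, 1, 2, 3])

la = Laplace(model, 'regression',
             subset_of_weights='subnetwork',
             hessian_structure='full',
             subnetwork_indices=subnetwork_indices)
la.fit(train_loader)
# ...followed by the same hyper_optimizer loop over log_prior / log_sigma as above,
# which is where the dimension error shows up.
```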
Let me know if I should provide more details on network etc.
3) DiagLaplace for subnetwork
I would like to use the diagonal Hessian structure for the subnetwork. Could you give me some pointers on how to do this? If I understand correctly, I cannot just do something like the sketch below, as that will also account for the correlations between parameters (the off-diagonal elements). What would be the best way to combine subnetwork Laplace and laplace.DiagLaplace?
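For concreteness (a sketch, reusing the placeholder names from above):

```python
from laplace import Laplace

# Full Hessian over the subnetwork: this runs, but it also models
# correlations between the selected parameters, which is not what I want.
la = Laplace(model, 'regression',
             subset_of_weights='subnetwork',
             hessian_structure='full',
             subnetwork_indices=subnetwork_indices)

# What I would like is the equivalent of hessian_structure='diag' restricted
# to the subnetwork, but that combination doesn't seem to be supported.
```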
4) Strange warning:
If I code my network like this:
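Roughly like this (a sketch; the exact layers are placeholders, and the relevant part is that the model is a custom nn.Module subclass defined inside a get_model() function):

```python
import torch
import torch.nn as nn

def get_model():
    # A custom nn.Module subclass defined inside a factory function
    # (this is what shows up as __main__.get_model.<locals>.Model below).
    class Model(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(784, 128)
            self.fc2 = nn.Linear(128, 10)

        def forward(self, x):
            return self.fc2(torch.relu(self.fc1(x)))

    return Model()
```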
I get the following warning:
UserWarning: Extension saving to grad_batch does not have an extension for Module <class '__main__.get_model.<locals>.Model'> although the module has parameters
However, if I code the network like this:
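Roughly like this instead (the same sketch expressed as a plain nn.Sequential):

```python
import torch.nn as nn

def get_model():
    # Same layers, but wrapped in an nn.Sequential container.
    return nn.Sequential(
        nn.Linear(784, 128),
        nn.ReLU(),
        nn.Linear(128, 10),
    )
```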
I do not get any warnings. Do you know what the warning means, and whether I should be careful with the first implementation?