GMvandeVen / continual-learning

PyTorch implementation of various methods for continual learning (XdG, EWC, SI, LwF, FROMP, DGR, BI-R, ER, A-GEM, iCaRL, Generative Classifier) in three different scenarios.
MIT License
1.54k stars 310 forks

Grad in SI #14

Open ssydasheng opened 3 years ago

ssydasheng commented 3 years ago

Hi,

I have recently been reading your excellent continual-learning implementation, in particular the SI part. In the following line of code, you use p.grad, which is the gradient of the regularized loss. However, based on my understanding of SI, the gradient should be computed on the data loss alone, so that it measures how much each weight contributes to the fitting error of the present task. Am I wrong about this, or did I miss an important factor in your implementation? Thanks in advance for your clarification.

https://github.com/GMvandeVen/continual-learning/blob/d281967802396b146b2c30b6667369c6f2395472/train.py#L248
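For context, the two readings under discussion can be sketched as follows in PyTorch. All names here (si_training_step, data_loss_fn, si_penalty_fn, W, prev_params) are hypothetical illustrations, not code from either repository:

```python
import torch

def si_training_step(model, optimizer, data_loss_fn, si_penalty_fn,
                     W, prev_params, use_unregularized_grad=True):
    """One training step that also updates SI's running importance W.

    Hypothetical sketch: data_loss_fn / si_penalty_fn are assumed to map
    the model to scalar losses; W and prev_params are dicts keyed by
    parameter name.
    """
    optimizer.zero_grad()
    if use_unregularized_grad:
        # Reading A: importance uses the gradient of the data loss alone.
        data_loss_fn(model).backward()
        grads_for_W = {n: p.grad.clone() for n, p in model.named_parameters()}
        si_penalty_fn(model).backward()  # penalty still shapes the update
    else:
        # Reading B: importance uses the gradient of the full regularized
        # loss (what p.grad holds in the line of train.py linked above).
        (data_loss_fn(model) + si_penalty_fn(model)).backward()
        grads_for_W = {n: p.grad.clone() for n, p in model.named_parameters()}
    optimizer.step()

    # Path-integral accumulation: W += -g * (parameter change this step).
    for n, p in model.named_parameters():
        W[n] -= grads_for_W[n] * (p.detach() - prev_params[n])
        prev_params[n] = p.detach().clone()
```

Under reading A the parameter update is identical, only the gradient stored for the importance estimate differs.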

GMvandeVen commented 3 years ago

Hi, thanks for your interest in the code! You're correct that in the line you refer to, p.grad is the gradient of the regularized loss. But my understanding of SI is that it is this gradient that should be used for computing the parameter regularisation strengths. I think it's outside the scope of this issue to go into too much depth, but a paper I found useful in this regard (even though it is about EWC rather than SI) is this one: https://arxiv.org/abs/1712.03847. Hope this helps!

GMvandeVen commented 3 years ago

Actually, thinking a bit more about it, I'm not sure anymore how relevant the above paper is to this specific issue. Sorry. I quickly checked the original SI paper, and my impression is that its description is ambiguous as to whether they used the gradient of the regularized or the unregularized loss. (But please correct me if you can point to somewhere in that paper where this is specified!) Anyway, for this code I intended to follow the approach described in Eqs. 13-15 of this paper (https://www.nature.com/articles/s41467-020-17866-2). Whether or not that is the most appropriate approach is an interesting question; I'll need to think more about it. My guess is that in practice it will not matter too much, although that remains to be demonstrated.
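Either way, the running estimate only enters the regularizer at a task boundary, where it is normalized by the squared parameter change. A minimal sketch of that consolidation step, roughly following the omega update of Eqs. 13-15 (function name, dict layout, and the default xi value are assumptions, not the repository's actual code):

```python
import torch

def update_omega(model, W, omega, prev_task_params, xi=0.1):
    """At a task boundary, fold the running estimate W into omega.

    omega_new = omega_old + W / (delta^2 + xi), where delta is the total
    parameter change over the task and xi is a small damping constant.
    """
    for n, p in model.named_parameters():
        delta = p.detach() - prev_task_params[n]
        omega[n] = omega.get(n, torch.zeros_like(p)) + W[n] / (delta ** 2 + xi)
        W[n] = torch.zeros_like(p)                 # reset running estimate
        prev_task_params[n] = p.detach().clone()   # new anchor point
    return omega
```

The question in this thread only affects what has been accumulated into W before this step; the consolidation itself is the same under both readings.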

ssydasheng commented 3 years ago

Thank you very much for your detailed response. It seems that the Nature paper you referred to indeed uses the regularized loss. And I haven't found anywhere in the SI paper that specifies whether they used the regularized or the original loss. I tried to read the released code of the SI paper, but frankly I didn't manage to understand it well enough either; that's why I came to read your implementation.

That said, I noticed that their released code does use the original loss, in the following line: https://github.com/ganguli-lab/pathint/blob/master/pathint/optimizers.py#L123. Furthermore, unreg_grads is used in the following line: https://github.com/ganguli-lab/pathint/blob/master/pathint/protocols.py#L34. That's why I thought they used the original loss, though I am not sure about this.

I agree with you that in practice it might not matter too much. As for using the regularized loss, I think it would act similarly to using larger coefficients for earlier tasks: the gradient of the quadratic regularizer already carries the previous tasks' contributions (through omega), so accumulating that gradient into the new omega would count the previous tasks multiple times.

GMvandeVen commented 3 years ago

Interesting, thanks! When I have some time I might look more into this.