facebookresearch / higher

higher is a pytorch library allowing users to obtain higher order gradients over losses spanning training loops rather than individual training steps.
Apache License 2.0

is the accumulation of gradient done right, where do we divide the accumulator by number of tasks? #92

Closed brando90 closed 3 years ago

brando90 commented 3 years ago

I was reading the maml example and saw this:

https://github.com/facebookresearch/higher/blob/941ae9f310994e728064f8edd7f17c57550e7c67/examples/maml-omniglot.py#L165

which seems right locally (since it is accumulating gradients), but it never divides by num_tasks. I think that might actually matter. In normal supervised learning it is usually not a big deal, because the step size controls that quantity via lr * B, where B is the batch size. But here the situation might be more subtle, because there is correlation between the tasks and thus the standard deviation of that estimate might be different (I need to check the details on this). I wanted to draw attention to it since it might be a subtle issue in the MAML example.

egrefen commented 3 years ago

In practice, you are right that you might want to divide through by the meta-minibatch size (number of tasks) if you want to interpret it as a Monte Carlo estimate of the batch meta-gradient. However, for a fixed meta-minibatch size, this doesn't really matter. The normalization term just implicitly becomes part of the learning rate.
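A toy sketch of this point (not higher's API; the per-task gradient values are made up) showing that, for plain SGD with a fixed meta-minibatch size, summing losses with a learning rate rescaled by 1/N gives exactly the same update as averaging losses with the original learning rate:

```python
# Hypothetical per-task meta-gradients for a meta-batch of N tasks.
per_task_grads = [0.5, -1.0, 2.0, 0.1]
N = len(per_task_grads)
lr = 0.1

summed = sum(per_task_grads)      # what accumulating qry_loss.backward() yields
update_sum = (lr / N) * summed    # sum-of-losses, learning rate rescaled by 1/N
update_mean = lr * (summed / N)   # mean-of-losses, original learning rate

assert abs(update_sum - update_mean) < 1e-12
print("identical updates:", update_sum)
```

This is why, for a fixed N, the 1/N factor just folds into the learning rate.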

egrefen commented 3 years ago

Sorry, to answer the second question from your title: if you want to do this "right", replace

qry_loss.backward()

with

(qry_loss / num_tasks).backward()

brando90 commented 3 years ago

I actually think it does matter for meta-learning but not for regular supervised learning. I ran out of time to make the maths formal; I will do it tomorrow.

brando90 commented 3 years ago

@egrefen got it. They are indeed not the same, even in standard machine learning with no meta-learning. Using the sum has the same mean as the average if you choose the right step size, but they do not have the same standard deviation (and that cannot be fixed by any step size, since a step size cannot equal both the batch size and its square root at the same time; this assumes the gradients are normally distributed, which might not always be true, but I have some work showing this tends to hold, otherwise I'd assume the analysis gets really complicated). In practice the summed version might even be better (due to the different std), but that is mainly an empirical question to check on the test set.
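A quick Monte Carlo sanity check of the variance part of this claim (pure Python; the i.i.d. Gaussian per-task gradients are an assumption of the sketch, not anything from higher): summing N gradients scales the variance up by roughly N, while averaging scales it down by roughly N.

```python
import random

random.seed(0)
N, trials, sigma = 8, 20_000, 1.0
sums, means = [], []
for _ in range(trials):
    g = [random.gauss(0.0, sigma) for _ in range(N)]  # i.i.d. per-task grads
    sums.append(sum(g))        # summed estimator
    means.append(sum(g) / N)   # averaged estimator

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(var(sums))   # roughly N * sigma**2 = 8
print(var(means))  # roughly sigma**2 / N = 0.125
```

So matching the means of the two update rules by rescaling the learning rate does not by itself tell you the two estimators behave identically in general.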

With respect to meta-learning, I have not completed the analysis, but things are usually worse there because the tasks are correlated (in few-shot learning like mini-imagenet the issue is even worse!), so the i.i.d. assumption does not hold when sampling tasks (i.e. sampling the meta-batch). The intuition is that the std of the noise might be even higher since the covariance terms get included. I empirically noticed this when my meta-test estimates were really, really noisy for some reason. I recommend dividing by num_tasks to avoid extra-noisy gradients. Thus your comment:

However, for a fixed meta-minibatch size, this doesn't really matter. The normalization term just implicitly becomes part of the learning rate.

requires more precision and is more nuanced than stated. It is subtle even for normal supervised learning.
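A toy illustration of the correlation point above (my assumption, not an analysis of mini-imagenet: each per-task gradient is a shared component plus independent noise, with correlation rho). Under that model, the variance of the averaged meta-gradient floors at rho * sigma^2 instead of shrinking like sigma^2 / N:

```python
import random

random.seed(0)
N, trials, sigma2, rho = 8, 20_000, 1.0, 0.5
means = []
for _ in range(trials):
    # One component shared by all tasks in the meta-batch, plus per-task noise.
    shared = random.gauss(0.0, (rho * sigma2) ** 0.5)
    g = [shared + random.gauss(0.0, ((1 - rho) * sigma2) ** 0.5)
         for _ in range(N)]
    means.append(sum(g) / N)

m = sum(means) / trials
var_of_means = sum((x - m) ** 2 for x in means) / trials
# Theory for this toy model: rho*sigma2 + (1 - rho)*sigma2/N = 0.5 + 0.0625
print(var_of_means)
```

With rho = 0, this recovers the usual sigma2 / N = 0.125; with correlated tasks it stays much larger, which is consistent with the noisier meta-gradient estimates described above.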