CompVis / metric-learning-divide-and-conquer

Source code for the paper "Divide and Conquer the Embedding Space for Metric Learning", CVPR 2019
GNU Lesser General Public License v3.0

freezing trainable parameters of other learners while training one separate learner #9

Closed andre1q closed 4 years ago

andre1q commented 4 years ago

Hi, thanks for the code. Can you please help me understand something: do we need to freeze the parameters of the other learners while training one separate learner, with something like the following?

```python
# freezing the other learners while training the current learner
for i, emb in enumerate(embeddings):
    if i != curr_learner:
        for name, params in emb.named_parameters():
            params.requires_grad = False

opt = optimizers[curr_learner]
out = embeddings[curr_learner](x)
out = torch.nn.functional.normalize(out, dim=1)
loss = criterion(out, y)
opt.zero_grad()
loss.backward()
opt.step()

# ... and then unfreezing the other learners
for i, emb in enumerate(embeddings):
    if i != curr_learner:
        for name, params in emb.named_parameters():
            params.requires_grad = True
```

I didn't find anything like that in your code, so can you please explain whether we really need something like this? Most likely I'm wrong, but I thought this is the only way to freeze the params. Thanks a lot!

melgor commented 4 years ago

Did you try your implementation? It looks like it should work. As for why this portion of code is missing: when we use only a portion of the embedding, the other parts do not get any gradient at all. And since only a single learner is active in each iteration (the dataloader provides examples from one cluster), there is no point in freezing the parameters.

Your example does raise an interesting point, though. Currently there is a single optimizer, and I am not sure what happens to the gradient history when just a single part of the embedding is updated: the mean and variance are updated with zero values. This sounds intriguing. A single optimizer like SGD with momentum may not work great here.

Overall, I think we do not need to freeze any learner, as it will not receive any gradient for data from a different cluster. However, splitting into multiple optimizers may be a good idea and is worth trying.
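For illustration, here is a minimal sketch of the "one optimizer per learner" idea. All names here (`embeddings`, `optimizers`, `criterion`, `curr_learner`) are placeholders mirroring the snippet above, not the repository's actual variables, and the loss is a stand-in:

```python
import torch

# Toy setup: one embedding head per learner and one Adam instance per head,
# so the parameters and optimizer moments of inactive learners are never touched.
embeddings = torch.nn.ModuleList([torch.nn.Linear(512, 32) for _ in range(4)])
optimizers = [torch.optim.Adam(emb.parameters(), lr=1e-4) for emb in embeddings]

def criterion(out, y):
    # Stand-in contrastive-style loss, purely for illustration:
    # pull together normalized embeddings that share a label.
    same = y.unsqueeze(0) == y.unsqueeze(1)
    dists = ((out.unsqueeze(0) - out.unsqueeze(1)) ** 2).sum(-1)
    return dists[same].mean()

def train_step(x, y, curr_learner):
    out = embeddings[curr_learner](x)
    out = torch.nn.functional.normalize(out, dim=1)
    loss = criterion(out, y)

    opt = optimizers[curr_learner]   # only the active learner's optimizer state is updated
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example call with random data for one cluster / learner.
train_step(torch.randn(8, 512), torch.randint(0, 4, (8,)), curr_learner=2)
```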

asanakoy commented 4 years ago

No need to freeze the non-active learners, since the gradient is not calculated for them.
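As a quick sanity check, a standalone toy (assuming separate modules per learner, which may or may not match the repository's actual layout): parameters that did not take part in the forward pass end up with `grad` equal to `None` after `backward()`, and PyTorch optimizers simply skip such parameters in `step()`:

```python
import torch

# Two toy "learners"; only learner 0 takes part in the forward pass,
# so learner 1 never receives a gradient.
learners = torch.nn.ModuleList([torch.nn.Linear(16, 8) for _ in range(2)])
opt = torch.optim.SGD(learners.parameters(), lr=0.1, momentum=0.9)

x = torch.randn(4, 16)
loss = learners[0](x).pow(2).mean()
loss.backward()

print(learners[0].weight.grad is None)  # False: the active learner has a gradient
print(learners[1].weight.grad is None)  # True: the inactive learner has no gradient at all

opt.step()  # parameters whose .grad is None are simply skipped by the optimizer
```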

Although, that is a good point about the moments in the optimizer. We use Adam, which calculates exponential moving averages of the first and second moments of the updates. I'm not sure whether the moments are updated with zeros for the inactive embedding dimensions; it depends on the implementation of the optimizer.
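The other case can be checked with a similar toy (again an assumed setup, not necessarily the repository's): if the learners are slices of a single weight tensor, the inactive slice receives an explicit zero gradient rather than none, so Adam still decays its moments and keeps nudging the slice using the momentum accumulated in earlier steps:

```python
import torch

# Toy layer whose output is split into two hypothetical 4-d "learners".
shared = torch.nn.Linear(8, 8)
opt = torch.optim.Adam(shared.parameters(), lr=1e-2)

# Step 1: train only the second learner, so its rows acquire non-zero Adam moments.
shared(torch.randn(2, 8))[:, 4:].sum().backward()
opt.step()
before = shared.weight[4:].detach().clone()

# Step 2: train only the first learner. The second learner's rows now receive an
# explicit zero gradient (not None), so Adam still updates them from the
# exponential moving averages accumulated in step 1.
opt.zero_grad()
shared(torch.randn(2, 8))[:, :4].sum().backward()
opt.step()

print(shared.weight.grad[4:].abs().max())            # tensor(0.): zero gradient for the inactive slice
print((shared.weight[4:] - before).abs().max() > 0)  # tensor(True): the slice moved anyway
```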

andre1q commented 4 years ago

> Did you try your implementation? It looks like it should work. As for why this portion of code is missing: when we use only a portion of the embedding, the other parts do not get any gradient at all. And since only a single learner is active in each iteration (the dataloader provides examples from one cluster), there is no point in freezing the parameters.
>
> Overall, I think we do not need to freeze any learner, as it will not receive any gradient for data from a different cluster. However, splitting into multiple optimizers may be a good idea and is worth trying.

Thanks a lot. :) No, I haven't tried my implementation yet. Maybe I will just try splitting the optimizer, since freezing is not necessary.

andre1q commented 4 years ago

> No need to freeze the non-active learners, since the gradient is not calculated for them.
>
> Although, that is a good point about the moments in the optimizer. We use Adam, which calculates exponential moving averages of the first and second moments of the updates. I'm not sure whether the moments are updated with zeros for the inactive embedding dimensions; it depends on the implementation of the optimizer.

Thanks a lot for your answer :)