Open · JohnMBrandt opened this issue 1 week ago
Hi,
Thank you for the code pointer! If the added head is randomly initialized, you could set the corresponding 'pre' entry (params_anchor) to zero tensors; in that case, AdamSPD reduces to a learnable normal weight decay. Keep me posted on your results and discoveries. I am interested in seeing how well this method generalizes to different settings.
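For example, a minimal sketch of that setup (assuming AdamSPD accepts standard PyTorch param groups and reads the anchor tensors from the 'pre' key as above; the toy model, import path, and hyperparameters below are placeholders, not the repo's actual API):

```python
import torch
import torch.nn as nn
# from spd_repo import AdamSPD  # placeholder import path; point this at the repo's optimizer

# Toy stand-in for a pretrained backbone with a freshly attached head.
model = nn.ModuleDict({
    'backbone': nn.Linear(16, 16),  # pretend these weights are pretrained
    'head': nn.Linear(16, 4),       # randomly initialized head
})

backbone_params = [p for n, p in model.named_parameters() if n.startswith('backbone')]
head_params = [p for n, p in model.named_parameters() if not n.startswith('backbone')]

param_groups = [
    # Backbone: the anchor ('pre') is a copy of the pretrained weights,
    # so SPD decays the fine-tuned weights toward them.
    {'params': backbone_params,
     'pre': [p.detach().clone() for p in backbone_params]},
    # Head: zero anchors, so SPD reduces to ordinary weight decay toward zero.
    {'params': head_params,
     'pre': [torch.zeros_like(p) for p in head_params]},
]

# optimizer = AdamSPD(param_groups, lr=1e-4)
```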
Best,
Thank you so much for this research. I've wondered for a long time whether weight decay was leading to suboptimal results when fine-tuning transformers.
I work on fine-tuning vision transformers, mostly within MMDetection and MMSegmentation, and have successfully ported this work to those toolkits.
I was wondering, though, how you would suggest applying the optimizer when attaching a new head to a pretrained backbone. Your work suggests that only a few layers need to be adjusted, but the entire head has to be adjusted. Is there a way to use normal AdamW with constant weight decay on the head and AdamSPD with variable weight decay on the backbone? Or does it matter?
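For what it's worth, one literal version of that split (sketched purely as an illustration, reusing the toy `model` layout from the sketch above; the learning rates and the AdamSPD import path are placeholders) would be to keep two optimizers and step both:

```python
from torch.optim import AdamW
# from spd_repo import AdamSPD  # placeholder import; point this at the repo's optimizer

# Split parameters by name.
head_params = [p for n, p in model.named_parameters() if 'backbone' not in n]
backbone_params = [p for n, p in model.named_parameters() if 'backbone' in n]

# Plain AdamW with constant weight decay on the randomly initialized head.
head_opt = AdamW(head_params, lr=1e-4, weight_decay=0.05)

# AdamSPD on the backbone, anchored to a copy of the pretrained weights.
# backbone_opt = AdamSPD(
#     [{'params': backbone_params,
#       'pre': [p.detach().clone() for p in backbone_params]}],
#     lr=1e-5,
# )

# In the training loop, call step() and zero_grad() on both optimizers
# after each backward pass.
```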
My approach has been to:

1. Modify 'params' to carry the parameter name, and
2. Selectively apply SPD only if the parameter name includes 'backbone' (see the sketch after this list).
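Roughly, that selection looks like this (the helper name and the 'name' key are placeholders of mine; the 'pre' anchor convention follows the reply above):

```python
import torch

def build_spd_param_groups(model):
    """Group parameters by name so SPD decay is only anchored for the backbone."""
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if 'backbone' in name:
            # Backbone weights: anchor ('pre') is a copy of the pretrained values,
            # so SPD decays the fine-tuned weights toward them.
            anchor = [param.detach().clone()]
        else:
            # Head weights: zero anchor, which reduces SPD to plain weight decay.
            anchor = [torch.zeros_like(param)]
        groups.append({'params': [param], 'name': name, 'pre': anchor})
    return groups

# optimizer = AdamSPD(build_spd_param_groups(model), lr=1e-4)
```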
Also, if it's of interest, here is the MMEngine constructor that ports this to MMSegmentation + MMDetection: