zaccharieramzi opened this issue 2 years ago
The highest prio to me is:
@tomMoral @pierreablin wdyt?
RE LR scheduling: there is a big difference in how TF and PL implement it. Basically, TF implements it at the per-optimizer-step level, while PL implements it at the per-epoch level (see here), which is similar to what is done here or in timm.
I might just go with the PL way of doing it, since it's the least flexible one and therefore the easiest to reproduce in both frameworks.
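To make the difference concrete, here is a rough sketch (hypothetical code, not our solvers) of the two granularities: a TF schedule object is queried at every optimizer step, while a typical PyTorch/PL setup steps the scheduler once per epoch.

```python
import tensorflow as tf
import torch

# TF: the schedule is a callable queried at every single optimizer step.
tf_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-1, decay_steps=10_000)
tf_optimizer = tf.keras.optimizers.SGD(learning_rate=tf_schedule)

# PyTorch / PL: the scheduler is stepped manually, by default once per epoch.
model = torch.nn.Linear(10, 1)
torch_optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
torch_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(torch_optimizer, T_max=90)

for epoch in range(90):
    # ... run all the optimizer steps for this epoch at a constant LR ...
    torch_scheduler.step()  # the LR only changes here, i.e. once per epoch
```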
RE LR scheduling / weight decay: I am not sure what the canonical way is of updating the weight decay when the LR is scheduled.
In TF, it is specified that it should be updated, and in this case manually:

> Note: when applying a decay to the learning rate, be sure to manually apply the decay to the `weight_decay` as well. For example:
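The example given in the `tfa.optimizers.AdamW` docs is along these lines (paraphrased here, values are illustrative): the same schedule multiplies both the LR and the weight decay, so the decay stays proportional to the LR throughout training.

```python
import tensorflow as tf
import tensorflow_addons as tfa

step = tf.Variable(0, trainable=False)
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    [10_000, 15_000], [1e-0, 1e-1, 1e-2])

# Both quantities are passed as callables so that they are re-evaluated at
# every step and follow the same schedule.
lr = lambda: 1e-1 * schedule(step)
wd = lambda: 1e-4 * schedule(step)

optimizer = tfa.optimizers.AdamW(learning_rate=lr, weight_decay=wd)
```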
But in PL or plain torch, I didn't see any mention of this update, so it might not be done there. I am going to verify this.
OK so the problem with WD is actually the following, which I understood by reading the original decoupled weight decay paper as well as the docs of Adam and AdamW in PyTorch. There are 2 ways of applying the weight decay: coupled, where the decay term is added to the gradient like a classical L2 penalty (this is what Adam's `weight_decay` argument does), and decoupled, where the weights are decayed directly, independently of the gradient-based update (this is AdamW).
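To spell it out with a toy example (hypothetical code, a simplified Adam-like step with bias correction omitted), the difference is whether the decay term goes through the adaptive rescaling or not:

```python
import numpy as np

def adam_direction(grad, m, v, beta1=0.9, beta2=0.999, eps=1e-8):
    # Simplified Adam direction (bias correction omitted for brevity).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    return m / (np.sqrt(v) + eps), m, v

def step_coupled(w, grad, m, v, lr, wd):
    # Coupled ("L2 regularization", what torch.optim.Adam's weight_decay does):
    # the decay term is folded into the gradient, so it gets rescaled by the
    # adaptive moments along with the rest of the gradient.
    d, m, v = adam_direction(grad + wd * w, m, v)
    return w - lr * d, m, v

def step_decoupled(w, grad, m, v, lr, wd):
    # Decoupled (AdamW): the decay is applied directly to the weights,
    # independently of the adaptive rescaling of the gradient.
    d, m, v = adam_direction(grad, m, v)
    return w - lr * d - lr * wd * w, m, v
```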
I will name both types explicitly in the solvers. We can still have coupled weight decay for both PyTorch and TensorFlow, but for TensorFlow the problem is that we need to hack it in a bit of an ugly way... I will make a proposal and we will see.
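For reference, one way of getting coupled weight decay in TF/Keras on an already-built model (this is just a sketch of what I have in mind, the helper name is made up) is to add the L2 penalty terms as extra losses:

```python
import tensorflow as tf

def add_coupled_weight_decay(model, wd):
    # Hypothetical helper: add wd * ||W||^2 / 2 to the training loss for every
    # kernel of the model (tf.nn.l2_loss includes the 1/2 factor), which is the
    # classical coupled / L2-regularization formulation.
    for layer in model.layers:
        if hasattr(layer, "kernel"):
            # layer=layer captures the current layer in the loop, otherwise all
            # the lambdas would point to the last layer.
            layer.add_loss(lambda layer=layer: wd * tf.nn.l2_loss(layer.kernel))
    return model
```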
Data augmentation:
Regularization:
Learning rate:
Modeling (to me these ones are out of our scope):
Other: