zaccharieramzi opened this issue 2 years ago
The highest prio to me is:
@tomMoral @pierreablin wdyt?
RE LR scheduling: there is a big difference in how TF and PL implement it. Basically, TF implements it at the per-optimizer-step level, while PL implements it at the per-epoch level (see here), which is similar to what is done here or in timm.
I might just go with the PL way of doing it, since it's the least flexible one and therefore the easiest to reproduce in both frameworks.
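To make the difference concrete, here is a rough sketch (hypothetical code, not our solvers) of the two granularities: a TF schedule object is queried at every optimizer step, while a typical PyTorch/PL setup steps the scheduler once per epoch.

```python
import tensorflow as tf
import torch

# TF: the schedule is a callable queried at every single optimizer step.
tf_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-1, decay_steps=10_000)
tf_optimizer = tf.keras.optimizers.SGD(learning_rate=tf_schedule)

# PyTorch / PL: the scheduler is stepped manually, by default once per epoch.
model = torch.nn.Linear(10, 1)
torch_optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
torch_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(torch_optimizer, T_max=90)

for epoch in range(90):
    # ... run all the optimizer steps for this epoch at a constant LR ...
    torch_scheduler.step()  # the LR only changes here, i.e. once per epoch
```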
RE LR scheduling / weight decay: I am not sure what the canonical way is of updating the weight decay when the LR is scheduled.
In TF, it is specified that it should be updated, and in this case manually:

> Note: when applying a decay to the learning rate, be sure to manually apply the decay to the `weight_decay` as well. For example:
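The example given in the `tfa.optimizers.AdamW` docs is along these lines (paraphrased here, values are illustrative): the same schedule multiplies both the LR and the weight decay, so the decay stays proportional to the LR throughout training.

```python
import tensorflow as tf
import tensorflow_addons as tfa

step = tf.Variable(0, trainable=False)
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    [10_000, 15_000], [1e-0, 1e-1, 1e-2])

# Both quantities are passed as callables so that they are re-evaluated at
# every step and follow the same schedule.
lr = lambda: 1e-1 * schedule(step)
wd = lambda: 1e-4 * schedule(step)

optimizer = tfa.optimizers.AdamW(learning_rate=lr, weight_decay=wd)
```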
But in PL or plain torch, I didn't see any mention of this update, so it might not be done there. I am going to verify this.
OK so the problem with WD is actually the following, which I understood by reading the original decoupled weight decay paper as well as the docs of Adam and AdamW in PyTorch. There are 2 ways of applying the weight decay: coupled, where the decay term is added to the gradient like a classical L2 penalty (this is what Adam's `weight_decay` argument does), and decoupled, where the weights are decayed directly, independently of the gradient-based update (this is AdamW).
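To spell it out with a toy example (hypothetical code, a simplified Adam-like step with bias correction omitted), the difference is whether the decay term goes through the adaptive rescaling or not:

```python
import numpy as np

def adam_direction(grad, m, v, beta1=0.9, beta2=0.999, eps=1e-8):
    # Simplified Adam direction (bias correction omitted for brevity).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    return m / (np.sqrt(v) + eps), m, v

def step_coupled(w, grad, m, v, lr, wd):
    # Coupled ("L2 regularization", what torch.optim.Adam's weight_decay does):
    # the decay term is folded into the gradient, so it gets rescaled by the
    # adaptive moments along with the rest of the gradient.
    d, m, v = adam_direction(grad + wd * w, m, v)
    return w - lr * d, m, v

def step_decoupled(w, grad, m, v, lr, wd):
    # Decoupled (AdamW): the decay is applied directly to the weights,
    # independently of the adaptive rescaling of the gradient.
    d, m, v = adam_direction(grad, m, v)
    return w - lr * d - lr * wd * w, m, v
```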
I will name both types explicitly in the solvers. We can still have coupled weight decay for both PyTorch and TensorFlow, but for TensorFlow the problem is that we need to hack it in a bit of an ugly way... I will make a proposal and we will see.
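For reference, one way of getting coupled weight decay in TF/Keras on an already-built model (this is just a sketch of what I have in mind, the helper name is made up) is to add the L2 penalty terms as extra losses:

```python
import tensorflow as tf

def add_coupled_weight_decay(model, wd):
    # Hypothetical helper: add wd * ||W||^2 / 2 to the training loss for every
    # kernel of the model (tf.nn.l2_loss includes the 1/2 factor), which is the
    # classical coupled / L2-regularization formulation.
    for layer in model.layers:
        if hasattr(layer, "kernel"):
            # layer=layer captures the current layer in the loop, otherwise all
            # the lambdas would point to the last layer.
            layer.add_loss(lambda layer=layer: wd * tf.nn.l2_loss(layer.kernel))
    return model
```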
Data augmentation:
Regularization:
Learning rate:
Modeling (to me these ones are out of our scope):
Other: