NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tang | Review: WRNs -- Wide Residual Networks (Image Classification). #105


NorbertZheng commented 1 year ago

Sik-Ho Tang. Review: WRNs — Wide Residual Networks (Image Classification).

NorbertZheng commented 1 year ago

Overview

This time, WRNs (Wide Residual Networks) are presented. By widening the Residual Network (ResNet), the network can be made shallower while matching or even improving accuracy. A shallower network means fewer layers and a shorter training time.

A better way of applying dropout is also investigated. This is a 2016 BMVC paper with more than 700 citations. Although it was published in 2016, the authors kept updating it as late as June 2017.

Figure: Various ResNet Blocks.

NorbertZheng commented 1 year ago

Problems on Residual Network (ResNet)

Circuit Complexity Theory

The circuit complexity theory literature shows that shallow circuits can require exponentially more components than deeper circuits.

Therefore, the authors of residual networks tried to make them as thin as possible in favor of increasing their depth and having fewer parameters, and even introduced a "bottleneck" block which makes ResNet blocks even thinner.
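For context, the "bottleneck" block keeps ResNet thin by reducing the channel count with a 1×1 convolution, applying the 3×3 convolution in that reduced space, and expanding back with another 1×1 convolution. A minimal PyTorch sketch (the reduction factor of 4 follows the original ResNet convention; the class name is just for illustration):

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Thin "bottleneck" ResNet block: 1x1 reduce -> 3x3 -> 1x1 expand."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction  # the thin middle of the block
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))  # identity shortcut around the thin body
```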

Diminishing Feature Reuse

However, as the gradient flows through the network, there is nothing forcing it to go through the residual block weights, so a block can avoid learning anything useful during training. As a result, it is possible that only a few blocks learn useful representations, or that many blocks share very little information and contribute little to the final goal. This problem is called diminishing feature reuse.

NorbertZheng commented 1 year ago

WRNs (Wide Residual Networks)

In WRNs, several design parameters are examined, such as the type of ResNet block, how deep the block is (deepening factor $l$, the number of convolutions per block), and how wide it is (widening factor $k$).

When $k=1$, the network has the same width as the original ResNet; when $k>1$, it is $k$ times wider than ResNet.

WRN-d-k denotes a WRN with depth $d$ and widening factor $k$.
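As a concrete reading of this naming, here is a minimal sketch, assuming the CIFAR-style configuration from the paper, where a WRN built from B(3,3) blocks has depth $d = 6n + 4$ with $n$ blocks per group, and per-group widths $16k$, $32k$, $64k$ on top of a 16-channel stem (the function name is just for illustration):

```python
def wrn_config(depth: int, k: int):
    """Per-group block count and channel widths of WRN-depth-k
    (CIFAR-style WRN built from B(3,3) blocks)."""
    assert (depth - 4) % 6 == 0, "depth must be of the form 6n + 4"
    n = (depth - 4) // 6                   # residual blocks per group
    widths = [16, 16 * k, 32 * k, 64 * k]  # stem width + three group widths
    return n, widths

# Example: WRN-28-10 -> 4 blocks per group, widths [16, 160, 320, 640].
print(wrn_config(28, 10))
```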

NorbertZheng commented 1 year ago

The design of ResNet block

Figure: WRN-d-2 (k=2), error rate (%) on the CIFAR-10 dataset.

B(3,3) has the smallest error rate (5.73%).

Note: the depths (numbers of layers) differ in order to keep the number of parameters close across block types.
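In the paper's notation, B(k1, k2, ...) lists the kernel sizes of the convolutions inside one residual block, e.g. B(3,1) is a 3×3 convolution followed by a 1×1 convolution. A minimal sketch of building such a block body from a kernel-size list, assuming the pre-activation (BN-ReLU-conv) ordering used by WRNs (the helper name is my own, not the authors' code):

```python
import torch.nn as nn

def make_block_body(channels: int, kernels=(3, 3)):
    """Convolutional body of a residual block B(k1, k2, ...):
    each entry of `kernels` is the kernel size of one convolution."""
    layers = []
    for ksz in kernels:
        layers += [
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=ksz,
                      padding=ksz // 2, bias=False),  # 'same' padding for odd kernels
        ]
    return nn.Sequential(*layers)

# B(3,3), B(3,1), B(1,3,1), ... as in the comparison above.
body = make_block_body(160, kernels=(3, 3))
```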

NorbertZheng commented 1 year ago

Number of Convolutional Layers Within ResNet block

Figure: WRN-40-2 with different $l$, error rate (%) on the CIFAR-10 dataset.

Two 3×3 convolutions, i.e. B(3,3), gives the smallest error rate. Because all networks are kept at roughly the same number of parameters, increasing the number of convolutions per block ($l$) leaves room for fewer residual blocks (and thus fewer residual connections), which degrades performance.

Thus, B(3,3) is chosen and used in the following experiments.

NorbertZheng commented 1 year ago

Width of ResNet Blocks

Figure: Different widths ($k$) and depths on CIFAR-10 and CIFAR-100.

NorbertZheng commented 1 year ago

Results

CIFAR-10 & CIFAR-100

Figure: Results on CIFAR-10 & CIFAR-100.

NorbertZheng commented 1 year ago

Dropout

Figure: Dropout in the original ResNet (left) and dropout in WRNs (right).

Figure: Dropout is better.
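As the figure shows, dropout is placed inside the residual block, between the convolutions and after ReLU, rather than on the identity path. A minimal sketch of a wide B(3,3) block with this placement (a reimplementation for illustration, not the authors' code; 0.3 is the dropout rate the paper uses on CIFAR):

```python
import torch.nn as nn

class WideBasicBlock(nn.Module):
    """Pre-activation B(3,3) block with dropout between the two 3x3 convolutions."""
    def __init__(self, channels: int, dropout: float = 0.3):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Dropout(p=dropout),  # dropout between the convolutions, after ReLU
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.body(x)  # identity shortcut stays dropout-free
```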

NorbertZheng commented 1 year ago

ImageNet & COCO

Figure: Single-crop, single-model validation error on ImageNet.

Figure: Results on COCO.

NorbertZheng commented 1 year ago

Training Time

Figure: Training time per batch (batch size 32) on CIFAR-10.

NorbertZheng commented 1 year ago

Training takes a lot of time; it can take days or even weeks. As training sets grow larger and larger, a better way to train is needed. Indeed, much recent research still focuses on reducing the training time or the number of epochs required for training.

WRNs reduce the training time, but at the expense of an increased number of parameters due to the widening of the network.

NorbertZheng commented 1 year ago

References