Our team built a network with 80.04% accuracy on CIFAR-100, with a Parameter Storage score of 0.0030 and a Math Operations score of 0.0028, achieving a MicroNet Challenge score of 0.0058.
The figure below shows our proposed architecture for the CIFAR-100 dataset. The numbers above the arrows give the shape of each input and output.
Our architecture consists of:
The details of Stem_Conv, Head_Conv, MBConvBlock, and Early Exiting Module are described below the 'Main network'.
Config/reproduce.json
Figure 1
First, we search for a baseline architecture suitable for the CIFAR-100 dataset, based on the EfficientNet architecture, using AutoML. The search process is as follows:
Block arguments search: In this step, we search the number of MBConvBlocks and, for each block, the kernel size (k), stride (s), expansion ratio (e), input channels (i), output channels (o), and squeeze-and-excitation ratio (se). From the results, we find that the position of the resolution-reducing convolutional layer, i.e., the convolutional layer with a stride of 2, strongly affects accuracy. Based on this observation and several manual experiments, we chose the architecture above.
Scaling coefficients search: In this step, after the block arguments are decided, we search three coefficients under a given resource budget: width, depth, and resolution. In practice, we fix the depth coefficient at 1, since even a slight change to it worsens the score. With depth fixed, a resolution coefficient is sampled randomly within a range determined by the available resources, and the width coefficient is then calculated as [available resources / resolution coefficient^2]. From the results of this search, we find that, in our setting, a large resolution coefficient improves performance more than a large width coefficient. As a result, with available resources set to 2, we obtain a resolution coefficient of 1.4. Finally, to lighten the model, we set the width coefficient to 0.9 and apply these coefficients to the model obtained from the block arguments search.
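The arithmetic of this step can be sketched as follows. This is only an illustration of the rule stated above (width = available resources / resolution coefficient^2 with depth fixed at 1), not the search code itself; the function name `width_from_resolution` is hypothetical.

```python
def width_from_resolution(available_resources: float, resolution_coef: float) -> float:
    """Width coefficient implied by the stated rule, with depth fixed at 1."""
    return available_resources / resolution_coef ** 2

# Values quoted in the text: available resources = 2, resolution coefficient = 1.4.
implied_width = width_from_resolution(2.0, 1.4)
print(round(implied_width, 3))  # 1.02

# The final model uses a smaller width coefficient (0.9) to lighten the network,
# so under the same rule it spends less than the full budget of 2.
used_resources = 0.9 * 1.4 ** 2
print(round(used_resources, 3))  # 1.764
```

Under this rule, choosing 0.9 instead of the implied value of about 1.02 corresponds to a resource budget of roughly 1.76 rather than 2.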
AutoAugment except Cutout and SamplePairing. Please refer to AutoML_autoaug.py for the process and data_utils/autoaugment.py for the policy we obtained.

After training the main network, we apply a layer-wise normalized magnitude-based iterative pruning method.
We prune 64% of all weights in the following steps:
Although general CNN models use a static computational graph for the whole dataset, the computational cost each sample actually needs can differ.
By letting certain samples exit earlier, considerable FLOPs can be saved without significant accuracy degradation.
To realize this idea, we designed an early exiting module in the following steps:
early exiting module at this position and trained it.
threshold confidence to decide when the samples exit. If the maximum class probability from the early exiting module is greater than the threshold confidence, the sample exits and does not go to the end of the main network.
early exiting module.
* soft smoothing loss function
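The threshold rule described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the repository's implementation; the function names `softmax` and `should_exit` and the toy logits are made up for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def should_exit(exit_logits, threshold=0.85):
    """Return (exits_early, predicted_class) for one sample.

    The sample exits when the largest class probability produced by the
    early exiting module is greater than the threshold confidence.
    """
    probs = softmax(exit_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return probs[best] > threshold, best

# Toy 3-class logits (hypothetical values).
confident = should_exit([6.0, 1.0, 0.5])  # peaked distribution: exits early
uncertain = should_exit([1.2, 1.0, 0.9])  # flat distribution: continues to the end
print(confident, uncertain)
```

With a threshold of 0.0 every sample exits, which corresponds to the accuracy reported in parentheses in the table below.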
The table below shows the trade-off at each position when the threshold confidence is 0.85, 0.88, and 0.9. (These results are obtained from a 60% pruned model with varying early exiting positions.)
Exiting Position | Exiting Path FLOPs* | Added Param # | Early Exiting Ratio | Accuracy** | Score |
---|---|---|---|---|---|
MBConvBlock[2] | 25,512,831 (21.19%) | 19,500 | 20.43%, 18.05%, 16.25% | 79.94%, 80.03%, 80.09% (55.81%) | 0.006371, 0.006437, 0.006487 |
MBConvBlock[3] | 35,610,263 (29.58%) | 19,500 | 22.92%, 20.24%, 18.33% | 80.26%, 80.35%, 80.38% (58.66%) | 0.006371, 0.006437, 0.006485 |
MBConvBlock[4] | 47,043,295 (38.91%) | 20,660 | 36.31%, 33.27%, 31.11% | 79.94%, 80.11%, 80.19% (64.62%) | 0.006208, 0.006274, 0.006321 |
MBConvBlock[5] | 64,294,963 (52.72%) | 22,980 | 50.42%, 47.05%, 44.92% | 79.96%, 80.00%, 80.14% (69.85%) | 0.006249, 0.006306, 0.006342 |
From the above results, we chose MBConvBlock[4] as the exiting position and applied it to the final pruned main network.
* Exiting Path FLOPs is the number of math operations for a sample that exits in the middle. The percentage in parentheses is Exiting Path FLOPs / Total Path FLOPs.
e.g.) If the exiting point is MBConvBlock[2],
** The accuracy in parentheses is the accuracy with threshold confidence of 0.0. (i.e., all samples exit via the early exiting module.)
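To see how the exiting ratio translates into saved computation, the sketch below averages the two path costs for MBConvBlock[4]. It uses the dense 32-bit counts from the tables in this document (47,043,295 for the exiting path and 120,903,296 for the full path including the early exiting module) and ignores sparsity and 16-bit scoring, so it is an approximation rather than the challenge score. The function name `expected_ops` is hypothetical.

```python
EXIT_PATH_OPS = 47_043_295    # MBConvBlock[4] row, "Exiting Path FLOPs"
TOTAL_PATH_OPS = 120_903_296  # full path including the early exiting module

def expected_ops(exit_ratio: float) -> float:
    """Average per-sample operation count when a fraction exit_ratio exits early."""
    return exit_ratio * EXIT_PATH_OPS + (1.0 - exit_ratio) * TOTAL_PATH_OPS

# Early exiting ratios reported at MBConvBlock[4] for thresholds 0.85, 0.88, 0.9.
for threshold, ratio in [(0.85, 0.3631), (0.88, 0.3327), (0.90, 0.3111)]:
    ops = expected_ops(ratio)
    print(f"threshold={threshold}: {ops:,.0f} ops "
          f"({ops / TOTAL_PATH_OPS:.1%} of the full path)")
```

At the 0.85 threshold this gives an average of about 78% of the full-path operations. The exiting ratios were measured on the 60% pruned model, so the figures are indicative only.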
Finally, we add the early exiting module at the MBConvBlock[4] position of the 64% pruned model.
We also pruned that module to 50% sparsity and confirmed that there is no accuracy drop at the 0.85 confidence level.
* Batchnorm Stabilization Phenomenon
When an early exiting module is trained, the main network is frozen, but the batchnorm buffers (running mean and running variance) in the main network are still updated during this training.
We observed that this increases the accuracy of the main network. We call this phenomenon batchnorm stabilization.
We conjecture that training methods such as mixup make the input distribution differ from the test data distribution.
Updating the batchnorm buffers with inputs that are not mixed up therefore seems to have a stabilizing effect on them.
This phenomenon was reproduced in most of our experiments, and it did not help at all when mixup was not used to train the main network.
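The conjecture can be illustrated with a toy simulation. The sketch below is not the training code: it models a single feature as unit-variance Gaussian noise, applies mixup with one lambda per batch drawn from Beta(1, 1), and tracks a running variance with an exponential moving average (momentum 0.1). All of these choices, and every function name, are assumptions made for illustration.

```python
import random
import statistics

random.seed(0)

def clean_batch(n=256):
    """Stand-in for one feature of un-mixed inputs: zero mean, unit variance."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def mixup_batch(n=256, alpha=1.0):
    """Mixup of two clean batches with a single lambda ~ Beta(alpha, alpha)."""
    lam = random.betavariate(alpha, alpha)
    a, b = clean_batch(n), clean_batch(n)
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]

def update_running_var(running_var, batch, momentum=0.1):
    """Exponential-moving-average update of a batchnorm-style running variance."""
    return (1.0 - momentum) * running_var + momentum * statistics.variance(batch)

# Phase 1: the running variance is accumulated on mixup inputs.
running_var = 1.0
for _ in range(200):
    running_var = update_running_var(running_var, mixup_batch())
var_after_mixup = running_var

# Phase 2: no weights change; only the buffer is refreshed on clean inputs.
for _ in range(200):
    running_var = update_running_var(running_var, clean_batch())
var_after_clean = running_var

print(f"after mixup: {var_after_mixup:.2f}, after clean inputs: {var_after_clean:.2f}")
```

In this toy setting the buffer accumulated under mixup underestimates the variance of clean inputs, and refreshing it on clean inputs moves it back toward the value seen at test time, which is consistent with the explanation above.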
The table below lists the number of parameters and operations of our model on a 32-bit basis, counted by hand. Batchnorm parameters and operations are not counted in this table. However, when computing the score, we count Batchnorm as the bias of the preceding convolution layer.
Our score is calculated with 16-bit inputs and parameters and a 32-bit accumulator.
Input | Operator | k | s | e | i | o | se | Parameter Storage | MULTI | ADD | Math Operations |
---|---|---|---|---|---|---|---|---|---|---|---|
32*32*3 | Upsample(nearest) | - | - | - | - | - | - | 0 | 11,907 | 0 | 11,907 |
63*63*3 | Stem_Conv2d | 3 | 2 | - | 3 | 24 | - | 648 | 691,920 | 622,728 | 1,314,648 |
31*31*24 | MBConvBlock[0] | 3 | 1 | 1 | 24 | 16 | 0.20 | 820 | 669,132 | 584,484 | 1,253,616 |
31*31*16 | MBConvBlock[1] | 3 | 1 | 6 | 16 | 24 | 0.20 | 5,379 | 5,167,209 | 4,590,315 | 9,757,524 |
31*31*24 | MBConvBlock[2] | 3 | 2 | 6 | 24 | 40 | 0.20 | 11,812 | 5,455,164 | 4,933,372 | 10,388,536 |
15*15*40 | MBConvBlock[3] | 3 | 1 | 6 | 40 | 40 | 0.20 | 25,448 | 5,188,584 | 4,908,848 | 10,097,432 |
15*15*40 | MBConvBlock[4] | 3 | 1 | 6 | 40 | 48 | 0.20 | 27,368 | 5,620,584 | 5,285,048 | 10,905,632 |
15*15*48 | MBConvBlock[5] | 3 | 1 | 6 | 48 | 64 | 0.20 | 40,329 | 8,300,475 | 7,896,393 | 16,196,868 |
15*15*64 | MBConvBlock[6] | 3 | 1 | 6 | 64 | 64 | 0.20 | 62,220 | 12,452,004 | 12,004,428 | 24,456,432 |
15*15*64 | MBConvBlock[7] | 3 | 2 | 6 | 64 | 80 | 0.20 | 68,364 | 7,549,092 | 7,228,348 | 14,777,440 |
7*7*80 | MBConvBlock[8] | 3 | 1 | 6 | 80 | 80 | 0.20 | 96,976 | 4,156,368 | 4,033,376 | 8,189,744 |
7*7*80 | MBConvBlock[9] | 3 | 1 | 6 | 80 | 96 | 0.20 | 104,456 | 4,532,688 | 4,385,392 | 8,918,080 |
7*7*96 | Head_Conv2d | 1 | 1 | - | 96 | 136 | - | 13,056 | 659,736 | 639,744 | 1,299,480 |
7*7*136 | AveragePool | 7 | - | - | - | - | - | 0 | 136 | 6,528 | 6,664 |
136 | FullyConnected | - | - | - | - | - | - | 13,700 | 13,600 | 13,600 | 27,200 |
100 | - | - | - | - | - | - | - | - | - | - | - |
Total | - | - | - | - | - | - | - | 470,776 | 60,456,692 | 57,132,604 | 117,589,296 |
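As a sanity check on how the plain convolution rows are counted, the sketch below reproduces the Parameter Storage and ADD entries for Stem_Conv2d and Head_Conv2d and the parameter count for FullyConnected. The helper names are hypothetical, and the "no padding" assumption for the stem is inferred from the 63 to 31 shape change in the table.

```python
def out_size(in_size, k, stride):
    """Output side length of a convolution with no padding (assumed)."""
    return (in_size - k) // stride + 1

def conv_counts(out_h, out_w, k, in_ch, out_ch):
    """Weights and additions of a bias-free k x k convolution."""
    params = k * k * in_ch * out_ch
    adds = out_h * out_w * out_ch * (k * k * in_ch)
    return params, adds

# Stem_Conv2d: 63x63x3 input, k=3, s=2, 24 output channels.
side = out_size(63, 3, 2)
stem = conv_counts(side, side, 3, 3, 24)
print(side, stem)   # 31 (648, 622728)

# Head_Conv2d: 7x7x96 input, k=1, s=1, 136 output channels.
head = conv_counts(7, 7, 1, 96, 136)
print(head)         # (13056, 639744)

# FullyConnected: 136 inputs, 100 classes, with bias.
fc_params = 136 * 100 + 100
print(fc_params)    # 13700
```

The MULTI column for these two convolution rows is larger than the ADD column by three multiplications per output element, which this sketch does not model.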
Input | Operator | Parameter Storage | MULTI | ADD | Math Operations |
---|---|---|---|---|---|
15*15*40 | Early Exiting | 20,660 | 1,717,436 | 1,596,564 | 3,314,000 |
Total | - | 491,436 | 62,174,128 | 58,729,168 | 120,903,296 |
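The two tables can be cross-checked against the early exiting results. The sketch below sums the Math Operations column from the input up to MBConvBlock[4], adds the early exiting module, and compares the result with the total; the variable names are chosen for this example only.

```python
# "Math Operations" column, from the input Upsample through MBConvBlock[4].
OPS_UP_TO_BLOCK4 = [
    11_907,      # Upsample(nearest)
    1_314_648,   # Stem_Conv2d
    1_253_616,   # MBConvBlock[0]
    9_757_524,   # MBConvBlock[1]
    10_388_536,  # MBConvBlock[2]
    10_097_432,  # MBConvBlock[3]
    10_905_632,  # MBConvBlock[4]
]
EARLY_EXIT_OPS = 3_314_000      # Early Exiting row
MAIN_NETWORK_OPS = 117_589_296  # Total of the main network table

exit_path = sum(OPS_UP_TO_BLOCK4) + EARLY_EXIT_OPS
total_path = MAIN_NETWORK_OPS + EARLY_EXIT_OPS

print(exit_path)                        # 47043295
print(total_path)                       # 120903296
print(f"{exit_path / total_path:.2%}")  # 38.91%
```

These values match the Exiting Path FLOPs entry for MBConvBlock[4] (47,043,295, 38.91%) and the combined total of 120,903,296.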
./reproduce.sh
# For reproducing, run bash file.

python main.py ./Config/test.json
# For testing our final checkpoint

@inproceedings{lee2020sipa,
title={SIPA: A simple framework for efficient networks},
author={Lee, Gihun and Bae, Sangmin and Oh, Jaehoon and Yun, Se-Young},
booktitle={2020 International Conference on Data Mining Workshops (ICDMW)},
pages={729--736},
year={2020},
organization={IEEE}
}
This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) [No. 2018-0-00278, Development of Big Data Edge Analytics SW Technology for Load Balancing and Active Timely Response].