MicroNet Challenge (Team: OSI AI)

Challenge Leaderboard

Our team built a network that reaches 80.04% accuracy on CIFAR-100 with a Parameter Storage score of 0.0030 and a Math Operations score of 0.0028, achieving a MicroNet Challenge score of 0.0058.

1. Overview

The figure below shows our proposed architecture for the CIFAR-100 dataset. The numbers above the arrows give the shape of each input and output.
Our architecture consists of:

Main Network
1. Upsample Layer
2. Stem_Conv
3. 10 * MobileNet V2 Convolution Block (MBConvBlock)
4. Head_Conv
5. Global Average Pooling
6. Fully Connected Layer
Early Exiting Module

The details of Stem_Conv, Head_Conv, MBConvBlock, and Early Exiting Module are described below the 'Main network'.

In addition, MBConvBlock[0] omits the first three layers of a block (Expansion_Conv, BatchNorm, Activation Function), since there is no expansion when e=1.

2. Our Approach Detail

2-0. Configuration (Please refer to `Config/reproduce.json`)

Data & Model precision
- 16 bits
Data
- Dataset: cifar-100
- Batch size: 128
- Train size/Valid size: 50000/0
- Augmentation: [random crop 32*32 with padding of 4, random horizontal flip(p=0.5), normalization] + (custom) auto augmentation for cifar-100 + Mixup
Model
- Architecture: See figure 1
- Activation function: swish (beta=1)
- Batch normalization: ghost batch normalization (splits=4)
- Optimizer: sgd (lr=0.13, weight_decay=1e-5, momentum=0.9)
- Loss function: cross entropy loss with label smoothing (smoothing factor=0.3)
- Learning rate scheduler: cosine annealing scheduler (T_max=1200, without restart)
- Epochs #: 1200
Pruning
- Pruning method(one shot/iterative): iterative
- Desired sparsity/Pruning ratio per iteration: 64% / 10%*5+2.5%*4+2%*2
- Epochs # per pruning iteration: 600
- Optimizer: sgd (lr=0.13, weight_decay=1e-5, momentum=0.9)
- Loss function: cross entropy loss with label smoothing (smoothing factor=0.3)
- Learning rate scheduler: cosine annealing scheduler (T_max=600, without restart)
- Weight reset: False
- Normalization: Layer-wise magnitude normalization
Early Exiting
- Epochs #: 800
- Optimizer: sgd (lr=0.13, weight_decay=1e-5, momentum=0.9)
- Loss function: soft smoothing loss
- Learning rate scheduler: cosine annealing scheduler (T_max=800, without restart)
- Augmentation: [random crop 32*32 with padding of 4, random horizontal flip(p=0.5), normalization]
Pruning Early Exiting
- Pruning method(one shot/iterative): iterative
- Desired sparsity/Pruning ratio per iteration: 50% / 10%*5
- Epochs # per pruning iteration: 600
- Optimizer: sgd (lr=0.13, weight_decay=1e-5, momentum=0.9)
- Loss function: cross entropy loss with label smoothing (smoothing factor=0.3)
- Learning rate scheduler: cosine annealing scheduler (T_max=600, without restart)
- Weight reset: False
- Normalization: Layer-wise magnitude normalization

2-1. Architecture Search

First of all, we search for a baseline architecture suitable for the CIFAR-100 dataset, based on the EfficientNet architecture, using AutoML. The search process is as follows:

Block arguments search: In this step, we search the number of MBConvBlocks and, for each block, the kernel size (k), stride (s), expansion ratio (e), input channels (i), output channels (o), and squeeze-and-excitation ratio (se). From the results, we find that the position of the resolution-reducing convolutional layers, i.e. those with a stride of 2, strongly affects accuracy. Based on this observation and several manual experiments, we chose the architecture above.
Scaling coefficients search: After the block arguments are fixed, we search three coefficients (width, depth, and resolution) under a given amount of available resources. In practice, we fix the depth coefficient at 1, since even a slight change to it worsens the score. The resolution coefficient is then sampled randomly within a range determined by the available resources, and the width coefficient is calculated as [available resources / resolution coefficient^2]. From the results, we find that in our setting a larger resolution coefficient improves performance more than a larger width coefficient. With available resources set to 2, we obtain a resolution coefficient of 1.4. Finally, to lighten the model, we set the width coefficient to 0.9 and apply these coefficients to the model obtained from the block arguments search.
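
As a small worked example of the rule width = available resources / resolution coefficient^2, the sketch below computes the width coefficient implied by a budget of 2 and a resolution coefficient of 1.4. The function name is hypothetical and is not taken from the repository.

```python
def width_coefficient(resources: float, resolution: float) -> float:
    """Width coefficient implied by a resource budget:
    width = resources / resolution**2."""
    return resources / resolution ** 2


implied_width = width_coefficient(resources=2.0, resolution=1.4)
print(round(implied_width, 4))  # 1.0204
```

The implied value is slightly above 1; the final model uses the smaller width coefficient of 0.9 to lighten the network.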

2-2. Techniques for Improvement

Auto augmentation: We search 25 sub-policies for the CIFAR-100 dataset based on the augmentation search space of AutoAugment, excluding Cutout and SamplePairing. Please refer to AutoML_autoaug.py for the search process and data_utils/autoaugment.py for the resulting policy.
Mixup: We apply Mixup after auto augmentation, with alpha = 1 as the hyperparameter of the beta distribution. We expect this augmentation to help exploration of the region between two arbitrary classes.
No bias decay: We do not apply the weight decay regularizer to biases. Since biases make up only a small fraction of all parameters, regularizing them can cause underfitting.
Swish activation function: We use the Swish activation function with beta = 1, i.e. x * sigmoid(x). This activation is usually interpreted as a self-gated activation.
Ghost batch normalization: Instead of plain batch normalization, we use ghost batch normalization, in which each batch is divided into four smaller ghost batches so that the split batch size is 32.
Label smoothing: We use label smoothing, assigning a probability of 0.7 to the correct label and 0.3/99 to each of the others.
Cosine annealing scheduler: We use a cosine annealing scheduler for the learning rate and set the period of one cycle to the number of epochs. Hence, there is no restart.
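
Several of the techniques above reduce to short formulas. The following is a minimal, framework-free sketch of Swish, the label-smoothing targets, the cosine annealing schedule, and Mixup, using the hyperparameters from the configuration. The function names, the minimum learning rate of 0, and the choice to mix target distributions (rather than losses) in Mixup are assumptions for illustration, not the repository's implementation.

```python
import math
import random


def swish(x: float, beta: float = 1.0) -> float:
    """Swish: x * sigmoid(beta * x); beta = 1 gives x * sigmoid(x)."""
    return x / (1.0 + math.exp(-beta * x))


def smoothed_targets(label: int, num_classes: int = 100,
                     smoothing: float = 0.3) -> list:
    """1 - smoothing on the true class, smoothing / (num_classes - 1) elsewhere."""
    off_value = smoothing / (num_classes - 1)
    return [1.0 - smoothing if c == label else off_value
            for c in range(num_classes)]


def cosine_lr(epoch: int, base_lr: float = 0.13, t_max: int = 1200,
              eta_min: float = 0.0) -> float:
    """One half-cosine over t_max epochs, so there is no restart.
    eta_min = 0 is an assumed default."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))


def _blend(a, b, lam):
    """Convex combination of two equally sized lists."""
    return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]


def mixup(x1, x2, y1, y2, alpha: float = 1.0, rng=random):
    """Blend two samples and their target distributions with lam ~ Beta(alpha, alpha).
    With alpha = 1 the mixing weight is uniform on [0, 1]."""
    lam = rng.betavariate(alpha, alpha)
    return _blend(x1, x2, lam), _blend(y1, y2, lam), lam


print(cosine_lr(0), cosine_lr(600), cosine_lr(1200))
```

With T_max equal to the number of epochs, the learning rate starts at 0.13, passes through half that value at the midpoint, and reaches the minimum only at the last epoch.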

2-3. Pruning

After training the main network, we apply layer-wise normalized, magnitude-based iterative pruning.
We prune 64% of all weights in the following steps:

10% pruning * 5 times
2.5% pruning * 4 times
2% pruning * 2 times
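
The schedule above can be tallied with a short sketch. The first function only accumulates the stated increments. The second is one possible reading of layer-wise magnitude normalization, in which each layer's magnitudes are divided by that layer's mean magnitude before a single global threshold is applied. That normalization rule and the names `cumulative_sparsity` and `prune_masks` are assumptions for illustration and are not taken from the repository.

```python
def cumulative_sparsity(steps):
    """steps: list of (increment in percent of all weights, repetitions)."""
    targets, total = [], 0.0
    for increment, repetitions in steps:
        for _ in range(repetitions):
            total += increment
            targets.append(total)
    return targets


def prune_masks(layers, sparsity):
    """layers: dict name -> list of weights; returns dict name -> 0/1 keep-masks.
    ASSUMPTION: magnitudes are divided by the layer's mean magnitude, then the
    globally smallest normalized magnitudes are pruned."""
    scored = []
    for name, weights in layers.items():
        mean_abs = sum(abs(w) for w in weights) / len(weights) or 1.0
        scored += [(abs(w) / mean_abs, name, i) for i, w in enumerate(weights)]
    scored.sort()
    masks = {name: [1] * len(weights) for name, weights in layers.items()}
    for _, name, i in scored[:int(len(scored) * sparsity)]:
        masks[name][i] = 0
    return masks


schedule = cumulative_sparsity([(10.0, 5), (2.5, 4), (2.0, 2)])
print(schedule)
# [10.0, 20.0, 30.0, 40.0, 50.0, 52.5, 55.0, 57.5, 60.0, 62.0, 64.0]
```

Normalizing per layer keeps a layer with uniformly small weights from being pruned away entirely by a global threshold.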

2-4. Early Exiting

Although CNN models generally use a static computational graph for the whole dataset, the desirable computational cost can differ from sample to sample.
By letting certain samples exit earlier, considerable FLOPs can be saved without significant accuracy degradation.

To realize this idea, we designed the early exiting module in the following steps:

We selected MBConvBlock[2] as the exiting position.
We added an early exiting module at this position and trained it.
- To preserve the performance of the main network, we freeze the pruned model and update only the parameters of the early exiting module.
- Data
  - Same as the data configuration for the main network
  - Except that (custom) auto augmentation and Mixup are not used
- Model
  - Same as the optimization configuration for the main network
  - Except that label smoothing is not used, the number of epochs is 800, and the soft smoothing loss function* is used.
We introduce a threshold confidence to decide when samples exit. If the maximum class probability from the early exiting module is greater than the threshold confidence, the sample exits and does not go to the end of the main network.
We checked the trade-off by changing the exiting position.
We selected the best exiting position and pruned the early exiting module.
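
The exit rule in the steps above can be sketched as follows. This is a minimal illustration, not the repository's code: `predict_with_early_exit` and its `main_network` argument (a zero-argument callable returning the final logits) are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_early_exit(exit_logits, main_network, threshold=0.85):
    """Return (predicted_class, exited_early).
    The rest of the main network runs only when the early-exit confidence
    is not greater than the threshold."""
    probs = softmax(exit_logits)
    confidence = max(probs)
    if confidence > threshold:
        return probs.index(confidence), True
    final_logits = main_network()
    return final_logits.index(max(final_logits)), False
```

With a threshold of 0.0 every sample exits early, since the maximum softmax probability is always positive.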

* Soft smoothing loss function

We propose a new confidence-aware loss function. Generally, the maximum softmax output of the network is taken as the confidence of the inference result. However, many papers have pointed out that it is over-confident, producing large confidence values even for uncertain samples.
By weighting the loss of each sample with its maximum softmax output, the loss becomes larger when the result has high confidence. This penalty on high confidence keeps the network from being over-confident in uncertain situations.
Concretely, we multiply the cross entropy of each sample by (1 + max_softmax_output). Adding 1 makes the loss function monotonically increasing, which ensures that the learning direction matches our intention.
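
A minimal forward-pass sketch of this loss in plain Python is given below. It multiplies each sample's cross entropy by (1 + maximum softmax output) and averages over the batch. The mean reduction and the function name are assumptions, and the sketch does not address whether gradients flow through the confidence weight in the actual training code.

```python
import math


def soft_smoothing_loss(batch_logits, labels):
    """Mean over the batch of (1 + max softmax probability) * cross entropy."""
    total = 0.0
    for logits, label in zip(batch_logits, labels):
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        log_probs = [z - log_norm for z in logits]
        cross_entropy = -log_probs[label]
        confidence = math.exp(max(log_probs))  # maximum softmax output
        total += (1.0 + confidence) * cross_entropy
    return total / len(labels)


# Uniform logits over 4 classes: confidence 0.25, cross entropy ln(4).
print(soft_smoothing_loss([[0.0, 0.0, 0.0, 0.0]], [0]))
```

Because the weight lies between 1 and 2, the loss is never smaller than the plain cross entropy and never more than twice it.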

The table below shows the trade-off at each position when the threshold confidence is 0.85, 0.88, and 0.9. (These results are obtained from a 60% pruned model with varying early exiting positions.)

Exiting Position	Exiting Path FLOPs*	Added Param #	Early Exiting Ratio	Accuracy**	Score
MBConvBlock[2]	25,512,831 (21.19%)	19,500	20.43%, 18.05%, 16.25%	79.94%, 80.03%, 80.09% (55.81%)	0.006371, 0.006437, 0.006487
MBConvBlock[3]	35,610,263 (29.58%)	19,500	22.92%, 20.24%, 18.33%	80.26%, 80.35%, 80.38% (58.66%)	0.006371, 0.006437, 0.006485
MBConvBlock[4]	47,043,295 (38.91%)	20,660	36.31%, 33.27%, 31.11%	79.94%, 80.11%, 80.19% (64.62%)	0.006208, 0.006274, 0.006321
MBConvBlock[5]	64,294,963 (52.72%)	22,980	50.42%, 47.05%, 44.92%	79.96%, 80.00%, 80.14% (69.85%)	0.006249, 0.006306, 0.006342

From the above results, we chose MBConvBlock[4] as the exiting position and applied it to the final pruned main network.

* Exiting Path FLOPs is the number of math operations for a sample that exits in the middle, and the percentage in parentheses is Exiting Path FLOPs / Total Path FLOPs. For example, if the exiting point is MBConvBlock[2]:

Exiting Path FLOPs = From Upsample to MBConvBlock[2] FLOPs + Early Exiting FLOPs
Total Path FLOPs = Main Network FLOPs + Early Exiting FLOPs
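
As a worked check of these two definitions, the sketch below uses the per-stage math operation counts from the tables in Section 3, where the early exiting module is attached at MBConvBlock[4], and reproduces the MBConvBlock[4] row of the table above. The `expected_ops` helper, which averages the two paths by the early exiting ratio, is an illustrative assumption rather than a formula stated in this document.

```python
# Math operations from Upsample through MBConvBlock[4], in table order.
ops_through_block4 = [11_907, 1_314_648, 1_253_616, 9_757_524,
                      10_388_536, 10_097_432, 10_905_632]
early_exit_ops = 3_314_000
main_network_ops = 117_589_296

exit_path = sum(ops_through_block4) + early_exit_ops
total_path = main_network_ops + early_exit_ops
ratio = exit_path / total_path
print(exit_path, total_path, f"{ratio:.2%}")
# 47043295 120903296 38.91%


def expected_ops(exit_ratio: float) -> float:
    """Average operations per sample if exit_ratio of the samples exit early
    (assumed model: exiting samples pay exit_path, the rest pay total_path)."""
    return exit_ratio * exit_path + (1.0 - exit_ratio) * total_path
```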

** The accuracy in parentheses is the accuracy with threshold confidence of 0.0. (i.e., all samples exit via the early exiting module.)

Finally, we add the early exiting module at the MBConvBlock[4] position of the 64% pruned model.
We also pruned that module to 50% sparsity and confirmed that there is no accuracy drop at the 0.85 confidence level.

* Batchnorm Stabilization Phenomenon
While the early exiting module is trained, the main network is frozen, but the batchnorm buffers (running mean and running variance) in the main network are still updated.
We observed that this increases the accuracy of the main network. We call this phenomenon batchnorm stabilization.

We conjecture that training methods such as Mixup give the training inputs a distribution different from the test data distribution.
Updating the batchnorm buffers with inputs that are not mixed up therefore seems to stabilize them.
This phenomenon was reproduced in most of our experiments, and it did not help at all when Mixup was not used in the main network training.

3. Scoring metric

The table below lists the number of parameters and operations of our model on a 32-bit basis, computed by hand. The table does not count Batchnorm parameters and operations. However, when computing the score, we count Batchnorm as the bias of the preceding convolution layer.
Our score is calculated with 16-bit inputs and parameters and a 32-bit accumulator.

Before pruning:
- Parameter Storage (Score): 0.006534
- Math Operation (Score): 0.008447
- Therefore, score is 0.014981
After pruning:
- 64% pruning
- Mask storage for pruning is included, counted as 1-bit values.
- Parameter Storage (Score): 0.002833
- Math Operation (Score): 0.003267
- Therefore, score is 0.006100
After Early Exiting:
- Threshold confidence: 0.85
- Early Exiting Ratio: 32.17%
- Parameter Storage (Score): 0.003119
- Math Operation (Score): 0.002858
- Therefore, score is 0.005977
After Pruning Early Exiting:
- 50% pruning
- Mask storage for pruning is included, counted as 1-bit values.
- Threshold confidence: 0.85
- Early Exiting Ratio: 29.29%
- Parameter Storage (Score): 0.002995
- Math Operation (Score): 0.002803
- Therefore, score is 0.005798
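
The stage totals above are the sum of the Parameter Storage score and the Math Operation score. The short sketch below recomputes each total from the two reported components; the dictionary layout is only for illustration.

```python
# (Parameter Storage score, Math Operation score, reported total) per stage.
stages = {
    "before pruning": (0.006534, 0.008447, 0.014981),
    "after pruning": (0.002833, 0.003267, 0.006100),
    "after early exiting": (0.003119, 0.002858, 0.005977),
    "after pruning early exiting": (0.002995, 0.002803, 0.005798),
}

for name, (param_score, math_score, reported_total) in stages.items():
    total = param_score + math_score
    print(f"{name}: {total:.6f} (reported {reported_total:.6f})")
```

Adding the early exiting module raises the Parameter Storage score but lowers the Math Operation score by more, so the total still improves.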

Input	Operator	k	s	e	i	o	se	Parameter Storage	MULTI	ADD	Math Operations
32x32x3	Upsample(nearest)	-	-	-	-	-	-	0	11,907	0	11,907
63x63x3	Stem_Conv2d	3	2	-	3	24	-	648	691,920	622,728	1,314,648
31x31x24	MBConvBlock[0]	3	1	1	24	16	0.20	820	669,132	584,484	1,253,616
31x31x16	MBConvBlock[1]	3	1	6	16	24	0.20	5,379	5,167,209	4,590,315	9,757,524
31x31x24	MBConvBlock[2]	3	2	6	24	40	0.20	11,812	5,455,164	4,933,372	10,388,536
15x15x40	MBConvBlock[3]	3	1	6	40	40	0.20	25,448	5,188,584	4,908,848	10,097,432
15x15x40	MBConvBlock[4]	3	1	6	40	48	0.20	27,368	5,620,584	5,285,048	10,905,632
15x15x48	MBConvBlock[5]	3	1	6	48	64	0.20	40,329	8,300,475	7,896,393	16,196,868
15x15x64	MBConvBlock[6]	3	1	6	64	64	0.20	62,220	12,452,004	12,004,428	24,456,432
15x15x64	MBConvBlock[7]	3	2	6	64	80	0.20	68,364	7,549,092	7,228,348	14,777,440
7x7x80	MBConvBlock[8]	3	1	6	80	80	0.20	96,976	4,156,368	4,033,376	8,189,744
7x7x80	MBConvBlock[9]	3	1	6	80	96	0.20	104,456	4,532,688	4,385,392	8,918,080
7x7x96	Head_Conv2d	1	1	-	96	136	-	13,056	659,736	639,744	1,299,480
7x7x136	AveragePool	7	-	-	-	-	-	0	136	6,528	6,664
136	FullyConnected	-	-	-	-	-	-	13,700	13,600	13,600	27,200
100	-	-	-	-	-	-	-	-	-	-	-
Total	-	-	-	-	-	-	-	470,776	60,456,692	57,132,604	117,589,296
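
Some of the Parameter Storage entries in the table can be checked directly from the k, i, and o columns. The sketch below does this for the three layers whose structure is fully determined by the table: the two plain convolutions (weights only, since Batchnorm is not counted here) and the fully connected layer with bias. The helper names are hypothetical.

```python
def conv2d_weights(kernel: int, in_channels: int, out_channels: int) -> int:
    """Weight count of a bias-free k x k convolution."""
    return kernel * kernel * in_channels * out_channels


def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Weight count of a fully connected layer, plus biases if present."""
    return in_features * out_features + (out_features if bias else 0)


stem = conv2d_weights(3, 3, 24)    # Stem_Conv2d: 648
head = conv2d_weights(1, 96, 136)  # Head_Conv2d: 13,056
fc = linear_params(136, 100)       # FullyConnected: 13,700
print(stem, head, fc)
# 648 13056 13700
```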

Input	Operator	Parameter Storage	MULTI	ADD	Math Operations
15x15x40	Early Exiting	20,660	1,717,436	1,596,564	3,314,000
Total	-	491,436	62,174,128	58,729,168	120,903,296
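
As a rough sanity check, the main-network totals can be turned into approximate scores. The sketch assumes the challenge's CIFAR-100 normalization constants (36.5M parameters and 10.49B math operations, which are not stated in this document) and assumes that a 16-bit parameter and a 16-bit multiplication each count as half of a 32-bit one, while 32-bit additions count in full. Because the table excludes Batchnorm, the results should land slightly below the reported before-pruning scores of 0.006534 and 0.008447.

```python
# Assumed normalization constants for CIFAR-100 (not from this document).
BASELINE_PARAMS = 36.5e6
BASELINE_OPS = 10.49e9

# Main-network totals from the table above (32-bit basis, Batchnorm excluded).
params = 470_776
multiplies = 60_456_692
additions = 57_132_604

# Assumed 16-bit accounting: parameters and multiplies count as 16/32 each.
param_score = (params * 16 / 32) / BASELINE_PARAMS
math_score = (multiplies * 16 / 32 + additions) / BASELINE_OPS
print(f"{param_score:.6f} {math_score:.6f}")
```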

3-1. Parameter Storage

3-2. Math Operations

4. Reproduce Process

./reproduce.sh # For reproducing, run bash file.
python main.py ./Config/test.json # For testing our final checkpoint
You can access checkpoints for all steps at https://bit.ly/2pYSFYg

Citing this work

@inproceedings{lee2020sipa,
  title={SIPA: A simple framework for efficient networks},
  author={Lee, Gihun and Bae, Sangmin and Oh, Jaehoon and Yun, Se-Young},
  booktitle={2020 International Conference on Data Mining Workshops (ICDMW)},
  pages={729--736},
  year={2020},
  organization={IEEE}
}

Acknowledgement

This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) [No.2018-0-00278,Development of Big Data Edge Analytics SW Technology for Load Balancing and Active Timely Response].
