LukeAI opened this issue 4 years ago
If CUDNN_HALF=1, mixed-precision FP16/FP32 is used rather than pure half-precision FP16.
Mixed-precision FP16/FP32 can be processed 4x-8x faster on Tensor Cores for large tensors.
Mixed-precision FP16/FP32 on Tensor Cores is enabled for inference if channels and filters are multiples of 8, except in the first layer (if CUDNN_HALF=1). According to nVidia, this should not give any noticeable decrease in accuracy. Without mixed precision, the Tensor Cores can't be used at all.
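The eligibility rule is simple enough to sketch. A minimal illustration (the layer list below is made up, not taken from a real cfg):

```python
# A conv layer can run on Tensor Cores under CUDNN_HALF only if both its
# input channel count and its filter count are multiples of 8.
def tensor_core_eligible(channels, filters):
    return channels % 8 == 0 and filters % 8 == 0

# Illustrative (channels, filters) pairs for a few conv layers.
layers = [(3, 32), (32, 64), (64, 30)]
for i, (c, f) in enumerate(layers):
    print(i, tensor_core_eligible(c, f))
# layer 0 fails (3 input channels), layer 2 fails (30 filters)
```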
Currently mixed-precision FP16/FP32 is disabled for training, because it reduces the mAP. Even if mixed precision isn't used in the activation functions, it seems it still requires a loss scale, which differs between models - this forces you to manually find a loss-scale coefficient for each new model.
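The loss-scaling problem mentioned above comes from FP16 gradient underflow. A minimal sketch of static loss scaling (the scale value and function names are illustrative, not darknet's API):

```python
import numpy as np

# Illustrative static loss scale; the right value is model-dependent,
# which is exactly the manual-tuning problem described above.
LOSS_SCALE = 1024.0

def scaled_backward(loss, grad_fn):
    """Scale the loss before the FP16 backward pass, unscale in FP32 after."""
    grads_fp16 = np.float16(grad_fn(loss * LOSS_SCALE))  # backward in FP16
    # Unscaling in FP32 recovers tiny gradients that would otherwise have
    # flushed to zero in FP16.
    return np.float32(grads_fp16) / LOSS_SCALE

# A toy gradient function; a gradient of ~2e-8 underflows to 0.0 in plain
# FP16, but survives when the loss is scaled first.
toy_grad_fn = lambda loss: loss * 2.0
assert np.float16(toy_grad_fn(np.float32(1e-8))) == 0.0
grad = scaled_backward(np.float32(1e-8), toy_grad_fn)
```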
What do you mean by "if channels are a multiple of 8"? Obviously most images have 3 channels.
I mean intermediate conv-layers. Only the first conv-layer has 3 channels (since the image has 3 channels).
Each [convolutional] layer has: channels (inputs) and filters (outputs).
Right! That's very good to know, because pruned models generally don't have filter counts that are multiples of 8. I haven't done any speed benchmarking of them yet, but that means they'll be losing a lot of inference speed. You could probably get a better, more efficient model by just tweaking the filter numbers upwards to the next multiple of 8.
Yes, because nVidia Tensor Cores can't process [convolutional] layers if channels or filters are not a multiple of 8.
Hmm... would it therefore give me a small performance improvement if I artificially increased the number of classes in my cfg, to make the number of convolutional filters in the pre-YOLO conv layers a multiple of 8?
E.g. currently I have filters=30 (classes=5), but I could change it to classes=11 (6 dummy classes) and therefore filters=48. This is x1.6 more FLOPs on those layers, but if they are then processed by Tensor Cores at x4 speed, it should take x0.4 the time overall for those three pre-YOLO layers.
It can give a small acceleration.
But the bottleneck is in the 1st [conv] layer; it can take more than 20% of total execution time, since it has input channels = 3 and a high width x height.
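The arithmetic in the question above can be sketched, assuming YOLOv3's filters=(classes+5)*3 rule for the conv layers before each [yolo] layer. Since gcd(3, 8) = 1, 3*(classes+5) is a multiple of 8 only when (classes+5) itself is:

```python
# filters = (classes + 5) * anchors_per_scale in the conv layer before
# each [yolo] layer (YOLOv3 uses 3 anchors per scale).
def yolo_filters(classes, anchors_per_scale=3):
    return (classes + 5) * anchors_per_scale

def padded_classes(classes):
    """Smallest class count >= classes that makes filters a multiple of 8."""
    c = classes
    while yolo_filters(c) % 8 != 0:
        c += 1
    return c

print(yolo_filters(5))                      # 30 -- not Tensor Core eligible
print(padded_classes(5))                    # 11 (6 dummy classes), filters = 48
print(yolo_filters(11) / yolo_filters(5))   # x1.6 more FLOPs, as noted above
```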
https://github.com/ArtyZe/yolo_quantization - quantization-aware training
This repo has supported fp16 training and inference for some time.
Running inference at fp16 currently offers the best AP/FPS trade-off, but I have found that training at fp16 results in a relatively small degradation in accuracy. So currently I train a model at fp32 and then run it in fp16 at runtime.
Tensorflow's Quantization Aware training allows you to train at full precision but with quantization effects accurately modeled - "The quantization error is modeled using fake quantization nodes to simulate the effect of quantization in the forward and backward passes. The forward-pass models quantization, while the backward-pass models quantization as a straight-through estimator. Both the forward- and backward-pass simulate the quantization of weights and activations. Note that during back propagation, the parameters are updated at high precision as this is needed to ensure sufficient precision in accumulating tiny adjustments to the parameters."
https://github.com/tensorflow/tensorflow/tree/r1.14/tensorflow/contrib/quantize
This supposedly gives you a boost in AP when running at lower precision compared to simply running an fp32-trained model at fp16.
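The fake-quantization idea quoted above can be illustrated with a toy min/max affine quantizer (this is a sketch of the concept, not TensorFlow's actual implementation):

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Forward pass: snap x onto a num_bits integer grid, then dequantize."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    q = np.clip(np.round((x - lo) / scale), qmin, qmax)
    return q * scale + lo  # values now lie on the quantized grid

def fake_quantize_grad(grad_output):
    """Backward pass (straight-through estimator): treat the rounding as
    identity and pass gradients through unchanged."""
    return grad_output

# FP32 master weights: the forward pass sees quantized values, but the
# tiny updates accumulate at full precision, as the quoted docs describe.
w = np.array([0.013, -0.402, 0.275], dtype=np.float32)
w_quantized = fake_quantize(w)                # used in the forward pass
w -= fake_quantize_grad(np.float32(1e-4))     # update applied to FP32 weights
```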