AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

Quantization Aware Training [Feature Request] #4362

Open LukeAI opened 4 years ago

LukeAI commented 4 years ago

This repo has supported fp16 training and inference for some time.

Running inference at fp16 currently offers the best AP/FPS trade-off, but I have found that training at fp16 results in a relatively small degradation in accuracy. So currently I train a model at fp32 and then run it at fp16 at runtime.

Tensorflow's Quantization Aware training allows you to train at full precision but with quantization effects accurately modeled - "The quantization error is modeled using fake quantization nodes to simulate the effect of quantization in the forward and backward passes. The forward-pass models quantization, while the backward-pass models quantization as a straight-through estimator. Both the forward- and backward-pass simulate the quantization of weights and activations. Note that during back propagation, the parameters are updated at high precision as this is needed to ensure sufficient precision in accumulating tiny adjustments to the parameters."

https://github.com/tensorflow/tensorflow/tree/r1.14/tensorflow/contrib/quantize

This supposedly gives you a boost in AP when running at lower precision compared to simply running an fp32-trained model at fp16.
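For reference, a minimal sketch of what the fake-quantization nodes described above do, written in C for illustration (this is not darknet's or TensorFlow's actual API): the forward pass rounds values onto a low-precision grid, while the backward pass uses the straight-through estimator, so the master parameters keep being updated at full precision.

```c
#include <math.h>

/* Hypothetical fake-quantization helpers sketching the idea quoted above;
 * the grid parameters (min, max, levels) are illustrative assumptions. */

/* Forward pass: simulate quantization by clamping x to [min, max] and
 * rounding it onto a uniform grid with `levels` steps (e.g. 256 for int8). */
static float fake_quantize(float x, float min, float max, int levels)
{
    float scale = (max - min) / (float)(levels - 1);
    float clamped = fminf(fmaxf(x, min), max);
    return min + roundf((clamped - min) / scale) * scale;
}

/* Backward pass: straight-through estimator - the rounding is treated as
 * the identity, so the incoming gradient passes through unchanged
 * (zeroed outside the clamping range). The weight update itself is then
 * applied to the full-precision master copy of the parameter. */
static float fake_quantize_grad(float x, float grad, float min, float max)
{
    return (x >= min && x <= max) ? grad : 0.f;
}
```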

AlexeyAB commented 4 years ago
LukeAI commented 4 years ago

What do you mean by "if channels are a multiple of 8"? Obviously most images have only 3 channels.

AlexeyAB commented 4 years ago

I mean the intermediate conv-layers. Only the first conv-layer has 3 input channels (since the image has 3 channels).

Each [convolutional] layer has:

https://github.com/AlexeyAB/darknet/blob/d43e09cdf24708b61cbd159822860dedbf756f1f/src/convolutional_kernels.cu#L420-L421
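The linked check is roughly of the following form (a paraphrase for illustration; the exact condition is at the link): Tensor-Core math is only enabled for a layer when both its input channels and its filter count are divisible by 8.

```c
/* Rough paraphrase of the condition in convolutional_kernels.cu,
 * not the exact darknet code. `c` = input channels, `n` = filters. */
int tensor_cores_usable(int c, int n)
{
    return (c % 8 == 0) && (n % 8 == 0);
}
```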

LukeAI commented 4 years ago

Right! That's very good to know, because the pruned models generally don't have filter counts that are a multiple of 8. I haven't done any speed benchmarking of them yet, but that means they'll be losing a lot of inference speed. I could probably get a better, more efficient model by just tweaking the filter numbers upwards to the next multiple of 8.

AlexeyAB commented 4 years ago

Yes, because NVIDIA Tensor Cores can't process [convolutional] layers if channels or filters are not a multiple of 8.

LukeAI commented 4 years ago

hmm.... would it therefore give me a small performance improvement if I artificially increased the number of classes in my cfg to make the number of convolutional filters in the pre-YOLO conv layers a multiple of 8?

E.g. currently I have filters=30 (classes=5), but I could change it to classes=11 (6 dummy classes) and therefore filters=48. That is 1.6x more FLOPS on those layers, but if they are then processed by Tensor Cores at 4x speed, it should take 0.4x the time overall for those three pre-YOLO layers.
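The arithmetic written out, assuming the usual YOLO head formula filters = (classes + 5) * 3 and the nominal 4x Tensor-Core speedup used in this thread:

```c
#include <stdio.h>

int main(void)
{
    /* YOLO head: filters = (classes + 5) * 3 */
    int filters_now    = (5  + 5) * 3;  /* classes=5  -> 30 (not a multiple of 8) */
    int filters_padded = (11 + 5) * 3;  /* classes=11 -> 48 (multiple of 8)       */

    float flops_ratio = (float)filters_padded / filters_now;  /* 48/30 = 1.6 */
    float tensor_core_speedup = 4.0f;   /* nominal speedup assumed in the thread */
    float time_ratio = flops_ratio / tensor_core_speedup;     /* 1.6/4 = 0.4 */

    printf("FLOPS ratio: %.1fx, expected time: %.1fx of the original\n",
           flops_ratio, time_ratio);
    return 0;
}
```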

AlexeyAB commented 4 years ago

It can give a small acceleration.

But the bottleneck is the 1st [conv] layer; it can take more than 20% of all execution time, since its input has only 3 channels and the layer runs at a high width x height.
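To see why that layer dominates: it cannot satisfy the multiple-of-8 condition (3 input channels) and it runs at full image resolution, so its per-pixel work is multiplied by a very large width x height. A rough multiply-accumulate count for a 3x3 conv layer, with illustrative numbers rather than an actual darknet profile:

```c
/* Rough multiply-accumulate count for a 3x3 convolution layer:
 * macs = out_w * out_h * filters * in_channels * 3 * 3
 * Illustrative only; real timings depend on the cfg, the GPU, and cuDNN. */
long long conv3x3_macs(long long out_w, long long out_h,
                       long long filters, long long in_channels)
{
    return out_w * out_h * filters * in_channels * 3 * 3;
}

/* e.g. with a 608x608 input, the first layer (3 channels -> 32 filters)
 * does conv3x3_macs(608, 608, 32, 3) MACs at full resolution without
 * Tensor Cores, while deeper layers run at much lower resolution and
 * (when channels/filters are multiples of 8) at fp16 Tensor-Core speed. */
```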

fengxiuyaun commented 3 years ago

https://github.com/ArtyZe/yolo_quantization provides quantization-aware training for YOLO.