LukeAI opened this issue 4 years ago
If CUDNN_HALF=1, mixed-precision FP16/FP32 is used rather than pure half-precision FP16.
Mixed-precision FP16/FP32 can be processed 4x-8x faster on Tensor Cores for large tensors.
Mixed-precision FP16/FP32 on Tensor Cores is enabled for inference if channels and filters are multiples of 8, except in the first layer (if CUDNN_HALF=1). According to nVidia, this should not give any noticeable decrease in accuracy. Without mixed precision, the Tensor Cores can't be used at all.
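The eligibility rule is simple enough to sketch. A minimal illustration (the layer list below is made up, not taken from a real cfg):

```python
# A conv layer can run on Tensor Cores under CUDNN_HALF only if both its
# input channel count and its filter count are multiples of 8.
def tensor_core_eligible(channels, filters):
    return channels % 8 == 0 and filters % 8 == 0

# Illustrative (channels, filters) pairs for a few conv layers.
layers = [(3, 32), (32, 64), (64, 30)]
for i, (c, f) in enumerate(layers):
    print(i, tensor_core_eligible(c, f))
# layer 0 fails (3 input channels), layer 2 fails (30 filters)
```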
Currently mixed-precision FP16/FP32 is disabled for training, because it reduces the mAP. Even if mixed precision isn't used in the activation functions, it seems it still requires a loss scale, which differs between models - this forces you to manually find a loss-scale coefficient for each new model.
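The loss-scaling problem mentioned above comes from FP16 gradient underflow. A minimal sketch of static loss scaling (the scale value and function names are illustrative, not darknet's API):

```python
import numpy as np

# Illustrative static loss scale; the right value is model-dependent,
# which is exactly the manual-tuning problem described above.
LOSS_SCALE = 1024.0

def scaled_backward(loss, grad_fn):
    """Scale the loss before the FP16 backward pass, unscale in FP32 after."""
    grads_fp16 = np.float16(grad_fn(loss * LOSS_SCALE))  # backward in FP16
    # Unscaling in FP32 recovers tiny gradients that would otherwise have
    # flushed to zero in FP16.
    return np.float32(grads_fp16) / LOSS_SCALE

# A toy gradient function; a gradient of ~2e-8 underflows to 0.0 in plain
# FP16, but survives when the loss is scaled first.
toy_grad_fn = lambda loss: loss * 2.0
assert np.float16(toy_grad_fn(np.float32(1e-8))) == 0.0
grad = scaled_backward(np.float32(1e-8), toy_grad_fn)
```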
What do you mean by "if channels are a multiple of 8"? Obviously most images have 3 channels.
I mean intermediate conv-layers. Only the first conv-layer has 3 channels (since the image has 3 channels).
Each [convolutional] layer has: channels (inputs) and filters (outputs).
Right! That's very good to know, because pruned models generally don't have filter counts that are multiples of 8. I haven't done any speed benchmarking of them yet, but that means they'll be losing a lot of inference speed. You could probably get a better, more efficient model by just tweaking the filter numbers upwards to the next multiple of 8.
Yes, because nVidia Tensor Cores can't process [convolutional] layers if channels or filters are not a multiple of 8.
Hmm... would it therefore give me a small performance improvement if I artificially increased the number of classes in my cfg, to make the number of convolutional filters in the pre-YOLO conv layers a multiple of 8?
E.g. currently I have filters=30 (classes=5), but I could change it to classes=11 (6 dummy classes) and therefore filters=48. This is x1.6 more FLOPs on those layers, but if they are then processed by Tensor Cores at x4 speed, it should take x0.4 the time overall for those three pre-YOLO layers.
It can give a small acceleration.
But the bottleneck is in the 1st [conv] layer; it can take more than 20% of total execution time, since it has input channels = 3 and a high width x height.
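The arithmetic in the question above can be sketched, assuming YOLOv3's filters=(classes+5)*3 rule for the conv layers before each [yolo] layer. Since gcd(3, 8) = 1, 3*(classes+5) is a multiple of 8 only when (classes+5) itself is:

```python
# filters = (classes + 5) * anchors_per_scale in the conv layer before
# each [yolo] layer (YOLOv3 uses 3 anchors per scale).
def yolo_filters(classes, anchors_per_scale=3):
    return (classes + 5) * anchors_per_scale

def padded_classes(classes):
    """Smallest class count >= classes that makes filters a multiple of 8."""
    c = classes
    while yolo_filters(c) % 8 != 0:
        c += 1
    return c

print(yolo_filters(5))                      # 30 -- not Tensor Core eligible
print(padded_classes(5))                    # 11 (6 dummy classes), filters = 48
print(yolo_filters(11) / yolo_filters(5))   # x1.6 more FLOPs, as noted above
```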
https://github.com/ArtyZe/yolo_quantization - quantization-aware training
This repo has supported fp16 training and inference for some time.
Running inference at fp16 currently offers the best AP/FPS trade-off, but I have found that training at fp16 results in a relatively small degradation in accuracy. So currently I train a model at fp32 and then run it in fp16 at runtime.
Tensorflow's Quantization Aware training allows you to train at full precision but with quantization effects accurately modeled - "The quantization error is modeled using fake quantization nodes to simulate the effect of quantization in the forward and backward passes. The forward-pass models quantization, while the backward-pass models quantization as a straight-through estimator. Both the forward- and backward-pass simulate the quantization of weights and activations. Note that during back propagation, the parameters are updated at high precision as this is needed to ensure sufficient precision in accumulating tiny adjustments to the parameters."
https://github.com/tensorflow/tensorflow/tree/r1.14/tensorflow/contrib/quantize
This supposedly gives you a boost in AP when running at lower precision compared to simply running an fp32-trained model at fp16.
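The fake-quantization idea quoted above can be illustrated with a toy min/max affine quantizer (this is a sketch of the concept, not TensorFlow's actual implementation):

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Forward pass: snap x onto a num_bits integer grid, then dequantize."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    q = np.clip(np.round((x - lo) / scale), qmin, qmax)
    return q * scale + lo  # values now lie on the quantized grid

def fake_quantize_grad(grad_output):
    """Backward pass (straight-through estimator): treat the rounding as
    identity and pass gradients through unchanged."""
    return grad_output

# FP32 master weights: the forward pass sees quantized values, but the
# tiny updates accumulate at full precision, as the quoted docs describe.
w = np.array([0.013, -0.402, 0.275], dtype=np.float32)
w_quantized = fake_quantize(w)                # used in the forward pass
w -= fake_quantize_grad(np.float32(1e-4))     # update applied to FP32 weights
```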