Closed Chitti21 closed 3 years ago
The higher the parallelism of the device, the smaller the impact of BFLOPS on speed.
Scaled-YOLOv4 utilizes massively parallel devices such as GPUs much more efficiently than EfficientDet. For example, the GPU V100 (Volta) has a peak performance of 14 TFLOPS (112 TFLOPS on Tensor Cores): https://images.nvidia.com/content/technologies/volta/pdf/tesla-volta-v100-datasheet-letter-fnl-web.pdf

If we test both models on a GPU V100 with batch = 1, `-hparams = mixed_precision = true`, and without TensorRT (FP32), then:
- YOLOv4-CSP (640x640) — 47.5% AP — 70 FPS — 120 BFLOPS (60 FMA). Based on BFLOPS it should run at 933 FPS = (112,000 / 120), but in fact we get 70 FPS, i.e. 7.5% of the GPU is used = (70 / 933).
- EfficientDet-D3 (896x896) — 47.5% AP — 36 FPS — 50 BFLOPS (25 FMA). Based on BFLOPS it should run at 2,240 FPS = (112,000 / 50), but in fact we get 36 FPS, i.e. 1.6% of the GPU is used = (36 / 2240).
That is, on devices with massive parallel computing such as GPUs, the efficiency of the computing operations used in YOLOv4-CSP is (7.5 / 1.6) = 4.7x better than the efficiency of the operations used in EfficientDet-D3. Usually, neural networks are run on the CPU only in research tasks, for easier debugging, and the BFLOPS figure is currently only of academic interest. In real-world tasks, real speed and accuracy are what matter: the real speed of YOLOv4-P6 is 3.7x faster than EfficientDet-D7 on a GPU V100. Therefore, devices with massive parallelism (GPU / NPU / TPU / DSP), with much better speed, price, and heat dissipation, are almost always used.
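The utilization arithmetic above can be sketched as a few lines of Python. The numbers are the ones quoted in this thread; the 112 TFLOPS peak is the V100 Tensor-Core figure from the linked datasheet, and `gpu_utilization` is just a helper name for this sketch, not part of any framework.

```python
def gpu_utilization(measured_fps, bflops, peak_tflops=112.0):
    """Return (theoretical FPS at peak compute, fraction of peak actually used).

    theoretical FPS = peak throughput in GFLOPS / GFLOPs needed per image,
    i.e. 112,000 / BFLOPS for the V100 Tensor-Core peak.
    """
    theoretical_fps = peak_tflops * 1000.0 / bflops
    return theoretical_fps, measured_fps / theoretical_fps

# Numbers from the comparison above
yolo_fps, yolo_util = gpu_utilization(measured_fps=70, bflops=120)
effdet_fps, effdet_util = gpu_utilization(measured_fps=36, bflops=50)

print(f"YOLOv4-CSP:      theoretical {yolo_fps:.0f} FPS, utilization {yolo_util:.1%}")
print(f"EfficientDet-D3: theoretical {effdet_fps:.0f} FPS, utilization {effdet_util:.1%}")
print(f"Relative efficiency: {yolo_util / effdet_util:.1f}x")
```

Running this reproduces the figures in the thread: ~933 and ~2,240 theoretical FPS, 7.5% vs. 1.6% utilization, and the 4.7x efficiency ratio.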
Hi...
Can someone clarify the effect of BFLOPS on the detection speed of the networks?
YOLOv3 has 65.9 BFLOPS; YOLOv4 has 128.5 BFLOPS.
But the FPS of YOLOv4 is still better than that of YOLOv3.
What makes YOLOv4 better in terms of detection speed?