Open HankYe opened 1 year ago
Hi! It's difficult to guess what the issue might be, but the most likely reason is that cuDNN uses TF32 in the background, which makes it an unfair comparison, because we did not implement Tensor Core support.
Can you test it again using Float32 as the data type and setting these two flags?
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
This should make the dense inference speed more comparable to cuDNN's.
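For reference, a minimal sketch of how those flags could be set before timing a float32 forward pass (the model and input here are placeholders, not the exact benchmark setup):

```python
import torch

# Disable TF32 so both cuDNN and DeltaCNN run plain float32 math.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

model = model.cuda().float().eval()  # placeholder: e.g. your YOLOv5 model
x = torch.randn(1, 3, 608, 1088, device="cuda", dtype=torch.float32)

with torch.no_grad():
    y = model(x)
```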
Regarding the sparse inference: this should actually be faster in all cases if the thresholds are well tuned and you achieve a reasonable sparsity on the input. That said, I never tested DeltaCNN on YOLOv5, but I don't see a reason why it should not perform well there. Did you use a larger threshold on the input than for the rest of the network? What speedup do you get if you use a huge threshold just for testing?
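As a quick sanity check (plain PyTorch, not the DeltaCNN API), you could measure how much of the input actually changes between consecutive frames for a given threshold; if that fraction stays close to 1, the sparse path has little to skip:

```python
import torch

def changed_fraction(prev_frame, cur_frame, threshold):
    """Fraction of pixels whose maximum channel-wise delta exceeds the threshold."""
    delta = (cur_frame - prev_frame).abs().amax(dim=1)  # shape (N, H, W)
    return (delta > threshold).float().mean().item()

# Random tensors stand in for two consecutive video frames.
prev = torch.rand(1, 3, 608, 1088)
cur = prev + 0.01 * torch.randn_like(prev)
for t in [0.0, 0.05, 0.1, 0.5]:
    print(f"threshold={t}: changed fraction={changed_fraction(prev, cur, t):.3f}")
```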
Regarding memory consumption: yes, it is definitely higher, because we have to store the feature map at every non-linear layer. That's a trade-off of this method.
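A rough way to compare peak GPU memory between the two setups, using standard PyTorch utilities (the forward pass below is a placeholder):

```python
import torch

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model(x)  # placeholder forward pass with your model and input
torch.cuda.synchronize()
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
```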
Hi, authors! Thanks for your great work! I have a question about the FPS at large batch sizes. We tested the latency at batch size 1 on high-end GPUs, and the result is aligned with the reported speedup in Table 1. However, when we increase the batch size to 32 (or smaller values such as 4 or 16), as Table 1 does, the latency of both dense and sparse inference is higher than with cuDNN, which contradicts the results reported in Table 1. The memory overhead is also much larger than with cuDNN. The experiment was conducted with YOLOv5s on MOT16 on a Tesla V100 GPU. The input size was set to (1088, 608); we also tested an input size of (640, 640) with similar results. I'd appreciate it greatly if you could give some explanation!
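For context, a minimal sketch of one way such a batched latency measurement can be done with CUDA-event timing and warm-up; the model and random inputs are placeholders for the actual YOLOv5s/MOT16 setup:

```python
import torch

def measure_latency(model, batch_size=32, shape=(3, 608, 1088), iters=50, warmup=10):
    """Average per-batch forward latency in milliseconds, measured with CUDA events."""
    x = torch.randn(batch_size, *shape, device="cuda", dtype=torch.float32)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(warmup):        # warm-up iterations excluded from timing
            model(x)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```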