Haiyang-W / DSVT

[CVPR2023] Official Implementation of "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets"
https://arxiv.org/abs/2301.06051
Apache License 2.0
353 stars · 28 forks

Edge computing device deployment issues #66

Closed: cyhasuka closed this 7 months ago

cyhasuka commented 7 months ago

Hello,

Thanks for sharing your excellent work!

I am trying to deploy the DSVT algorithm to edge computing devices.

I referenced DSVT-AI-TRT and the initial TensorRT code that you provided, and deployed the model with TensorRT on an NVIDIA T4 compute card. In the testing phase, the average inference time is around 230 ms @ FP32 and 120 ms @ FP16, which is reasonably acceptable.

However, when we tried to deploy this on the NVIDIA Jetson Xavier NX, the average inference time with TensorRT was 1465 ms @ FP32 and 709 ms @ FP16, as shown in the figures below. The data show that the inference and post-process phases take much longer, which is far from our goal of real-time inference.

[figures: per-stage timing breakdowns (inference / postprocess) on Jetson Xavier NX, FP32 and FP16]
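For reference, this is roughly how I collect the average inference time: a warmup phase first (so engine deserialization, CUDA context creation, and clock ramp-up on the Jetson are not counted), then an averaged timing loop. This is a generic sketch, not the actual benchmarking code; `infer_fn` stands in for whatever callable runs one TensorRT inference.

```python
import time
from statistics import mean, stdev

def benchmark(infer_fn, warmup: int = 10, iters: int = 100):
    """Return mean/std latency of infer_fn in milliseconds.

    warmup iterations are run first and discarded, so one-time costs
    (engine load, CUDA init, GPU clock ramp-up) do not skew the average.
    """
    for _ in range(warmup):
        infer_fn()
    times_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        infer_fn()
        times_ms.append((time.perf_counter() - t0) * 1e3)
    return {"mean_ms": mean(times_ms), "std_ms": stdev(times_ms)}
```

On a Jetson it also helps to lock the clocks (`sudo jetson_clocks`) before benchmarking, otherwise DVFS makes the numbers noisy.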

After some research, I found that the RTX 3090 delivers 35.6 TFLOPS, the T4 16.1 TFLOPS, and the Xavier NX 6.8 TFLOPS. I don't quite understand why a 2.2x compute gap between the RTX 3090 and the T4 translates into roughly 10x longer elapsed time (calculated from your data), while the 2.36x gap between the T4 and the Xavier NX translates into roughly 7x. Why is the gap so large? Do you have any ideas that could help us reduce the elapsed time when deploying this algorithm on the Xavier NX?
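The ratio arithmetic above can be written out explicitly; the TFLOPS figures and latencies are the ones quoted in this thread, nothing else is assumed. The measured slowdown (≈6x at FP32, ≈6x at FP16) far exceeds the ≈2.4x peak-compute gap, which hints that peak TFLOPS alone does not determine latency (memory bandwidth and kernel overheads also differ between the cards).

```python
# Peak compute (TFLOPS) and measured average latencies (ms) from the thread.
tflops = {"T4": 16.1, "XavierNX": 6.8}
latency_ms = {
    "T4":       {"fp32": 230.0,  "fp16": 120.0},
    "XavierNX": {"fp32": 1465.0, "fp16": 709.0},
}

compute_ratio = tflops["T4"] / tflops["XavierNX"]
slowdown_fp32 = latency_ms["XavierNX"]["fp32"] / latency_ms["T4"]["fp32"]
slowdown_fp16 = latency_ms["XavierNX"]["fp16"] / latency_ms["T4"]["fp16"]

print(f"T4/NX peak-compute ratio: {compute_ratio:.2f}x")   # 2.37x
print(f"NX/T4 latency ratio FP32: {slowdown_fp32:.1f}x")   # 6.4x
print(f"NX/T4 latency ratio FP16: {slowdown_fp16:.1f}x")   # 5.9x
```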

Thank you very much!

Haiyang-W commented 7 months ago

Sorry about that; we have only tried it on the RTX 3090 and A100. Perhaps it is related to TensorRT and the internal implementation in PyTorch? Unfortunately, we are not experts in this field, so we may not be able to offer you valuable advice.