Haiyang-W / DSVT

[CVPR2023] Official Implementation of "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets"
https://arxiv.org/abs/2301.06051
Apache License 2.0
353 stars · 28 forks

Edge computing device deployment issues #66

Closed: cyhasuka closed this 7 months ago

cyhasuka commented 7 months ago

Hello,

Thanks for sharing your excellent work!

I am trying to deploy the DSVT algorithm to edge computing devices.

I referenced DSVT-AI-TRT and the initial TensorRT code that you provided, and deployed the model with TensorRT on an NVIDIA T4 compute card. In the testing phase, the average inference time is around 230 ms @ FP32 and 120 ms @ FP16, which is reasonably acceptable.

However, when we tried to deploy this on the NVIDIA Jetson Xavier NX, the average inference time with TensorRT was 1465 ms @ FP32 and 709 ms @ FP16, as shown in the figures below. The data show that the inference and post-process phases take much longer, which is far from our goal of real-time inference.

[figures: per-stage timing breakdowns (inference / postprocess) on Jetson Xavier NX, FP32 and FP16]
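For reference, this is roughly how I collect the average inference time: a warmup phase first (so engine deserialization, CUDA context creation, and clock ramp-up on the Jetson are not counted), then an averaged timing loop. This is a generic sketch, not the actual benchmarking code; `infer_fn` stands in for whatever callable runs one TensorRT inference.

```python
import time
from statistics import mean, stdev

def benchmark(infer_fn, warmup: int = 10, iters: int = 100):
    """Return mean/std latency of infer_fn in milliseconds.

    warmup iterations are run first and discarded, so one-time costs
    (engine load, CUDA init, GPU clock ramp-up) do not skew the average.
    """
    for _ in range(warmup):
        infer_fn()
    times_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        infer_fn()
        times_ms.append((time.perf_counter() - t0) * 1e3)
    return {"mean_ms": mean(times_ms), "std_ms": stdev(times_ms)}
```

On a Jetson it also helps to lock the clocks (`sudo jetson_clocks`) before benchmarking, otherwise DVFS makes the numbers noisy.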

After some research, I found that the RTX 3090 delivers 35.6 TFLOPS, the T4 16.1 TFLOPS, and the Xavier NX 6.8 TFLOPS. I don't quite understand why a 2.2x compute gap between the RTX 3090 and the T4 translates into roughly 10x longer elapsed time (calculated from your data), while the 2.36x gap between the T4 and the Xavier NX translates into roughly 7x. Why is the gap so large? Do you have any ideas that could help us reduce the elapsed time when deploying this algorithm on the Xavier NX?
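The ratio arithmetic above can be written out explicitly; the TFLOPS figures and latencies are the ones quoted in this thread, nothing else is assumed. The measured slowdown (≈6x at FP32, ≈6x at FP16) far exceeds the ≈2.4x peak-compute gap, which hints that peak TFLOPS alone does not determine latency (memory bandwidth and kernel overheads also differ between the cards).

```python
# Peak compute (TFLOPS) and measured average latencies (ms) from the thread.
tflops = {"T4": 16.1, "XavierNX": 6.8}
latency_ms = {
    "T4":       {"fp32": 230.0,  "fp16": 120.0},
    "XavierNX": {"fp32": 1465.0, "fp16": 709.0},
}

compute_ratio = tflops["T4"] / tflops["XavierNX"]
slowdown_fp32 = latency_ms["XavierNX"]["fp32"] / latency_ms["T4"]["fp32"]
slowdown_fp16 = latency_ms["XavierNX"]["fp16"] / latency_ms["T4"]["fp16"]

print(f"T4/NX peak-compute ratio: {compute_ratio:.2f}x")   # 2.37x
print(f"NX/T4 latency ratio FP32: {slowdown_fp32:.1f}x")   # 6.4x
print(f"NX/T4 latency ratio FP16: {slowdown_fp16:.1f}x")   # 5.9x
```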

Thank you very much!

Haiyang-W commented 7 months ago

Sorry about that; we have only tried it on the RTX 3090 and A100. Perhaps it is related to TensorRT and the internal implementation in PyTorch? Unfortunately, we are not experts in this field, so we may not be able to offer you valuable advice.