ceccocats / tkDNN

Deep neural network library and toolkit to do high performance inference on NVIDIA Jetson platforms
GNU General Public License v2.0

Unable to replicate FPS results on AGX Xavier #275

Open lpkoh opened 2 years ago

lpkoh commented 2 years ago

Hi,

I am using an AGX Xavier. I followed the instructions to run the demo for 2D object detection. I built a yolo4_fp16.rt model, which is a 416x416 model. I then ran ./demo yolo4_fp16.rt with batch = 1 and got ~9 FPS. This is significantly less than the ~41 FPS reported. Screenshots are below:

[screenshot]

[screenshot]
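For reference, the commands I ran were roughly the following (reconstructed from the README flow; the test video path is just my local file, so treat it as a placeholder):

```bash
# Rough reconstruction of my steps, following the tkDNN README
export TKDNN_MODE=FP16                         # export the engine in FP16
cd build
./test_yolo4                                   # generates yolo4_fp16.rt (416x416)
./demo yolo4_fp16.rt ../demo/yolo_test.mp4 y   # 'y' selects the YOLO demo path
```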

I have no other background processes running. CUDA_VISIBLE_DEVICES is not set. My nvpmodel is set to mode 1 (settings below), and I have run sudo jetson_clocks.

[screenshot: nvpmodel settings]
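For anyone double-checking the same settings, these are the stock Jetson commands I used (mode numbering as reported by nvpmodel on the AGX Xavier):

```bash
sudo nvpmodel -q     # query the current power mode
sudo nvpmodel -m 0   # MAXN (uncapped); -m 1 is a capped power mode
sudo jetson_clocks   # lock clocks to the maximum for the current mode
```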

I am aware from some of the other issues that this reported FPS corresponds to inference only, so I am unsure why it is so slow (significantly slower than testing with TensorRT directly via ./trtexec).

lpkoh commented 2 years ago

[screenshot]

Result on csp. I don't think it is a thermal throttling issue, as the Jetson AGX Xavier is cool to the touch and I have a fan blowing directly at it.

lpkoh commented 2 years ago

Hi,

I have repeated the experiment. The original run was with a low-power setting.

This is my environment:

Other details:

Results: [screenshot]

Another result, this time with Yolov4-csp 512x512 FP16: [screenshot]

I have two questions:

  1. The first result does not match the 41.01 FPS from the AGX Xavier yolo4 416 result (screenshot attached). Why is this so? Could it be because I am on MODE 30W 6CORE vs the MAXN setting for the AGX Xavier? I can't test this, as I face an issue where the device shuts off when I run it with nvpmodel -m 0.
  2. The results seem slower than pure TensorRT. I ran a separate experiment with just ./trtexec on darknet weights that were converted to TRT (a sketch of the invocation I mean follows this list). I ran this multiple times, including on Yolov4-csp 512x512 FP16 (same number of classes, filters, etc.), with the same nvpmodel and jetson_clocks settings. However, I obtained 37.4 FPS vs the 21.9 FPS above. As both are inference-only FPS, this would imply TensorRT + tkDNN is actually slower, all else held constant (as far as I can see). Is there a reason for this? Am I failing to max out tkDNN in some way? It seems to be slower than raw TensorRT.
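The trtexec comparison mentioned in point 2 was along these lines (the engine filename is a placeholder for my converted darknet model; flags from the TensorRT 7 trtexec):

```bash
# Sketch of the trtexec benchmark; yolov4_csp_fp16.engine is a placeholder name
# (precision is already baked into a serialized engine)
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_csp_fp16.engine \
    --iterations=100 --avgRuns=10
```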
lpkoh commented 2 years ago

I have re-run the test at commit adac857, thinking it might be due to this issue: https://github.com/ceccocats/tkDNN/issues/226

However, the results have actually worsened slightly, to ~18 FPS on yolo4-csp. Can anyone advise?
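For completeness, the rebuild at that commit was the standard CMake flow (the test binary name here is hypothetical; use whichever test matches your network):

```bash
# Sketch: rebuild tkDNN at a given commit and regenerate the engine
git checkout adac857
cd build && cmake .. && make -j"$(nproc)"
rm -f yolo4-csp_fp16.rt   # .rt engines are serialized per build/TensorRT version
./test_yolo4csp           # hypothetical test binary for the csp network
```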

mive93 commented 2 years ago

Hi @lpkoh,

Three considerations:

Finally, yolo4-csp is not Yolov4: it is Scaled-YOLOv4, which is slower but more accurate.

Let me know if you have further questions.

mive93 commented 2 years ago

Actually, I get very similar results for Yolov4 and Yolov4-csp. These results were obtained on a Xavier AGX with Jetpack 4.5, at full precision (FP32), selecting only those models in this script.

| test | avg ms | min ms | max ms | avg FPS |
|------------------|---------|---------|---------|---------|
| yolo4_fp32_2 | 47.3199 | 46.5271 | 63.1509 | 21.1328 |
| yolo4-csp_fp32_2 | 51.1207 | 50.8716 | 51.8859 | 19.5615 |
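(For reading the table: the avg FPS column is simply 1000 / avg ms, e.g. 1000 / 47.3199 ≈ 21.13.)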
lpkoh commented 2 years ago

Hi, thank you for replying on this.

I am confused. You said here, and in https://github.com/ceccocats/tkDNN/issues/186 and https://github.com/ceccocats/tkDNN/issues/173, that what the demo prints on screen is preprocessing + inference + postprocessing. I took what the demo prints on screen to be the demo output, hence I thought the inference-only FPS on tkDNN was slower than the inference-only FPS from ./trtexec. Where do I find the demo output that corresponds to inference alone, with no pre/post-processing? I can't find that information here.

Also, as I understand it, tkDNN is a wrapper around TensorRT and cuDNN. Does this mean it is actually meant to be faster than just running ./trtexec on a Jetson board, at least theoretically?

mive93 commented 2 years ago

Yeah, you are actually right. In the past the demo also printed pre/post-processing time, but currently it prints the inference time only, so what you see is the inference time. The same holds for ./test_rtinference and the script scripts/test_all_tests.sh.
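For reference, timing a serialized engine with that test looks roughly like this (the second argument is, if I remember the usage correctly, the batch size to benchmark):

```bash
# Sketch only; check the README for the exact test_rtinference arguments
cd build
./test_rtinference yolo4_fp16.rt 1   # engine file, assumed batch argument
```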

Yes, tkDNN is just a wrapper around TensorRT and cuDNN. It is a framework we use to optimize neural networks for our projects. We do not develop it because it is faster, but to easily port models that are not otherwise supported.

lpkoh commented 2 years ago

Ah, gotcha. So I guess the difference between ~27 FPS on yolo4 416x416 vs ~44 in your repo is probably down to MAXN? Could the TensorRT version difference be an issue? I am using 7; your repo mentions 8. I have heard 8 is faster, but for things like transformers, not YOLO.

mive93 commented 2 years ago

Maybe it's due to MAXN and jetson_clocks. Jetpack 4.5 uses TensorRT 7. TensorRT 8, which will be supported by tkDNN very soon, is actually slower on Jetson platforms for now. We hope NVIDIA will solve the issue in the next minor release.

mive93 commented 2 years ago

TensorRT 8 is now supported on the tensorrt8 branch.
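To try it, the standard build flow on that branch should work roughly like this (CMake steps per the README):

```bash
# Sketch: building the tensorrt8 branch (standard CMake flow assumed)
git clone https://github.com/ceccocats/tkDNN.git
cd tkDNN && git checkout tensorrt8
mkdir build && cd build
cmake .. && make -j"$(nproc)"
```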

masip85 commented 1 year ago

Is TensorRT 8 still slower?