NVIDIA-ISAAC-ROS / isaac_ros_dnn_inference

NVIDIA-accelerated DNN model inference ROS 2 packages using NVIDIA Triton/TensorRT for both Jetson and x86_64 with CUDA-capable GPU
https://developer.nvidia.com/isaac-ros-gems
Apache License 2.0

Triton inference very slow #15

Closed GEngels closed 1 year ago

GEngels commented 2 years ago

I am trying to run inference with Triton on a Jetson Xavier AGX. I am using MAXN power mode but only seem to get 1-2 FPS on PeopleNet with half precision. I am using the settings and configs in isaac_ros_object_detection, and I am running everything in the Docker container from isaac_ros_common. I am using an Intel RealSense camera with 1280x800 images. Building seems to go fine, and when I launch the node with "ros2 launch isaac_ros_detectnet isaac_ros_detectnet.launch.py" it displays the following information:

triton_start.txt

When running it shows:

[component_container-1] 2022-07-01 13:13:38.271 DEBUG extensions/triton/inferencers/triton_inferencer_impl.cpp@394: Triton Async Event DONE for index = 1430
[component_container-1] 2022-07-01 13:13:38.271 DEBUG extensions/triton/inferencers/triton_inferencer_impl.cpp@423: Trying to load inference for index: 1430
[component_container-1] 2022-07-01 13:13:38.271 DEBUG extensions/triton/inferencers/triton_inferencer_impl.cpp@433: Successfully loaded inference for: 1430
[component_container-1] 2022-07-01 13:13:38.272 DEBUG extensions/triton/inferencers/triton_inferencer_impl.cpp@483: Raw Outputs size = 2
[component_container-1] 2022-07-01 13:13:38.272 DEBUG extensions/triton/inferencers/triton_inferencer_impl.cpp@495: Batch size output 'output_bbox/BiasAdd' = 1
[component_container-1] 2022-07-01 13:13:38.272 DEBUG extensions/triton/inferencers/triton_inferencer_impl.cpp@495: Batch size output 'output_cov/Sigmoid' = 1
[component_container-1] 2022-07-01 13:13:38.272 DEBUG extensions/triton/inferencers/triton_inferencer_impl.cpp@575: incomplete_inference_count_ = 0
[component_container-1] 2022-07-01 13:13:38.272 DEBUG extensions/triton/inferencers/triton_inferencer_impl.cpp@577: Last inference reached; setting Async state to WAIT
[component_container-1] 2022-07-01 13:13:39.384 DEBUG extensions/triton/inferencers/triton_inferencer_impl.cpp@305: input tensor name = input_1
[component_container-1] 2022-07-01 13:13:39.401 DEBUG external/com_nvidia_gxf/gxf/std/scheduling_terms.cpp@434: Sending event notification for entity 8

Everything seems to run, just very slowly. For a moment we thought it might be running on the CPU, also because the --gpus flag is not passed to docker run, but adding it didn't change anything, so that doesn't seem to be the case either. Any idea what we are doing wrong and how we can get the reported speed?
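To narrow down where the time is going, a small in-process rate monitor can be attached to the detection output topic, independent of ros2 topic hz running from outside the container. The following is only a rough sketch: it assumes an rclpy environment and a vision_msgs/Detection2DArray output on a topic named detectnet/detections, which may not match your graph.

```python
# Minimal sketch: measure how fast messages actually arrive in-process.
# Topic name and message type are assumptions; adjust them to your graph.
import time

import rclpy
from rclpy.node import Node
from vision_msgs.msg import Detection2DArray


class RateMonitor(Node):
    def __init__(self):
        super().__init__('rate_monitor')
        self._count = 0
        self._start = time.monotonic()
        self.create_subscription(Detection2DArray, 'detectnet/detections',
                                 self._callback, 10)

    def _callback(self, _msg):
        self._count += 1
        elapsed = time.monotonic() - self._start
        if elapsed >= 5.0:  # report roughly every 5 seconds
            self.get_logger().info(f'{self._count / elapsed:.2f} msgs/s')
            self._count = 0
            self._start = time.monotonic()


def main():
    rclpy.init()
    rclpy.spin(RateMonitor())


if __name__ == '__main__':
    main()
```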

Doch88 commented 2 years ago

Hi! @hemalshahNV or @jaiveersinghNV is there some update about this issue?

hemalshahNV commented 2 years ago

Very strange. My first thought would have been power mode, but it looks like you have that covered.

We just released Isaac ROS Developer Preview based on ROS2 Humble with significant changes to isaac_ros_dnn_inference to support NITROS (NVIDIA Isaac Transport for ROS). Could you try running again with the new release and see if the slow down persists?

Doch88 commented 2 years ago

Unfortunately, we have to use ROS2 Foxy so using ROS2 Humble is not an option for us. Does the new release also work with ROS2 Foxy?

hemalshahNV commented 2 years ago

We leveraged new features in ROS2 Humble that we worked with Open Robotics on, so Isaac ROS DP no longer works with ROS2 Foxy. The previous release should have worked on Foxy, of course, so we will need to diagnose the original problem further.

Doch88 commented 2 years ago

Yeah, I think that if the problem is there with the previous version, then it will also be there in this newer one. We checked many things but could not locate the issue. Maybe you know what we can check? I thought it was something related to the Triton configuration, like the chosen scheduler.

kajanan-nvidia commented 2 years ago

Hello, two things you can try out to isolate the problem are:

Doch88 commented 2 years ago

We are indeed using the pruned version of the model, although with FP16 inference. This model worked fine and was very fast when used directly with another DeepStream app, so I would expect it to be the same here.

Using ros2 topic hz I noticed some weird behavior. The camera node (the RealSense D455 node) was started outside of a Docker container, i.e. directly on the local machine. Outside of Docker, ros2 topic hz reports an average rate of roughly 10 Hz, which is slow, but that is a separate issue we are trying to solve in the RealSense ROS wrapper GitHub repo. If I run ros2 topic hz inside the Docker container (with the proper ROS domain settings configured), it drops to 4 Hz, so half the speed for some reason. And ros2 topic hz for /tensor_sub is 1 Hz.

So indeed the problem is not the inference but something strange happening to the messages.

Doch88 commented 2 years ago

Running the RealSense node inside the container increased the rate of /tensor_sub by 0.4 Hz, which is still not enough. /tensor_pub is around 0.8 Hz. There is something inside the Isaac ROS encoder that is very slow; I will try to check.

Doch88 commented 2 years ago

It seems that the encoder publishes correctly and at the right speed until someone, like the Isaac ROS DNN inference node, subscribes to it. Then it drops from ~9-10 Hz to ~0.4-0.5 Hz. I guess it is probably something related to the DDS implementation.

hemalshahNV commented 2 years ago

We had also observed this slowdown as reported by ros2 topic hz inside and outside of the container. We haven't narrowed down where the issue could be yet (ros2 topic hz itself, FastRTPS using shared memory mounted between the container and host, or something else). Since all of the critical message traffic is intra-process, however, only "external" subscribers like ros2 topic hz monitor topics via DDS, which means the application itself may not be experiencing any such slowdown. Also, once you have an external subscriber, NITROS is forced to convert back to ROS messages to service that subscriber, which slows things down considerably.
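A practical consequence of the above is that the camera, encoder, inference, and decoder nodes should all be composed into the same component container so NITROS can keep the tensor traffic intra-process. A rough launch-file sketch under that assumption is shown below; the package/plugin names for the camera and encoder components are illustrative and need to match your actual graph.

```python
# Sketch of composing everything into one container so NITROS can keep the
# tensor traffic intra-process. Component package/plugin names below are
# illustrative placeholders; use the ones from your actual launch files.
from launch import LaunchDescription
from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode


def generate_launch_description():
    container = ComposableNodeContainer(
        name='detectnet_container',
        namespace='',
        package='rclcpp_components',
        executable='component_container_mt',
        composable_node_descriptions=[
            ComposableNode(package='realsense2_camera',
                           plugin='realsense2_camera::RealSenseNodeFactory',
                           name='camera'),
            ComposableNode(package='isaac_ros_dnn_encoders',
                           plugin='nvidia::isaac_ros::dnn_inference::DnnImageEncoderNode',
                           name='dnn_image_encoder'),
            # ... the Triton node and DetectNet decoder node go here as well ...
        ],
        output='screen',
    )
    return LaunchDescription([container])
```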

Doch88 commented 2 years ago

We probably found the problem. It seems that the post-processing node was slowing everything down: I think hybrid clustering was enabled, and that was incredibly slow. Removing the whole clustering part from the post-processing node and replacing it with an NMS from OpenCV seems to solve everything. By doing this, we reach the same FPS as the camera.
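For reference, the kind of OpenCV-based NMS replacement described above can be as small as the sketch below. It assumes the decoder has already produced candidate boxes as [x, y, w, h] with per-box confidences; the thresholds are arbitrary example values, not the ones used by isaac_ros_detectnet.

```python
# Sketch: replace clustering with plain non-maximum suppression via OpenCV.
# `boxes` ([x, y, w, h]) and `scores` are assumed to come from the decoder's
# raw candidates; thresholds here are arbitrary examples.
import cv2
import numpy as np


def nms_filter(boxes, scores, score_threshold=0.35, nms_threshold=0.45):
    """Return the boxes/scores that survive confidence thresholding + NMS."""
    if len(boxes) == 0:
        return [], []
    indices = cv2.dnn.NMSBoxes(boxes, scores, score_threshold, nms_threshold)
    indices = np.array(indices).reshape(-1)  # OpenCV versions differ in output shape
    return [boxes[i] for i in indices], [scores[i] for i in indices]


# Example usage: two heavily overlapping candidates and one low-score one.
boxes = [[100, 100, 80, 160], [104, 98, 82, 158], [400, 50, 60, 120]]
scores = [0.9, 0.85, 0.2]
kept_boxes, kept_scores = nms_filter(boxes, scores)
print(kept_boxes, kept_scores)  # the duplicate and the weak box are dropped
```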

hemalshahNV commented 2 years ago

Were you seeing this slowdown with the isaac_ros_detectnet decoder or your own?

Doch88 commented 2 years ago

We were using the original decoder from isaac_ros_object_detection.

hemalshahNV commented 2 years ago

Thanks for bringing this up. We hadn't noticed any significant slowdown in DetectNet with PeopleNet, but we'll investigate. Our benchmarks may not be exposing the slowdowns caused by post-processing DBScan filtering.

hemalshahNV commented 2 years ago

We fixed an error in the DetectNet decoder, which was running DBScan filtering before thresholding the bounding box candidates. Please reverify in Isaac ROS DP1.1 (v0.11.0).
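For context, the fix amounts to discarding low-confidence candidates before any clustering runs, so the expensive clustering step only sees a handful of surviving boxes instead of the full grid of candidates. A schematic sketch of that ordering (not the actual decoder code) follows:

```python
# Schematic of the fix described above: threshold first, then cluster.
# This is not the decoder's actual code, just the ordering it implies.
import numpy as np


def decode(coverage, bboxes, confidence_threshold=0.35):
    """coverage: (H, W) confidence grid; bboxes: (H, W, 4) box candidates."""
    # 1. Threshold the dense candidate grid first (cheap, vectorized).
    keep = coverage > confidence_threshold
    candidates = bboxes[keep]   # typically only a few dozen boxes survive
    scores = coverage[keep]

    # 2. Only now run the comparatively expensive clustering/NMS step
    #    on the surviving candidates instead of every grid cell.
    return cluster(candidates, scores)


def cluster(candidates, scores):
    # Placeholder for the grouping step (e.g. DBScan or NMS).
    return candidates, scores
```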

GEngels commented 2 years ago

We are currently migrating to your latest release, but we do seem to have some problems with the speed (only 15 FPS), and the filtering of overlapping boxes does not seem to remove many boxes with the default settings. We see a lot of overlapping boxes in the output.

EDIT: the speed is not an issue anymore; it was related to the RealSense camera. But we would like to have an option to disable the post-processing so we can use only NMS without DBScan. Would that be possible to implement?

hemalshahNV commented 2 years ago

We'll pencil this in for an upcoming release for sure. It would be possible to implement now for you but you'd have to reimplement the rest of the decoder as well.