NVIDIA-ISAAC-ROS / isaac_ros_pose_estimation

Deep learned, NVIDIA-accelerated 3D object pose estimation
https://developer.nvidia.com/isaac-ros-gems
Apache License 2.0

Is this the correct way to resolve an illegal memory access error during Triton sample run? #12

Closed: shifsa closed this issue 1 year ago

shifsa commented 1 year ago

I tried the sample run of pose estimation with Triton described at https://github.com/NVIDIA-ISAAC-ROS/isaac_ros_pose_estimation/blob/main/docs/dope-triton.md. During inference, the following error occurred and Triton stopped.

[component_container_mt-1] I0126 01:35:07.639006 9674 tensorrt.cc:1546] Created instance Ketchup on GPU 1 with stream priority 0 and optimization profile default[0];
[component_container_mt-1] I0126 01:35:07.639250 9674 model_lifecycle.cc:693] successfully loaded 'Ketchup' version 1
[component_container_mt-1] [INFO] [1674696951.465031002] [NitrosContext]: [NitrosContext] Loading application: '/workspaces/isaac_ros-dev/install/isaac_ros_nitros/share/isaac_ros_nitros/config/type_adapter_nitros_context_graph.yaml'
[component_container_mt-1] INFO: infer_simple_runtime.cpp:70 TrtISBackend id:90 initialized model: Ketchup
[component_container_mt-1] 2023-01-26 01:35:51.851 WARN  extensions/triton/inferencers/triton_inferencer_impl.cpp@419: Incomplete inference; response appeared out of order; invalid: index = 1
[component_container_mt-1] 2023-01-26 01:35:51.851 WARN  extensions/triton/inferencers/triton_inferencer_impl.cpp@421: Inference appeared out of order
[component_container_mt-1] 2023-01-26 01:35:53.663 WARN  extensions/triton/inferencers/triton_inferencer_impl.cpp@419: Incomplete inference; response appeared out of order; invalid: index = 2
[component_container_mt-1] 2023-01-26 01:35:53.663 WARN  extensions/triton/inferencers/triton_inferencer_impl.cpp@421: Inference appeared out of order
[component_container_mt-1] ERROR: infer_trtis_server.cpp:259 Triton: TritonServer response error received., triton_err_str:Internal, err_msg:INPUT__0: failed to perform CUDA copy: an illegal memory access was encountered
[component_container_mt-1] ERROR: infer_trtis_backend.cpp:603 Triton server failed to parse response with request-id:436 model:
[component_container_mt-1] ERROR: infer_trtis_server.cpp:259 Triton: TritonServer response error received., triton_err_str:Internal, err_msg:INPUT__0: failed to perform CUDA copy: an illegal memory access was encountered
[component_container_mt-1] ERROR: infer_trtis_backend.cpp:603 Triton server failed to parse response with request-id:437 model:
[component_container_mt-1] [ERROR] [1674696993.171526632] [NitrosImage]: [convert_to_custom] cudaMemcpy2D failed for conversion from sensor_msgs::msg::Image to NitrosImage: cudaErrorIllegalAddress (an illegal memory access was encountered)
[component_container_mt-1] 2023-01-26 01:36:33.172 WARN  gxf/std/greedy_scheduler.cpp@221: Error while executing entity 101 named 'UHCWSJBWPU_tensor_copier': GXF_OUT_OF_MEMORY
[component_container_mt-1] 2023-01-26 01:36:33.172 ERROR gxf/std/entity_executor.cpp@200: Entity with 108 not found!
[component_container_mt-1] [ERROR] [1674696993.172143709] [dope_inference]: [NitrosPublisher] Vault ("vault/vault", eid=108) was stopped. The graph may have been terminated due to an error.
[component_container_mt-1] terminate called after throwing an instance of 'std::runtime_error'
[component_container_mt-1] terminate called recursively
[ERROR] [component_container_mt-1]: process has died [pid 9674, exit code -6, cmd '/opt/ros/humble/install/lib/rclcpp_components/component_container_mt --ros-args -r __node:=dope_container -r __ns:=/'].

I searched for the error message and found a suggestion to set the environment variable CUDA_LAUNCH_BLOCKING=1. With that variable set, the sample now runs without errors. Is this the correct way to resolve the error?
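
For reference, a minimal way to apply this workaround is to set the variable only for the launch process. The package and launch file names below are assumptions taken from the dope-triton tutorial; substitute whatever launch command you are actually using.

# Force synchronous CUDA kernel launches for this run only.
# Package and launch file names are assumed from the dope-triton tutorial.
CUDA_LAUNCH_BLOCKING=1 ros2 launch isaac_ros_dope isaac_ros_dope_triton.launch.py

# Or export it for the whole shell session:
export CUDA_LAUNCH_BLOCKING=1
ros2 launch isaac_ros_dope isaac_ros_dope_triton.launch.py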

hemalshahNV commented 1 year ago

The error means your device ran out of GPU memory while loading this model and running inference. Setting CUDA_LAUNCH_BLOCKING=1 forces CUDA kernel launches to execute synchronously, one after another, rather than asynchronously in parallel, which likely lowers your peak GPU memory usage (high-water mark) enough for the sample to run. It will resolve the error for you at the cost of some performance, since kernel launches can no longer overlap.
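
A quick way to check the memory-pressure explanation is to watch GPU memory while the graph runs; if memory.used approaches memory.total just before the illegal memory access errors appear, GPU memory exhaustion is the likely cause. This is only a diagnostic sketch using the standard nvidia-smi tool that ships with the NVIDIA driver.

# Log GPU memory usage once per second while the dope-triton graph is running.
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1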