autowarefoundation / autoware

Autoware - the world's leading open-source software project for autonomous driving
https://www.autoware.org/
Apache License 2.0

Docker Image OpenADK with CUDA support has an issue with CUDA installation #4765

Closed gitoabdelgawad closed 1 week ago

gitoabdelgawad commented 1 month ago

Description

Inside the ghcr.io/autowarefoundation/autoware-openadk:latest-devel-cuda container, I'm trying to use the tensorrt_yolox package. The package includes some CUDA kernels, which fail to build, and the build prints the following warning:

--- stderr: tensorrt_yolox
CMake Warning at CMakeLists.txt:19 (message): CUDA is not found. preprocess acceleration using CUDA will not be available.

It seems that the CMake variable CMAKE_CUDA_COMPILER is not set.

Then while using tensorrt_yolox for object detection, the system crashes with the following error:

[tensorrt_yolox_node_exe-2] /home/os/elm/autoware/install/tensorrt_yolox/lib/tensorrt_yolox/tensorrt_yolox_node_exe: symbol lookup error: /home/os/elm/autoware/install/tensorrt_yolox/lib/libtensorrt_yolox.so: undefined symbol: _ZN14tensorrt_yolox50resize_bilinear_letterbox_nhwc_to_nchw32_batch_gpuEPfPhiiiiiiifP11CUstream_st
[ERROR] [tensorrt_yolox_node_exe-2]: process has died [pid 977, exit code 127, cmd '/home/os/elm/autoware/install/tensorrt_yolox/lib/tensorrt_yolox/tensorrt_yolox_node_exe --ros-args -r __node:=tensorrt_yolox --params-file /tmp/launch_params_d1ll7q3z --params-file /tmp/launch_params_cq_ya7ic -r ~/in/image:=/fr_camera/image_rect -r ~/out/objects:=roi0'].

The missing symbol is actually a CUDA kernel that failed to build previously.
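
One way to check this claim is to pull the mangled name out of the log line and look it up in the library. The sketch below assumes the error text is piped in on stdin; the helper name extract_undefined_symbol is hypothetical and not part of Autoware.

```shell
#!/bin/sh
# Hypothetical helper: print the mangled name that follows
# "undefined symbol: " on each matching line read from stdin.
extract_undefined_symbol() {
  sed -n 's/.*undefined symbol: \([^ ]*\).*/\1/p'
}
```

Piping the result through c++filt (from binutils) should demangle it to a function in the tensorrt_yolox namespace named resize_bilinear_letterbox_nhwc_to_nchw32_batch_gpu, and running nm -D --defined-only on libtensorrt_yolox.so shows whether the library actually exports it.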

Expected behavior

  1. The OpenADK Docker image should have CUDA support and be able to build tensorrt_yolox properly, so that the missing-symbol runtime error no longer occurs.

Actual behavior

tensorrt_yolox builds with a warning and skips the CUDA kernels, which leads to a runtime crash later.

Steps to reproduce

Inside the ghcr.io/autowarefoundation/autoware-openadk:latest-devel-cuda container:

  1. source autoware/install/setup.bash
  2. colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release --packages-select tensorrt_yolox
     You should see the CMake warning mentioned above.
  3. ros2 launch tensorrt_yolox yolox_s_plus_opt.launch.xml input/image:=/img output/objects:=/roi0
     This is an example of launching an object detection model. Once you subscribe to the output topic with ros2 topic echo /roi0, you should get the runtime error mentioned above.
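
The warning in step 2 is easy to miss in a long build, so it can help to check the build log for it explicitly. This is a minimal sketch; the helper name build_skipped_cuda is hypothetical, and the log path in the usage note assumes colcon's default log layout.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the given build log contains
# the CMake warning quoted in this issue, fail (non-zero) otherwise.
build_skipped_cuda() {
  grep -q 'CUDA is not found' "$1"
}
```

For example, build_skipped_cuda log/latest_build/tensorrt_yolox/stderr.log && echo "CUDA kernels were skipped" would flag the problem right after the build, before the node is launched. The log path is an assumption and may differ in your workspace.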

Versions

No response

Possible causes

After some investigation, including building the official CUDA Samples to track down the issue, it turned out that some CUDA libraries were missing:

/usr/bin/ld: cannot find -lcudadevrt
/usr/bin/ld: cannot find -lcudart_static
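
The two linker errors correspond to the static archives libcudadevrt.a and libcudart_static.a. A quick way to confirm they are absent is sketched below; the helper name missing_cuda_static_libs is hypothetical, and the directory to inspect is an assumption that depends on how the CUDA toolkit is installed in the image.

```shell
#!/bin/sh
# Hypothetical helper: print each static CUDA runtime archive named in the
# linker errors that is absent from the given directory.
# Exit status is 0 when both archives are present, 1 otherwise.
missing_cuda_static_libs() {
  lib_dir="$1"
  status=0
  for lib in libcudadevrt.a libcudart_static.a; do
    if [ ! -f "${lib_dir}/${lib}" ]; then
      printf '%s\n' "${lib}"
      status=1
    fi
  done
  return "${status}"
}
```

For instance, missing_cuda_static_libs /usr/local/cuda/lib64 would list whichever of the two archives is not there, assuming that conventional toolkit location applies to the image.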

After applying the following patch and rebuilding the Docker image, the CUDA kernels were built and the object detection model ran well.

From 52d5e470d616118d0089e1ff25e5c8016a95450b Mon Sep 17 00:00:00 2001
From: Osama Abdelgawad <oaohaeg@gmail.com>
Date: Wed, 22 May 2024 16:01:59 +0200
Subject: [PATCH] docker change

---
 docker/autoware-openadk/Dockerfile | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/docker/autoware-openadk/Dockerfile b/docker/autoware-openadk/Dockerfile
index 23d260f0..320262ff 100644
--- a/docker/autoware-openadk/Dockerfile
+++ b/docker/autoware-openadk/Dockerfile
@@ -88,9 +88,7 @@ ENV CXX="/usr/lib/ccache/g++"
 RUN --mount=type=ssh \
   ./setup-dev-env.sh -y --module all ${SETUP_ARGS} --no-cuda-drivers openadk \
   && pip uninstall -y ansible ansible-core \
-  && apt-get autoremove -y && apt-get clean -y && rm -rf /var/lib/apt/lists/* "$HOME"/.cache \
-  && find / -name 'libcu*.a' -delete \
-  && find / -name 'libnv*.a' -delete
+  && apt-get autoremove -y && apt-get clean -y && rm -rf /var/lib/apt/lists/* "$HOME"/.cache

 # Install rosdep dependencies
 COPY --from=src-imported /autoware/src /autoware/src
-- 
2.34.1
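
The patch helps because the glob libcu*.a in the removed cleanup step matches both libcudart_static.a and libcudadevrt.a, the two archives the linker could not find. A dry-run sketch of that cleanup, which lists the matches instead of deleting them, makes this visible; the helper name list_deleted_archives is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list (without deleting) the files under a directory
# that the two removed `find ... -delete` steps would have matched.
list_deleted_archives() {
  find "$1" \( -name 'libcu*.a' -o -name 'libnv*.a' \) -print | sort
}
```

Running it against the CUDA install directory of an image built without the cleanup step shows which static archives the original Dockerfile was discarding.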

Additional context

No response

xmfcx commented 2 weeks ago

@gitoabdelgawad @oguzkaganozt does/did this PR fix this issue?

gitoabdelgawad commented 1 week ago

> @gitoabdelgawad @oguzkaganozt does/did this PR fix this issue?
>
> * [feat(docker): fix CUDA compile on devel image and improve run.sh #4849](https://github.com/autowarefoundation/autoware/pull/4849)

Yes, this PR fixes the issue. Thanks, I will close the issue.