autowarefoundation / autoware

Autoware - the world's leading open-source software project for autonomous driving
https://www.autoware.org/
Apache License 2.0

fix(cuda): install NVML development library #4621

Closed ito-san closed 2 months ago

ito-san commented 2 months ago

Description

Recently, after running setup-dev-env.sh and installing the NVIDIA libraries, part of NVML (the nvml.h header) is not installed. This affects the gpu_monitor node in system_monitor, which uses NVML: gpu_monitor determines that NVML is not available and publishes errors because it cannot access the GPU.

image

See also https://github.com/autowarefoundation/autoware.universe/issues/6787.

I'd like to explicitly install NVML as a workaround for this issue.
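As a rough illustration of what an explicit install could look like, the sketch below derives a package name from a CUDA version. It assumes the NVML development files are shipped in NVIDIA's CUDA apt repository as `cuda-nvml-dev-<major>-<minor>`; the helper name `nvml_dev_package` is hypothetical and is not taken from this PR, and the actual change may be implemented differently (for example in the setup scripts' own configuration).

```shell
# Hypothetical helper (not part of this PR): map a CUDA version such as
# "12.3" or "12.3.2" to the assumed NVML development package name.
nvml_dev_package() {
  # Keep only major.minor, then convert the dot to a dash: 12.3 -> 12-3.
  printf 'cuda-nvml-dev-%s\n' "$(printf '%s' "$1" | cut -d. -f1,2 | tr . -)"
}

# Example usage (commented out because it modifies the system):
# sudo apt-get install -y "$(nvml_dev_package 12.3)"
```

The install line is left commented out so that the helper can be sourced and inspected without changing the system.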

Tests performed

  1. Completely remove NVIDIA drivers and libraries.

    sudo apt purge cuda-*
    sudo apt purge nvidia-*
    sudo apt purge libcudnn*
    sudo apt purge libnv*
  2. Confirm that only hwloc/nvml.h exists.

    ❯ find /usr -type f -name nvml.h
    /usr/include/hwloc/nvml.h
  3. Run setup-dev-env.sh and install the NVIDIA libraries.

    ❯ ./setup-dev-env.sh
    ...
    [Warning] Some Autoware components depend on the CUDA, cuDNN and TensorRT NVIDIA libraries which have end-user license agreements that should be reviewed before installation.
    Install NVIDIA libraries? [y/N]: y
  4. Confirm that NVIDIA's nvml.h is installed.

    ❯ find /usr -type f -name nvml.h
    /usr/local/cuda-12.3/targets/x86_64-linux/include/nvml.h
    /usr/include/hwloc/nvml.h
  5. Delete the build and install directories for system_monitor.

    ❯ rm -rf install/system_monitor/ build/system_monitor/
  6. Build system_monitor and confirm that the build uses NVML (GPU PLATFORM: nvml) and completes successfully.

    ❯ colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-w" -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache --event-handlers console_direct+ --packages-select system_monitor
    ...
    -- SYSTEM_PROCESSOR: x86_64
    -- CPU PLATFORM: intel
    -- GPU PLATFORM: nvml
    ...
    Finished <<< system_monitor [3min 2s]
    
    Summary: 1 package finished [3min 4s]
      1 package had stderr output: system_monitor
  7. Run Autoware.

    ros2 launch autoware_launch planning_simulator.launch.xml map_path:=/data/sample-map-planning vehicle_model:=sample_vehicle sensor_model:=sample_sensor_kit launch_system_monitor:=true
  8. Run runtime_monitor and confirm that gpu_monitor does not report an error.

    ros2 run rqt_runtime_monitor rqt_runtime_monitor

image
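The header check in steps 2 and 4 and the build-log check in step 6 could be scripted roughly as follows. This is a minimal sketch, not part of the PR: the function names `find_nvidia_nvml_headers` and `gpu_platform_from_log` are hypothetical, and the sketch assumes that any nvml.h outside a `hwloc` directory is NVIDIA's header and that the build prints a line of the form `-- GPU PLATFORM: <name>` as in the output above.

```shell
# Hypothetical helper: list nvml.h files under a root directory (default /usr),
# skipping hwloc's own header of the same name.
find_nvidia_nvml_headers() {
  find "${1:-/usr}" -type f -name nvml.h ! -path '*/hwloc/*' 2>/dev/null
}

# Hypothetical helper: read build output on stdin and print the value of the
# first "-- GPU PLATFORM:" line, or nothing if no such line is present.
gpu_platform_from_log() {
  sed -n 's/^[[:space:]]*-- GPU PLATFORM:[[:space:]]*//p' | head -n 1
}

# Example usage (commented out; the build command is abbreviated here):
# find_nvidia_nvml_headers /usr
# colcon build --packages-select system_monitor --event-handlers console_direct+ 2>&1 \
#   | gpu_platform_from_log
```

An empty result from the first helper corresponds to the state in step 2, where only hwloc/nvml.h exists; a result of `nvml` from the second corresponds to the expected output in step 6.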

Effects on system behavior

Not applicable.
