ApolloAuto / apollo

An open autonomous driving platform

Can I upgrade the CUDA and TensorRT version inside the apollo docker? #14858

Open mgcho0608 opened 1 year ago

mgcho0608 commented 1 year ago

Hi, I'm having trouble using Apollo 8.0 with an RTX 4090 (#14808, #14836). While analyzing why Apollo is not working, I found many errors related to the GPU architecture (sm_89, compute_89, nvcc), so I am planning to upgrade to the latest CUDA and TensorRT versions suitable for the RTX 4090 to see if that resolves the issues.

  1. Is there any problem with upgrading CUDA and TensorRT? I have seen posts stating that model inference time increased after upgrading TensorRT, but since Apollo doesn't work at all right now, increased latency is a secondary concern for me.

  2. CUDA and TensorRT are reset to versions 11.1 and 7.2.1, respectively, every time the Docker container is started. Is there any way to keep newly installed versions of CUDA and TensorRT after removing the default ones?

daohu527 commented 1 year ago

Of course, but the corresponding TensorRT interfaces may need to be modified. We have upgraded to a newer version, but currently it mainly targets Orin.

mgcho0608 commented 1 year ago

Thanks for your answer! Is there any way to set the CUDA and TensorRT versions before starting the Docker container? I'm trying to remove and reinstall CUDA and TensorRT, but I'm facing lots of dependency issues (such as E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).). Also, since dev_start.sh and dev_into.sh reset the CUDA and TensorRT versions to the defaults, my approach cannot be a real solution.

daohu527 commented 1 year ago

You need to rebuild the Docker image! This may take some time. I will see whether an upgrade is coming, but to be honest you may have to wait for a while.

mgcho0608 commented 1 year ago

Thanks for your answer! I'll follow these steps to rebuild the Docker image and try to solve my issues. Please let me know if anything is wrong or unclear.

  1. Modify the Docker build script build_docker.sh to change the versions of CUDA, cuDNN, and TensorRT.
CUDA_LITE=
CUDNN_VERSION=
TENSORRT_VERSION=
function determine_cuda_versions() {
    local arch="$1"
    local dist="$2"
    if [[ "${arch}" == "x86_64" ]]; then
        if [[ "${dist}" == "stable" ]]; then
            CUDA_LITE=11.1
            CUDNN_VERSION="8.0.4.30"
            TENSORRT_VERSION="7.2.1"
        else # testing
            CUDA_LITE=11.1
            CUDNN_VERSION="8.0.4.30"
            TENSORRT_VERSION="7.2.1"
        fi
    else # aarch64
        CUDA_LITE="10.2"
        CUDNN_VERSION="8.0.0.180"
        TENSORRT_VERSION="7.1.3"
    fi
}

Here, I'll change CUDA_LITE to 12.0, CUDNN_VERSION to 8.8.0, and TENSORRT_VERSION to 8.6.0.
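For example, the x86_64 branch would then read as follows (assuming the install scripts can actually fetch matching CUDA 12.0 / cuDNN 8.8.0 / TensorRT 8.6.0 packages, which I haven't verified):

CUDA_LITE=12.0
CUDNN_VERSION="8.8.0"
TENSORRT_VERSION="8.6.0"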

  2. Then, build the Apollo dev image with this command.

    build_docker.sh -f dev.x86_64.nvidia.dockerfile
  3. Change the dev_start.sh script to use the newly generated Docker image.

VERSION_X86_64="dev-x86_64-18.04-20221124_1708"
....
....
function determine_dev_image() {
    local docker_repo="${DOCKER_REPO}"
    local version="$1"
    # If no custom version specified
    if [[ -z "${version}" ]]; then
        if [[ "${TARGET_ARCH}" == "x86_64" ]]; then
            if [[ ${USE_AMD_GPU} == 1 ]]; then
                docker_repo="${ROCM_DOCKER_REPO}"
                version="${VERSION_ROCM_X86_64}"
            elif (($USE_NVIDIA_GPU == 1)) || (($USE_GPU_HOST == 0)); then
                if [[ "${CUSTOM_DIST}" == "testing" ]]; then
                    version="${TESTING_VERSION_X86_64}"
                else
                    version="${VERSION_X86_64}"
                fi
            fi
        elif [[ "${TARGET_ARCH}" == "aarch64" ]]; then
            version="${VERSION_AARCH64}"
        else
            error "Logic can't reach here! Please report this issue to Apollo@GitHub."
            exit 3
        fi
    fi
    DEV_IMAGE="${docker_repo}:${version}"
}

Here, I'll change VERSION_X86_64 to the tag of the newly generated image.
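For example (the tag below is just a placeholder; I'll use whatever tag the build step above actually produces):

VERSION_X86_64="dev-x86_64-18.04-<my_new_image_tag>"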

WildBeast114514 commented 1 year ago

@mgcho0608 Actually this issue cannot be solved by changing the content you mentioned above.

When the CUDA version changes, the corresponding TensorRT and libtorch APIs may also change, meaning that libraries strongly tied to CUDA are not very backward compatible.

Unfortunately, the current version of Apollo is not compatible with the TensorRT interfaces for CUDA 11.4 and above, so there is no way to fix this issue just by upgrading CUDA.

Back to your case, you can try modifying the following file to be compatible with the 4090:

$ vim /apollo/tools/bootstrap.py

Modify the _DEFAULT_CUDA_COMPUTE_CAPABILITIES variable to the following values: '3.7,5.2,6.0,6.1,7.0,7.2,7.5,8.6,8.9'
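That is, the line should read:

_DEFAULT_CUDA_COMPUTE_CAPABILITIES = '3.7,5.2,6.0,6.1,7.0,7.2,7.5,8.6,8.9'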

Finally, delete the rc file and try to build Apollo:

$ rm -f /apollo/.apollo.bazelrc
$ bash apollo.sh build_opt_gpu

Also, we are currently working on CUDA 11.4 support for Apollo, so please be patient and wait for a subsequent Apollo update!

mgcho0608 commented 1 year ago

Thanks for your answer! Unfortunately, modifying bootstrap.py didn't resolve my problem.

Here, I want to explain my current problem in more detail so I can get a useful solution. To do this, I reset my Ubuntu installation and followed the process step by step.

  1. I cloned the code at commit 9537abebf3e180e8976489bab7166da6d87efaf4 (right before support for AMD GPUs was added) and modified _DEFAULT_CUDA_COMPUTE_CAPABILITIES as @WildBeast114514 suggested.

  2. Then, I ran dev_start.sh, dev_into.sh, and sudo bash apollo.sh build_opt_gpu. During sudo bash apollo.sh build_opt_gpu, I got this error.

[Screenshot from 2023-03-29 14-37-09]

To resolve this error, I modified .apollo.bazelrc as shown below. Then I ran sudo bash apollo.sh build_opt_gpu again without error.

[Screenshot from 2023-03-29 14-38-06]

  3. In the Docker container, I ran bootstrap_lgsvl.sh and bridge.sh. After that, I ran the LGSVL simulator with the Python API.

  4. Then, I ran the python3 code in my local environment.

  5. As a result, I got this error, and this is my nvidia-smi screen.

[Screenshot from 2023-03-29 16-39-36]

[Screenshot from 2023-03-29 16-50-09]

At first, three processes tagged 'mainboard' were launched, but then they suddenly disappeared. I think those three processes were the Perception, Prediction, and Traffic Light modules.

  6. To analyze why the modules were not launched, I manually launched the perception and prediction modules and got these error messages:

mainboard -d modules/prediction/dag/prediction.dag

[Screenshot from 2023-03-29 16-53-10]

mainboard -d modules/perception/production/dag/dag_streaming_perception.dag

[Screenshot from 2023-03-29 16-55-28]

My setup is Ubuntu 20.04 with an RTX 4090. I thought those two error messages pointed to a mismatch with the GPU architecture supported by the installed toolchain (the unsupported gpu arch and nvrtc errors), so I thought upgrading the CUDA and TensorRT versions in the Docker image by rebuilding it could be the solution.
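As a quick sanity check inside the container (generic nvcc commands, nothing Apollo-specific), I can list which compute capabilities the bundled nvcc accepts; compute_89/sm_89 simply does not exist in the CUDA 11.1 toolchain, which matches the errors above:

nvcc --version
# Extract the compute_XX values nvcc's help text lists as allowed targets;
# CUDA 11.1 tops out at compute_86, so compute_89 will be missing.
nvcc --help | grep -o "compute_[0-9]*" | sort -u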

Waiting for the next patch could be a solution, but as @daohu527 mentioned, I would have to wait for a while. Any good ideas for me?

WildBeast114514 commented 1 year ago

@mgcho0608 According to the error messages, the container's current CUDA and TensorRT versions indeed do not support the 4090, and it looks like upgrading CUDA and TensorRT is the only solution.

As Apollo support for higher TensorRT versions is currently under development, you can start by adapting the perception module to higher TensorRT versions on your own. Here is a summary of some of the work that needs to be done:

  1. The DimsCHW and DimsNCHW classes were removed in TensorRT 8, and the perception module uses them in a number of places, so you need to create wrappers for them and include the header wherever they are referenced. Here is a sample wrapper:
#pragma once

#include <NvInferLegacyDims.h>

namespace nvinfer1 {

class DimsNCHW : public Dims4 {
 public:
    DimsNCHW() : Dims4() {}
    DimsNCHW(
      int32_t batch_size, int32_t channels,
      int32_t height, int32_t width)
        : Dims4(batch_size, channels, height, width) {}

    int32_t& n() {
      return d[0];
    }

    int32_t n() const {
      return d[0];
    }

    int32_t& c() {
      return d[1];
    }

    int32_t c() const {
      return d[1];
    }

    int32_t& h() {
      return d[2];
    }

    int32_t h() const {
      return d[2];
    }

    int32_t& w() {
      return d[3];
    }

    int32_t w() const {
      return d[3];
    }
};

class DimsCHW : public Dims3 {
 public:
    DimsCHW() : Dims3() {}
    DimsCHW(int32_t channels, int32_t height, int32_t width)
      : Dims3(channels, height, width) {}

    int32_t& c() {
      return d[0];
    }

    int32_t c() const {
      return d[0];
    }

    int32_t& h() {
      return d[1];
    }

    int32_t h() const {
      return d[1];
    }

    int32_t& w() {
      return d[2];
    }

    int32_t w() const {
      return d[2];
    }
};

}  // namespace nvinfer1
  2. Apollo uses the interface nvinfer1::IPlugin, which is replaced by nvinfer1::IPluginV2Ext in TensorRT 8, so you need to adapt every class that uses nvinfer1::IPlugin to nvinfer1::IPluginV2Ext. These files are in the folder modules/perception/inference/tensorrt/plugins. Here is sample code showing softmax_plugin.h adapted for both interfaces:
#pragma once

#include "modules/perception/inference/tensorrt/rt_common.h"

namespace apollo {
namespace perception {
namespace inference {

#ifndef TENSORRT_8
class SoftmaxPlugin : public nvinfer1::IPlugin {
 public:
  SoftmaxPlugin(const SoftmaxParameter &param, nvinfer1::Dims in_dims) {
    input_dims_.nbDims = in_dims.nbDims;
    for (int i = 0; i < in_dims.nbDims; i++) {
      input_dims_.d[i] = in_dims.d[i];
      input_dims_.type[i] = in_dims.type[i];
    }
    axis_ = param.axis() - 1;
    CHECK_GE(axis_, 0);
    CHECK_LE(axis_ + 1, input_dims_.nbDims);

    inner_num_ = 1;
    for (int i = axis_ + 1; i < input_dims_.nbDims; i++) {
      inner_num_ *= input_dims_.d[i];
    }
    outer_num_ = 1;
    for (int i = 0; i < axis_; i++) {
      outer_num_ *= input_dims_.d[i];
    }
    cudnnCreateTensorDescriptor(&input_desc_);
    cudnnCreateTensorDescriptor(&output_desc_);
  }

  SoftmaxPlugin() {}

  ~SoftmaxPlugin() {
    cudnnDestroyTensorDescriptor(input_desc_);
    cudnnDestroyTensorDescriptor(output_desc_);
  }
  virtual int initialize() {
    cudnnCreate(&cudnn_);  // initialize cudnn and cublas
    cublasCreate(&cublas_);
    return 0;
  }
  virtual void terminate() {
    cublasDestroy(cublas_);
    cudnnDestroy(cudnn_);
  }
  int getNbOutputs() const override { return 1; }

  nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims *inputs,
                                     int nbInputDims) override {
    nvinfer1::Dims out_dims = inputs[0];
    return out_dims;
  }

  void configure(const nvinfer1::Dims *inputDims, int nbInputs,
                 const nvinfer1::Dims *outputDims, int nbOutputs,
                 int maxBatchSize) override {
    input_dims_ = inputDims[0];
  }

  size_t getWorkspaceSize(int maxBatchSize) const override { return 0; }

  int enqueue(int batchSize, const void *const *inputs, void **outputs,
              void *workspace, cudaStream_t stream) override;

  size_t getSerializationSize() override { return 0; }

  void serialize(void *buffer) override {
    char *d = reinterpret_cast<char *>(buffer), *a = d;
    size_t size = getSerializationSize();
    CHECK_EQ(d, a + size);
  }

 private:
  cudnnHandle_t cudnn_;
  cublasHandle_t cublas_;
  nvinfer1::Dims input_dims_;
  int axis_;
  int inner_num_;
  int outer_num_;
  cudnnTensorDescriptor_t input_desc_;
  cudnnTensorDescriptor_t output_desc_;
};

#else
class SoftmaxPlugin : public nvinfer1::IPluginV2Ext {
 public:
  SoftmaxPlugin(const SoftmaxParameter &param, nvinfer1::Dims in_dims) {
    input_dims_.nbDims = in_dims.nbDims;
    for (int i = 0; i < in_dims.nbDims; i++) {
      input_dims_.d[i] = in_dims.d[i];
    }
    axis_ = param.axis() - 1;
    CHECK_GE(axis_, 0);
    CHECK_LE(axis_ + 1, input_dims_.nbDims);

    inner_num_ = 1;
    for (int i = axis_ + 1; i < input_dims_.nbDims; i++) {
      inner_num_ *= input_dims_.d[i];
    }
    outer_num_ = 1;
    for (int i = 0; i < axis_; i++) {
      outer_num_ *= input_dims_.d[i];
    }
    cudnnCreateTensorDescriptor(&input_desc_);
    cudnnCreateTensorDescriptor(&output_desc_);
  }

  SoftmaxPlugin() {}

  ~SoftmaxPlugin() {
    cudnnDestroyTensorDescriptor(input_desc_);
    cudnnDestroyTensorDescriptor(output_desc_);
  }
  virtual int32_t initialize() noexcept {
    cudnnCreate(&cudnn_);  // initialize cudnn and cublas
    cublasCreate(&cublas_);
    return 0;
  }
  virtual void terminate() noexcept {
    cublasDestroy(cublas_);
    cudnnDestroy(cudnn_);
  }
  int32_t getNbOutputs() const noexcept override { return 1; }

  nvinfer1::Dims getOutputDimensions(int32_t index,
      const nvinfer1::Dims *inputs, int32_t nbInputDims)
      noexcept override {
    nvinfer1::Dims out_dims = inputs[0];
    return out_dims;
  }

  void configureWithFormat(const nvinfer1::Dims *inputDims, int32_t nbInputs,
                 const nvinfer1::Dims *outputDims, int32_t nbOutputs,
                 nvinfer1::DataType type, nvinfer1::PluginFormat format,
                 int32_t maxBatchSize) noexcept override {
    input_dims_ = inputDims[0];
  }

  size_t getWorkspaceSize(int32_t maxBatchSize)
      const noexcept override { return 0; }

  int32_t enqueue(int32_t batchSize, const void *const *inputs,
              void *const *outputs, void *workspace, cudaStream_t stream)
              noexcept override;

  size_t getSerializationSize() const noexcept override { return 0; }

  void serialize(void *buffer) const noexcept override {
    char *d = reinterpret_cast<char *>(buffer), *a = d;
    size_t size = getSerializationSize();
    CHECK_EQ(d, a + size);
  }

  nvinfer1::AsciiChar const* getPluginType()
      const noexcept override {
    return plugin_type;
  }

  nvinfer1::AsciiChar const* getPluginVersion()
      const noexcept override {
    return plugin_version;
  }

  void setPluginNamespace(const nvinfer1::AsciiChar* libNamespace)
      noexcept override {
    plugin_namespace = const_cast<nvinfer1::AsciiChar*>(libNamespace);
  }

  nvinfer1::AsciiChar const* getPluginNamespace()
      const noexcept override {
    return const_cast<nvinfer1::AsciiChar*>(plugin_namespace);
  }

  bool supportsFormat(nvinfer1::DataType type,
      nvinfer1::PluginFormat format) const noexcept override {
    return true;
  }

  void destroy() noexcept override {
    delete this;
  }

  nvinfer1::IPluginV2Ext* clone() const noexcept override {
    SoftmaxPlugin* p = new SoftmaxPlugin();
    cudnnCreate(&(p->cudnn_));  // initialize cudnn and cublas
    cublasCreate(&(p->cublas_));
    p->axis_ = axis_;
    p->inner_num_ = inner_num_;
    p->outer_num_ = outer_num_;
    p->plugin_namespace = plugin_namespace;
    (p->input_dims_).nbDims = input_dims_.nbDims;
    for (int i = 0; i < input_dims_.nbDims; i++) {
      (p->input_dims_).d[i] = input_dims_.d[i];
    }
    cudnnCreateTensorDescriptor(&(p->input_desc_));
    cudnnCreateTensorDescriptor(&(p->output_desc_));
    return p;
  }

  bool isOutputBroadcastAcrossBatch(int32_t outputIndex,
      bool const *inputIsBroadcasted, int32_t nbInputs)
      const noexcept override {
    return false;
  }

  bool canBroadcastInputAcrossBatch(int32_t inputIndex)
      const noexcept override {
    return false;
  }

  nvinfer1::DataType getOutputDataType(int32_t index,
      nvinfer1::DataType const *inputTypes, int32_t nbInputs)
      const noexcept {
    return nvinfer1::DataType::kFLOAT;
  }

  void configurePlugin(
    nvinfer1::Dims const *inputDims, int32_t nbInputs,
    nvinfer1::Dims const *outputDims, int32_t nbOutputs,
    nvinfer1::DataType const *inputTypes,
    nvinfer1::DataType const *outputTypes,
    bool const *inputIsBroadcast, bool const *outputIsBroadcast,
    nvinfer1::PluginFormat floatFormat, int32_t maxBatchSize) noexcept {}

 private:
  cudnnHandle_t cudnn_;
  cublasHandle_t cublas_;
  nvinfer1::Dims input_dims_;
  int axis_;
  int inner_num_;
  int outer_num_;
  nvinfer1::AsciiChar* plugin_namespace;
  const nvinfer1::AsciiChar* plugin_type = "";
  const nvinfer1::AsciiChar* plugin_version = "";
  cudnnTensorDescriptor_t input_desc_;
  cudnnTensorDescriptor_t output_desc_;
};

#endif

}  // namespace inference
}  // namespace perception
}  // namespace apollo
  3. modules/perception/inference/tensorrt/rt_net.cc uses many APIs that exist only in TensorRT 7; these need to be replaced with the corresponding TensorRT 8 APIs (see the sketch after this list). You can do this by following the compilation log and the official TensorRT documentation.
  4. Upgrade CUDA, cuDNN, TensorRT, and libtorch according to the NVIDIA documentation.
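As a rough illustration for item 3 (this is only a sketch, not Apollo's actual rt_net.cc code; builder, input_tensor, and plugin are placeholder names), the typical call-site change from the TensorRT 7 builder/plugin APIs to TensorRT 8 looks like this:

#include <NvInfer.h>

// Minimal sketch: the TensorRT 7 calls shown in the comments were removed in
// TensorRT 8 and have to be replaced roughly as below.
nvinfer1::ICudaEngine* BuildEngineTrt8(nvinfer1::IBuilder* builder,
                                       nvinfer1::ITensor* input_tensor,
                                       nvinfer1::IPluginV2Ext* plugin) {
  // TensorRT 7 style (removed in TensorRT 8):
  //   auto* network = builder->createNetwork();
  //   network->addPlugin(&input_tensor, 1, *iplugin);   // nvinfer1::IPlugin
  //   return builder->buildCudaEngine(*network);

  // TensorRT 8 style: createNetworkV2 + addPluginV2 + IBuilderConfig.
  // 0U keeps the implicit-batch mode that the IPluginV2Ext plugin above
  // assumes (deprecated in TensorRT 8, but still available).
  nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0U);
  network->addPluginV2(&input_tensor, 1, *plugin);
  nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
  return builder->buildEngineWithConfig(*network, *config);
}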

Due to the complexity of the above work, we still recommend that you wait for the next Apollo upgrade.

mgcho0608 commented 1 year ago

@WildBeast114514 Thanks for your answer!

I understand that upgrading CUDA, cuDNN, and TensorRT requires much more work than I expected. You gave me instructions for modifying the perception module, but other GPU-related modules, such as prediction and planning, may need modifications too.

So, before upgrading the versions, I'm trying to modify how the core modules are built. Specifically, I suspect that some of the core modules were built with sm_89 (i.e. with the CUDA compute capability set to 8.9).

The search result below supports my guess: the core modules are executed through mainboard, and I confirmed that sm_89 is included in core_mainboard.

[Screenshot from 2023-03-30 16-08-49]

If this approach does not solve my issue then, as you mentioned, I should do the version upgrade (or wait for the next release), but that seems like a huge amount of work. Therefore, I want to check whether this approach works first; if there is any progress or any further question, I'll post it here to request assistance.
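For reference, one generic way to check which SM architectures a built binary embeds (the mainboard path below is only an example and may not match the actual build output):

# cuobjdump ships with the CUDA toolkit and lists the embedded cubin archs.
cuobjdump --list-elf /apollo/bazel-bin/cyber/mainboard/mainboard
# Cruder alternative: grep the binary's strings for sm_XX markers.
strings /apollo/bazel-bin/cyber/mainboard/mainboard | grep -o 'sm_[0-9]*' | sort -u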

WilliaJing commented 1 year ago

@mgcho0608 @WildBeast114514 @daohu527 Hi, I would like to know whether the RTX 4070 is supported now, or when you plan to release support?

daohu527 commented 10 months ago

Can you try the following method? It downgrades the compute capability the graphics card is built for, to temporarily avoid upgrading CUDA.

Add the following code to the function at https://github.com/ApolloAuto/apollo/blob/a3c851fc5844e0684b9c5108231fcc2c15cebb8e/third_party/gpus/cuda_configure.bzl#L731:

def _compute_cuda_extra_copts(repository_ctx, compute_capabilities):
    copts = []
    for capability in compute_capabilities:
        if capability > "compute_75":    # add
            capability = "compute_75"    # add

        if capability.startswith("compute_"):

MingfeiCheng commented 10 months ago

Can you try the following methods? It will try to downgrade the computing power of the graphics card to temporarily avoid upgrading CUDA.

add code in function

https://github.com/ApolloAuto/apollo/blob/a3c851fc5844e0684b9c5108231fcc2c15cebb8e/third_party/gpus/cuda_configure.bzl#L731

def _compute_cuda_extra_copts(repository_ctx, compute_capabilities):
    copts = []
    for capability in compute_capabilities:
        if capability > "compute_75":    # add
            capability = "compute_75"    # add

        if capability.startswith("compute_"):

Hi, I tried this method, and Apollo compiles successfully. But opening the prediction module with mainboard -d /apollo/modules/prediction/dag/prediction.dag triggers the following error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  nvrtc: error: invalid value for --gpu-architecture (-arch)

nvrtc compilation failed:

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)

template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

extern "C" __global__
void func_1(float* t0, float* t1, float* aten_relu_flat) {
{
  if (512 * blockIdx.x + threadIdx.x<2 ? 1 : 0) {
    aten_relu_flat[512 * blockIdx.x + threadIdx.x] = (((__ldg(t0 + (512 * blockIdx.x + threadIdx.x) % 2)) + (__ldg(t1 + (512 * blockIdx.x + threadIdx.x) % 2))<0.f ? 1 : 0) ? 0.f : (__ldg(t0 + (512 * blockIdx.x + threadIdx.x) % 2)) + (__ldg(t1 + (512 * blockIdx.x + threadIdx.x) % 2)));
  }
}
}

My computer has an RTX 4090. Could you give me some suggestions on that? Thank you!

TayYim commented 6 months ago

@mgcho0608 Hey, I just found that the official documentation has been updated with how to install Apollo 9.0.0 on a PC with a 40xx-series GPU. I tried it on my 4070 Ti machine and it works well.

See the instructions here (they're in Chinese; not sure if there is an English version): Source Install Extra steps for 40XX GPUs