ApolloAuto / apollo

An open autonomous driving platform
Apache License 2.0

Build error when compiling with an RTX 4090 GPU #14821

Closed gcx2020 closed 9 months ago

gcx2020 commented 1 year ago

(13:26:04) ERROR: /apollo/modules/perception/lidar/lib/detector/point_pillars_detection/BUILD:108:13: C++ compilation of rule '//modules/perception/lidar/lib/detector/point_pillars_detection:postprocess_cuda' failed (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -MD -MF ... (remaining 48 argument(s) skipped)
nvcc fatal : Unsupported gpu architecture 'compute_89'
(13:26:04) INFO: Elapsed time: 0.354s, Critical Path: 0.12s
(13:26:04) INFO: 111 processes: 70 remote cache hit, 38 internal, 3 local.
(13:26:04) FAILED: Build did NOT complete successfully

lovelyzzc commented 1 year ago

(16:52:18) ERROR: /apollo/modules/perception/camera/common/BUILD:104:12: C++ compilation of rule '//modules/perception/camera/common:image_data_operations' failed (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -MD -MF bazel-out/k8-opt/bin/modules/perception/camera/common/_objs/image_data_operations/image_data_operations.pic.d ... (remaining 227 argument(s) skipped)
nvcc fatal : Unsupported gpu architecture 'compute_89'
(16:52:18) INFO: Elapsed time: 24.864s, Critical Path: 24.39s
(16:52:18) INFO: 1383 processes: 244 internal, 1139 local.
(16:52:18) FAILED: Build did NOT complete successfully

I have the same problem.

gcx2020 commented 1 year ago

I have the same problem.

Have you solved it?

ScottDeng114514 commented 1 year ago

I ran into this problem too. I found that the CUDA version inside the current Apollo container is 11.1, which cannot support the 4090. CUDA 12.0 can support the 4090, but CUDA 12.0 does not support TensorRT 7, and the TensorRT major version used by Apollo's perception module is 7, so there is no way out. I have given up building the GPU version of Apollo on the 4090.
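For anyone who wants to verify these versions themselves, a quick check inside the container (a sketch; the header path assumes a Debian-style TensorRT install and may differ on your image):

# CUDA toolkit version that nvcc (and therefore the build) uses
nvcc --version
# TensorRT major version shipped in the image; alternatively: dpkg -l | grep nvinfer
grep NV_TENSORRT_MAJOR /usr/include/x86_64-linux-gnu/NvInferVersion.h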

gcx2020 commented 1 year ago

This can only be solved by waiting for the official team to upgrade the Docker image.

gcx2020 commented 1 year ago

@daohu527 When can this be fixed?

WilliaJing commented 1 year ago

+1. I'd also like to ask when the 40-series will be officially supported. Is there a timeline planned?

daohu527 commented 1 year ago

The https://github.com/ApolloAuto/apollo/tree/9.x_alpha branch has already been updated to TensorRT 8.

WilliaJing commented 1 year ago

The https://github.com/ApolloAuto/apollo/tree/9.x_alpha branch has already been updated to TensorRT 8.

Wow, thank you very much. I will try to pull that branch.

WilliaJing commented 1 year ago

@daohu527 It failed when building the perception modules.

daohu527 commented 1 year ago

Will check and give feedback soon.

daohu527 commented 1 year ago

The reason for the problem is that the CUDA version is too old to support 'compute_89'. Can you check your CUDA version?

WilliaJing commented 1 year ago

The reason for the problem is that the CUDA version is too old to support 'compute_89'. Can you check your CUDA version?

@daohu527 No, my CUDA version is the latest: 12.2.

daohu527 commented 1 year ago

Is the CUDA version inside the Apollo Docker container also the same?

WilliaJing commented 1 year ago

This was printed inside the Apollo Docker container, but I get the same output outside of it. Do I still need to install CUDA inside the Apollo Docker container?

Azure-blog commented 1 year ago

I also encountered the same problem. I have NVIDIA driver version 535.86.05 with CUDA version 12.2, and

nvcc fatal   : Unsupported gpu architecture 'compute_89'

I tried to fix it. Checking nvcc --version gives:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

and according to https://docs.nvidia.com/cuda/cuda-runtime-api/driver-vs-runtime-api.html, I think I need to upgrade the CUDA toolkit. On the page https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local I found the toolkit for Ubuntu 20.04. However, the containerized version of Apollo 9.x runs on Ubuntu 18.04 for x86, so that package does not apply.
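In other words, nvidia-smi reports the driver API version (the maximum CUDA release the driver can support, 12.2 here), while nvcc reports the toolkit actually installed inside the container (11.1), and it is the toolkit that rejects compute_89. A quick way to see both side by side:

# driver side: the maximum CUDA version the installed driver supports
nvidia-smi
# toolkit side: the compiler the Bazel build actually invokes
nvcc --version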

CesarLiu commented 12 months ago

  1. The best long-term solution is to build your own Docker image with a higher CUDA version: change the base image in the Apollo Dockerfile to whichever best fits (see https://gitlab.com/nvidia/container-images/cuda/-/tree/master/dist?ref_type=heads) and build it on your hardware. Of course, a lot of installation files would have to be adjusted, especially the packages Apollo compiles itself. Hard, but sensible.
  2. Check CUDA compatibility here: https://docs.nvidia.com/deploy/cuda-compatibility/. According to your nvidia-smi output you have driver 535.xx, which supports up to CUDA 12.2, but that doesn't mean you have a CUDA toolkit installed on your host, or that it only supports 12.2. If I understand it correctly, "CUDA 11.x | >= 450.80.02*" means that supporting CUDA 11 requires a driver newer than 450.80.02; since you have 535.xx, it should support CUDA 11.x too. How to test it? Just pull an nvidia/cuda image with CUDA 11.x and run CUDA-enabled code (PyTorch?) in that container.
  3. "nvcc fatal : Unsupported gpu architecture 'compute_89'": you know the reason already; the CUDA/nvcc version in the Apollo Docker container cannot target your GPU architecture. Use "nvcc -h | grep compute" to check the highest supported arch (CUDA 10 → 75, the CUDA 11 in the Apollo Docker image → 87). But that doesn't mean your 4090 has no backwards compatibility with previous generations: https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/ (see the sketch after this list).
  4. Referring to this issue: https://github.com/ggerganov/llama.cpp/issues/1420, I think the Apollo build just applies the best-performance compile option for your GPU, like the -arch=native mentioned there, but you can manually set one or more compatible options such as 87 or 62 for the CUDA version in the Apollo Docker image. Just my limited, maybe also wrong, understanding of this problem.
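A minimal sketch of the checks from point 3, runnable inside the Apollo container (kernel.cu stands in for any CUDA source file):

# list every architecture this nvcc can target; on the CUDA 11.x image the list
# ends before compute_89, which is exactly why the build fails
nvcc -h | grep -o 'compute_[0-9]*' | sort -u
# compile for an older, supported architecture instead of the native one; the
# 4090 can still run the result through forward compatibility
nvcc -c kernel.cu -o kernel.o \
    -gencode arch=compute_75,code=sm_75 \
    -gencode arch=compute_75,code=compute_75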
CesarLiu commented 12 months ago

https://github.com/ApolloAuto/apollo/issues/14858

daohu527 commented 10 months ago

Can you try the following method? It downgrades the compute capability target for the graphics card, to temporarily avoid upgrading CUDA.

Add the two marked lines to the function at https://github.com/ApolloAuto/apollo/blob/a3c851fc5844e0684b9c5108231fcc2c15cebb8e/third_party/gpus/cuda_configure.bzl#L731

def _compute_cuda_extra_copts(repository_ctx, compute_capabilities):
    copts = []
    for capability in compute_capabilities:
        if capability > "compute_75":    # add
            capability = "compute_75"    # add

        if capability.startswith("compute_"):
daohu527 commented 10 months ago

CUDA applications built using CUDA Toolkit 11.0 through 11.7 are compatible with the NVIDIA Ada GPU architecture as long as they are built to include kernels in Ampere-native cubin

It seems that the Ada architecture is compatible with CUDA Toolkit 11.0 through 11.7, so I don't know why there is a compilation error.

ref
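As I read the quoted passage, "Ampere-native cubin" means building with an sm_8x code target so that real machine code, not just PTX, is embedded. A hedged nvcc example of such a flag combination (my_kernel.cu is a placeholder):

# embed Ampere-native machine code (sm_86) plus compute_86 PTX; per the quote
# above, an Ada (compute_89) GPU can then run the Ampere cubin directly
nvcc -c my_kernel.cu -o my_kernel.o \
    -gencode arch=compute_86,code=sm_86 \
    -gencode arch=compute_86,code=compute_86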

wujf1517 commented 10 months ago

Can you try the following method? It downgrades the compute capability target for the graphics card, to temporarily avoid upgrading CUDA.

Add the two marked lines to the function at

https://github.com/ApolloAuto/apollo/blob/a3c851fc5844e0684b9c5108231fcc2c15cebb8e/third_party/gpus/cuda_configure.bzl#L731

def _compute_cuda_extra_copts(repository_ctx, compute_capabilities):
    copts = []
    for capability in compute_capabilities:
        if capability > "compute_75":    # add
            capability = "compute_75"    # add

        if capability.startswith("compute_"):

I encountered the same problem (Unsupported gpu architecture 'compute_89') when compiling Apollo. After trying this method the compilation succeeded, but the prediction module could not be started. The error is as follows (Ubuntu 22.04 + NVIDIA 4090):

[ps@in-dev-docker:/apollo]$ mainboard -d /apollo/modules/prediction/dag/prediction.dag
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1029 10:56:38.102371 446394 module_argument.cc:81] []command: mainboard -d /apollo/modules/prediction/dag/prediction.dag
I1029 10:56:38.102763 446394 global_data.cc:153] []host ip: 192.168.3.90
I1029 10:56:38.104391 446394 module_argument.cc:57] []binaryname is mainboard, processgroup is mainboard_default, has 1 dag conf
I1029 10:56:38.104401 446394 module_argument.cc:60] []dag_conf: /apollo/modules/prediction/dag/prediction.dag
terminate called after throwing an instance of 'std::runtime_error'
  what():  nvrtc: error: invalid value for --gpu-architecture (-arch)

nvrtc compilation failed:

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)

template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

extern "C" __global__
void func_1(float* t0, float* t1, float* aten_relu_flat) {
{
  if (512 * blockIdx.x + threadIdx.x<2 ? 1 : 0) {
    aten_relu_flat[512 * blockIdx.x + threadIdx.x] = (((__ldg(t0 + (512 * blockIdx.x + threadIdx.x) % 2)) + (__ldg(t1 + (512 * blockIdx.x + threadIdx.x) % 2))<0.f ? 1 : 0) ? 0.f : (__ldg(t0 + (512 * blockIdx.x + threadIdx.x) % 2)) + (__ldg(t1 + (512 * blockIdx.x + threadIdx.x) % 2)));
  }
}
}

Aborted (core dumped)

daohu527 commented 10 months ago

Could you then try the following? I'm not sure if it's an architecture compatibility issue.

def _compute_cuda_extra_copts(repository_ctx, compute_capabilities):
    copts = []
    for capability in compute_capabilities:
        if capability > "compute_87":    # add
            capability = "compute_87"    # add

        if capability.startswith("compute_"):
wujf1517 commented 10 months ago

Could you then try the following? I'm not sure if it's an architecture compatibility issue.

def _compute_cuda_extra_copts(repository_ctx, compute_capabilities):
    copts = []
    for capability in compute_capabilities:
        if capability > "compute_87":    # add
            capability = "compute_87"    # add

        if capability.startswith("compute_"):

Thanks for your reply. I tried "compute_87", but the error became "Unsupported gpu architecture 'compute_87'". I then changed it to "compute_86" and the same runtime problem occurred:

terminate called after throwing an instance of 'std::runtime_error'
  what():  nvrtc: error: invalid value for --gpu-architecture (-arch)

daohu527 commented 10 months ago

OK, thanks for the feedback. Have you tried running the perception module, or does only the prediction module have this error?

I will continue to look into the issue, but it looks like it may be related to torch.

wujf1517 commented 10 months ago

OK, thanks for the feedback. Have you tried running the perception module, or does only the prediction module have this error?

I will continue to look into the issue, but it looks like it may be related to torch.

I haven't run the perception module yet. I wanted to start the prediction module through Dreamview but found it wouldn't work, so I tried launching the prediction module alone with the following command; the result is the error above.

mainboard -d /apollo/modules/prediction/dag/prediction.dag

CesarLiu commented 10 months ago

I also think it's a torch problem, since the libtorch is compiled and provided by Apollo, and we don't know on which NVIDIA GPU they compiled it. Maybe you can try to build torch from source in the Apollo Docker container on your own machine. I followed these steps to compile and deploy libtorch in the Apollo Docker container for a Jetson TX2:

  1. Download the libtorch source: git clone --recursive --branch v1.11.0 http://github.com/pytorch/pytorch
  2. Install the Python deps: pip3 install --no-cache-dir PyYAML typing
  3. Set env: export TORCH_CUDA_ARCH_LIST="3.5;5.0;5.2;6.1;6.2" && export USE_QNNPACK=0 && export USE_PYTORCH_QNNPACK=0 && export PYTORCH_BUILD_NUMBER=1 && export BUILD_CAFFE2=1 && export USE_NCCL=0 && export PYTORCH_BUILD_VERSION=1.11.0 # without the leading 'v', e.g. 1.3.0 for PyTorch v1.3.0
  4. Set env for CPU support: export USE_CUDA=0 (4.1 set env for GPU support: export USE_CUDA=1)
  5. python3 setup.py install
  6. mkdir libtorch_cpu && cp -r include libtorch_cpu/ && cp -r lib libtorch_cpu/ && sudo mv libtorch_cpu /usr/local/ (6.1 for GPU: mkdir libtorch_gpu && cp -r include libtorch_gpu/ && cp -r lib libtorch_gpu/ && sudo mv libtorch_gpu /usr/local/). Attention: check which python3 version you use; here I used python3.7.

The important thing is that you choose the torch version you need and a TORCH_CUDA_ARCH_LIST that suits your card, maybe "8.6" (see the sketch after this list).
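Condensed into a single script for a 40-series card (a sketch, assuming a CUDA-enabled build of the v1.11.0 tag; versions and paths are illustrative):

# inside the Apollo dev container
git clone --recursive --branch v1.11.0 https://github.com/pytorch/pytorch
cd pytorch
pip3 install --no-cache-dir PyYAML typing
export TORCH_CUDA_ARCH_LIST="8.6"   # Ampere arch; runs on Ada via compatibility
export USE_CUDA=1 USE_QNNPACK=0 USE_PYTORCH_QNNPACK=0 USE_NCCL=0
export BUILD_CAFFE2=1 PYTORCH_BUILD_VERSION=1.11.0 PYTORCH_BUILD_NUMBER=1
python3 setup.py install
# note: depending on the build, include/ and lib/ may end up under the Python
# site-packages rather than the repo root (see the discussion further down)
mkdir libtorch_gpu && cp -r include libtorch_gpu/ && cp -r lib libtorch_gpu/
sudo mv libtorch_gpu /usr/local/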

daohu527 commented 10 months ago

Updating torch to 1.8 may solve the problem. Torch needs to dynamically compile CUDA files at runtime using NVRTC.

In view of compatibility with CUDA 11.1, the highest usable Torch version is 1.10.

ref
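One way to check whether the failing NVRTC is the one bundled with libtorch rather than the container's toolkit (a sketch; the paths follow the libtorch_gpu layout used in the Apollo container, and if the first command prints nothing, libtorch may be loading NVRTC dynamically at runtime instead):

# which NVRTC does libtorch link against?
ldd /usr/local/libtorch_gpu/lib/libtorch_cuda.so | grep -i nvrtc
# which NVRTC does the container's CUDA toolkit ship?
ls /usr/local/cuda/lib64/ | grep -i nvrtc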

wujf1517 commented 10 months ago

I also think it's a torch problem, since the libtorch is compiled and provided by Apollo […] the important thing is that you choose the torch version you need and a TORCH_CUDA_ARCH_LIST that suits your card, maybe "8.6".

I tried the method above, but it seems that libtorch did not install successfully. The specifics are as follows:

Download the libtorch source: git clone --recursive --branch v1.10.0 http://github.com/pytorch/pytorch # it's hard to download this way in China, so I downloaded it manually
Install the Python deps: pip3 install --no-cache-dir PyYAML typing
Set env: export TORCH_CUDA_ARCH_LIST="3.5;5.0;5.2;6.1;6.2" && export USE_QNNPACK=0 && export USE_PYTORCH_QNNPACK=0 && export PYTORCH_BUILD_NUMBER=1 && export BUILD_CAFFE2=1 && export USE_NCCL=0 && export PYTORCH_BUILD_VERSION=1.10.0
Set env for CPU support: export USE_CUDA=0
sudo python3 setup.py install # without sudo it reports an error
mkdir libtorch_cpu && cp -r include libtorch_cpu/ && cp -r lib libtorch_cpu/ && sudo mv libtorch_cpu /usr/local/

The specific error message is as follows:

[ps@in-dev-docker:/apollo/pytorch]$ sudo python3 setup.py install
Building wheel torch-1.10.0a0+git36449ea
-- Building version 1.10.0a0+git36449ea
cmake --build . --target install --config Release -- -j 64
[ 0%] Built target clog
[ 0%] Built target defs.bzl
[ 0%] Built target pthreadpool
......
[100%] Built target torch_python
[100%] Built target nnapi_backend
Install the project...
-- Install configuration: "Release"
running install
running build
running build_py
copying caffe2/proto/prof_dag_pb2.py -> build/lib.linux-x86_64-3.6/caffe2/proto
copying caffe2/proto/predictor_consts_pb2.py -> build/lib.linux-x86_64-3.6/caffe2/proto
.......
writing manifest file 'torch.egg-info/SOURCES.txt'
removing '/usr/local/lib/python3.6/dist-packages/torch-1.10.0a0+git36449ea-py3.6.egg-info' (and everything under it)
Copying torch.egg-info to /usr/local/lib/python3.6/dist-packages/torch-1.10.0a0+git36449ea-py3.6.egg-info
running install_scripts
Installing convert-caffe2-to-onnx script to /usr/local/bin
Installing convert-onnx-to-caffe2 script to /usr/local/bin
Installing torchrun script to /usr/local/bin
[ps@in-dev-docker:/apollo/pytorch]$ mkdir libtorch_cpu && cp -r include libtorch_cpu/ && cp -r lib libtorch_cpu/ && sudo mv libtorch_cpu /usr/local/
cp: cannot stat 'include': No such file or directory
[ps@in-dev-docker:/apollo/pytorch]$

CesarLiu commented 10 months ago

Hi, I think the build/install actually succeeded.

  1. Since you used sudo, can you check whether torch was installed somewhere, maybe in /usr/local/?
  2. Check the libtorch_gpu in the Apollo Docker container at /usr/local/libtorch_gpu: have a look at its include and lib folders, and check whether you can find the corresponding ones under /apollo/pytorch; pay attention to folders named with "linux-x86_64-3.6".
wujf1517 commented 10 months ago

/usr/local/

Thank you for your help. I copied the bin and include folders found in /usr/local/ to the libtorch_gpu folder and recompiled Apollo. The compilation succeeded, but the prediction module could not be started. The error information is as follows:

[ps@in-dev-docker:/apollo]$ mainboard -d /apollo/modules/prediction/dag/prediction.dag
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1031 21:15:28.325443 1690631 module_argument.cc:81] []command: mainboard -d /apollo/modules/prediction/dag/prediction.dag
I1031 21:15:28.325830 1690631 global_data.cc:153] []host ip: 192.168.3.90
I1031 21:15:28.327428 1690631 module_argument.cc:57] []binaryname is mainboard, processgroup is mainboard_default, has 1 dag conf
I1031 21:15:28.327440 1690631 module_argument.cc:60] []dag_conf: /apollo/modules/prediction/dag/prediction.dag
E1031 21:15:28.339660 1690631 class_loader_utility.cc:218] [mainboard]LibraryLoadException: libc10.so: cannot open shared object file: No such file or directory
E1031 21:15:28.339690 1690631 class_loader_utility.cc:234] [mainboard]shared library failed: /apollo/bazel-bin/modules/prediction/libprediction_component.so
E1031 21:15:28.339710 1690631 class_loader_manager.h:70] [mainboard]Invalid class name: PredictionComponent
E1031 21:15:28.339725 1690631 module_controller.cc:67] [mainboard]Failed to load module: /apollo/modules/prediction/dag/prediction.dag
E1031 21:15:28.339735 1690631 class_loader_utility.cc:256] [mainboard]Attempt to UnloadLibrary lib, but can't find lib: /apollo/bazel-bin/modules/prediction/libprediction_component.so
E1031 21:15:28.339745 1690631 mainboard.cc:39] [mainboard]module start error.
[ps@in-dev-docker:/apollo]$

CesarLiu commented 10 months ago

See https://stackoverflow.com/questions/65710713/importerror-libc10-so-cannot-open-shared-object-file-no-such-file-or-director. But first, please make sure you have deleted the original Apollo-provided libtorch installation in the Apollo Docker container under /usr/local/libtorch_gpu (installed by https://github.com/ApolloAuto/apollo/blob/master/docker/build/installers/install_libtorch.sh), then copy the torch you compiled yourself into the /usr/local/libtorch_gpu dir, and afterwards run "ldconfig". I'm not sure whether you did it this way for your last comment.
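Spelled out as commands (a sketch, assuming the self-built copy ended up at /apollo/pytorch/libtorch_gpu as in the comments above):

# remove the Apollo-provided libtorch and swap in the self-built one
sudo rm -rf /usr/local/libtorch_gpu
sudo cp -r /apollo/pytorch/libtorch_gpu /usr/local/
# refresh the dynamic linker cache so libc10.so and friends can be found
sudo ldconfig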

CesarLiu commented 10 months ago

https://github.com/ApolloAuto/apollo/issues/14307

wujf1517 commented 10 months ago

have deleted the original apollo-provided libtorch installation in apollo docker container in the dir of /usr/local/libtorch_gpu, which was installed

Yes, I have deleted the original Apollo-provided libtorch installation in the Apollo Docker container under /usr/local/libtorch_gpu, and then copied the torch I compiled myself into the /usr/local/libtorch_gpu dir. But I don't understand how to run "ldconfig"; I just ran "./apollo.sh build".

MingfeiCheng commented 9 months ago

Hi, I also encountered the same problem when running mainboard -d /apollo/modules/prediction/dag/prediction.dag on an RTX 4090.

The error is:

terminate called after throwing an instance of 'std::runtime_error'
  what():  nvrtc: error: invalid value for --gpu-architecture (-arch)

nvrtc compilation failed:

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)

template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

extern "C" __global__
void func_1(float* t0, float* t1, float* aten_relu_flat) {
{
  if (512 * blockIdx.x + threadIdx.x<2 ? 1 : 0) {
    aten_relu_flat[512 * blockIdx.x + threadIdx.x] = (((__ldg(t0 + (512 * blockIdx.x + threadIdx.x) % 2)) + (__ldg(t1 + (512 * blockIdx.x + threadIdx.x) % 2))<0.f ? 1 : 0) ? 0.f : (__ldg(t0 + (512 * blockIdx.x + threadIdx.x) % 2)) + (__ldg(t1 + (512 * blockIdx.x + threadIdx.x) % 2)));
  }
}
}

I also tried updating /usr/local/libtorch_cpu and /usr/local/libtorch_gpu to the 1.8 version, but it still doesn't work... Have you found a solution for this issue? Thanks!

daohu527 commented 9 months ago

@CesarLiu @lovelyzzc @WilliaJing @Azure-blog @gcx2020 We have released a new image that supports 4090 card. You can try below steps.

4090 card support

NVIDIA driver version >= 520.61.05 is required. If your driver is older than that, it needs to be upgraded.

Replace docker image

Modify VERSION_X86_64 image version in docker/scripts/dev_start.sh

VERSION_X86_64="dev-x86_64-18.04-20231128_2222"

Start docker and enter docker

bash docker/scripts/dev_start.sh
bash docker/scripts/dev_into.sh

Modify the third-party library download links

Modify third_party/centerpoint_infer_op/workspace.bzl as below

"""Loads the paddlelite library"""

# Sanitize a dependency so that it works correctly from code that includes
# Apollo as a submodule.
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

def clean_dep(dep):
    return str(Label(dep))

def repo():
    http_archive(
        name = "centerpoint_infer_op-x86_64",
        sha256 = "038470fc2e741ebc43aefe365fc23400bc162c1b4cbb74d8c8019f84f2498190",
        strip_prefix = "centerpoint_infer_op",
        urls = ["https://apollo-pkg-beta.bj.bcebos.com/archive/centerpoint_infer_op_cu118.tar.gz"],
    )

    http_archive(
        name = "centerpoint_infer_op-aarch64",
        sha256 = "e7c933db4237399980c5217fa6a81dff622b00e3a23f0a1deb859743f7977fc1",
        strip_prefix = "centerpoint_infer_op",
        urls = ["https://apollo-pkg-beta.bj.bcebos.com/archive/centerpoint_infer_op-linux-aarch64-1.0.0.tar.gz"],
    )

Modify third_party/paddleinference/workspace.bzl as below

"""Loads the paddlelite library"""

# Sanitize a dependency so that it works correctly from code that includes
# Apollo as a submodule.
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

def clean_dep(dep):
    return str(Label(dep))

def repo():
    http_archive(
        name = "paddleinference-x86_64",
        sha256 = "7498df1f9bbaf5580c289a67920eea1a975311764c4b12a62c93b33a081e7520",
        strip_prefix = "paddleinference",
        urls = ["https://apollo-pkg-beta.cdn.bcebos.com/archive/paddleinference-cu118-x86.tar.gz"],
    )

    http_archive(
        name = "paddleinference-aarch64",
        sha256 = "048d1d7799ffdd7bd8876e33bc68f28c3af911ff923c10b362340bd83ded04b3",
        strip_prefix = "paddleinference",
        urls = ["https://apollo-pkg-beta.bj.bcebos.com/archive/paddleinference-linux-aarch64-1.0.0.tar.gz"],
    )

compile

First check whether the .apollo.bazelrc file exists in the workspace. If it exists, delete it first.

Disable the macro in modules/perception/common/inference/tensorrt/rt_legacy.h

// #ifdef __aarch64__

// #endif

build perception module

./apollo.sh build_opt_gpu perception
ScottDeng114514 commented 9 months ago

@daohu527

Thanks for your support! I successfully built all modules on my 4090 machine. But when I launch the perception modules, some errors occur:

cyber_launch start modules/perception/launch/perception_lidar.launch
[mainboard]Failed to get model path of center_point please check if model has been installed or APOLLO_MODEL_PATH environment variable has been set correctly.

cyber_launch start modules/perception/launch/perception_trafficlight.launch
[mainboard]Failed to get model path of tl_detection_caffe please check if model has been installed or APOLLO_MODEL_PATH environment variable has been set correctly.

cyber_launch start modules/perception/launch/perception_camera_3d.launch
[mainboard]Failed to get model path of smoke_torch please check if model has been installed or APOLLO_MODEL_PATH environment variable has been set correctly.

cyber_launch start modules/perception/launch/perception_camera_2d.launch
terminate with error

cyber_launch start modules/perception/launch/perception_lane.launch
[lane ] E1205 11:27:36.111732 3032414 file.cc:115] [mainboard]File [perception/lane_detection/data/lane.pb.txt] does not exist!
[lane ] E1205 11:27:36.111740 3032414 lane_detection_component.cc:62] [perception]Read config failed: perception/lane_detection/data/lane.pb.txt
[lane ] E1205 11:27:36.111743 3032414 util.h:147] [perception]InitCameraFrames failed.
[lane ] E1205 11:27:36.111747 3032414 component.h:155] [mainboard]Component Init() failed.
[lane ] E1205 11:27:36.111804 3032414 module_controller.cc:69] [mainboard]Failed to load module: /apollo/modules/perception/lane_detection/dag/lane_detection.dag
[lane ] E1205 11:27:36.111817 3032414 mainboard.cc:39] [mainboard]module start error.

Only perception_radar.launch works. Any suggestions, please?

daohu527 commented 9 months ago

Let's solve these one by one:

  1. If it prompts that APOLLO_MODEL_PATH cannot be found, you need to install the model files first; the link is here: https://github.com/ApolloAuto/apollo/discussions/15212 (see the sketch below for setting the variable once the models are in place).
  2. perception_lane.launch: this module is broken and we haven't fixed it yet; you will have to wait.
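If the model files are already on disk and only the environment variable is missing, something like the following should work; the path here is purely a placeholder for wherever the models were actually installed:

# hypothetical install location; point the variable at your real model directory
export APOLLO_MODEL_PATH=/opt/apollo/neo/share/models
cyber_launch start modules/perception/launch/perception_lidar.launch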
xuezhen5267 commented 7 months ago

@CesarLiu @lovelyzzc @WilliaJing @Azure-blog @gcx2020 We have released a new image that supports 4090 card. You can try below steps. […]

When I completed the modifications and ran the build, an error occurred as follows; could you help fix it?

(16:31:09) ERROR: /apollo/.cache/bazel/540135163923dd7d5820f3ee4b306b32/external/local_config_tensorrt/BUILD:43:8: Executing genrule @local_config_tensorrt//:tensorrt_lib failed: (Exit 1): bash failed: error executing command /bin/bash -c ... (remaining 1 argument skipped)
cp: cannot stat '/usr/lib/x86_64-linux-gnu/libnvinfer.so.7': No such file or directory
(16:31:09) INFO: Elapsed time: 21.648s, Critical Path: 11.17s
(16:31:09) INFO: 248 processes: 113 internal, 135 local.
(16:31:09) FAILED: Build did NOT complete successfully

The following is the output of nvidia-smi inside the container:

| NVIDIA-SMI 535.146.02    Driver Version: 535.146.02    CUDA Version: 12.2 |
| 0  NVIDIA GeForce RTX 4070    Off | 00000000:01:00.0 On | N/A |

hellojql commented 6 months ago

@CesarLiu @lovelyzzc @WilliaJing @Azure-blog @gcx2020 We have released a new image that supports 4090 card. You can try below steps. […]

When I completed the modifications and ran the build, an error occurred as follows […] cp: cannot stat '/usr/lib/x86_64-linux-gnu/libnvinfer.so.7': No such file or directory […]

TF_TENSORRT_VERSION is not specified in the container, so "config = find_cuda_config(repository_ctx, find_cuda_config_path, ["tensorrt"]); trt_version = config["tensorrt_version"]" gets trt_version = 7, even though TensorRT 8 is installed in the container. You can fix it by running export TF_TENSORRT_VERSION="8.5.2" before the build.
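Put together, the fix inside the container would look something like this (the version string should match whichever libnvinfer is actually installed):

# confirm which TensorRT is present (here a libnvinfer.so.8.x)
ls /usr/lib/x86_64-linux-gnu/libnvinfer.so.*
# tell the TensorRT config probe which version to pick up, then rebuild
export TF_TENSORRT_VERSION="8.5.2"
./apollo.sh build_opt_gpu perception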