Error when running rangenet.launch

yuyaoliang commented 1 year ago

My setup is Ubuntu 20.04, GTX 950M, CUDA 11.3, cuDNN 8.8.0, TensorRT 8.4.1.5, and libtorch 1.10. The build succeeds and rosbag.launch runs without problems, but rangenet.launch fails with the following errors:

[07/28/2023-10:51:11] [W] - Subnormal FP16 values detected.
[07/28/2023-10:51:11] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
[07/28/2023-10:51:11] [W] Weights [name=node_of_582 + node_of_583.weight] had the following issues when converted to FP16:
[07/28/2023-10:51:11] [W] - Subnormal FP16 values detected.
[07/28/2023-10:51:11] [W] - Values less than smallest positive FP16 Subnormal value detected. Converting to FP16 minimum subnormalized value.
[07/28/2023-10:51:11] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
[07/28/2023-10:51:11] [W] Weights [name=node_of_585 + node_of_586.weight] had the following issues when converted to FP16:
[07/28/2023-10:51:11] [W] - Subnormal FP16 values detected.
[07/28/2023-10:51:11] [W] - Values less than smallest positive FP16 Subnormal value detected. Converting to FP16 minimum subnormalized value.
[07/28/2023-10:51:11] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
[07/28/2023-10:51:11] [W] Weights [name=node_of_585 + node_of_586.bias] had the following issues when converted to FP16:
[07/28/2023-10:51:11] [W] - Subnormal FP16 values detected.
[07/28/2023-10:51:11] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
[07/28/2023-10:51:22] [E] 1: [wrapper.cpp::plainGemm::206] Error Code 1: Cublas (CUBLAS_STATUS_NOT_SUPPORTED)
[07/28/2023-10:51:22] [E] 2: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
terminate called after throwing an instance of 'std::runtime_error'
what(): failed to build tensorrt engine
[ros1_demo-2] process has died [pid 98179, exit code -6, cmd /home/liangyuyao/Lidar/rangetnet_pp/devel/lib/rangenet_pp/ros1_demo __name:=ros1_demo __log:=/home/liangyuyao/.ros/log/5d531702-2cf1-11ee-ba9b-bdc9299ca300/ros1_demo-2.log].
log file: /home/liangyuyao/.ros/log/5d531702-2cf1-11ee-ba9b-bdc9299ca300/ros1_demo-2*.log

How can I fix this? I have already updated model_dir in my launch file accordingly.

Natsu-Akatsuki commented 1 year ago

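1) First, TensorRT generates an engine file with the .trt suffix. Your what(): failed to build tensorrt engine message means the engine was not generated successfully.
2) Second, the Error Code 1: Cublas (CUBLAS_STATUS_NOT_SUPPORTED) entry in your log is not a problem on my side. It is quite likely that something is wrong with the environment you configured.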

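PS: I misread the issue this morning. I thought the model file path could not be found, but the build actually died before the engine was ever generated.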
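For readers triaging a similar log, the reasoning above (the [W] FP16 entries are warnings, while the [E] entries are what abort the engine build) can be checked mechanically. The following is a minimal sketch, not part of the repository: the helper names `split_entries` and `fatal_errors` are hypothetical, and the sample string is a shortened excerpt of the log posted in this thread.

```python
import re
from collections import Counter

# Pattern for TensorRT-style entries such as
# "[07/28/2023-10:51:22] [E] 1: [wrapper.cpp::plainGemm::206] Error Code 1: ..."
TIMESTAMP = r"\[\d{2}/\d{2}/\d{4}-\d{2}:\d{2}:\d{2}\]"
LOG_ENTRY = re.compile(rf"({TIMESTAMP}) \[([A-Z])\] (.*)", re.DOTALL)


def split_entries(raw):
    """Split a log that lost its newlines into (timestamp, severity, message)."""
    # Split in front of every "[timestamp] [X]" marker (zero-width lookahead).
    chunks = re.split(rf"(?={TIMESTAMP} \[[A-Z]\])", raw)
    entries = []
    for chunk in chunks:
        match = LOG_ENTRY.match(chunk.strip())
        if match:
            timestamp, severity, message = match.groups()
            entries.append((timestamp, severity, message.strip()))
    return entries


def fatal_errors(entries):
    """Return only the messages logged with severity E (error)."""
    return [message for _, severity, message in entries if severity == "E"]


# Shortened excerpt of the log from this thread.
sample = (
    "[07/28/2023-10:51:11] [W] - Subnormal FP16 values detected. "
    "[07/28/2023-10:51:22] [E] 1: [wrapper.cpp::plainGemm::206] "
    "Error Code 1: Cublas (CUBLAS_STATUS_NOT_SUPPORTED) "
    "[07/28/2023-10:51:22] [E] 2: [builder.cpp::buildSerializedNetwork::636] "
    "Error Code 2: Internal Error (Assertion engine != nullptr failed. )"
)

entries = split_entries(sample)
counts = Counter(severity for _, severity, _ in entries)
for message in fatal_errors(entries):
    print(message)
```

Filtering on severity this way makes it easier to see that the first error entry is the Cublas one and that the failed assertion follows from it.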

Natsu-Akatsuki commented 1 year ago

Assuming you build with ROS1 catkin build, please post logs/rangenet_pp/build.cmake.log.

yuyaoliang commented 1 year ago

Thanks for the reply. Here is my build.cmake.log, please take a look:

Not searching for unused variables given on the command line.
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found version "11.3")
-- Using CATKIN_DEVEL_PREFIX: /home/liangyuyao/Lidar/rangenet_pp/devel/.private/rangenet_pp
-- Using CMAKE_PREFIX_PATH: /home/liangyuyao/Lidar/rangenet_pp/devel;/opt/ros/noetic
-- This workspace overlays: /home/liangyuyao/Lidar/rangenet_pp/devel;/opt/ros/noetic
-- Found PythonInterp: /home/liangyuyao/anaconda3/bin/python3 (found suitable version "3.9.13", minimum required is "3")
-- Using PYTHON_EXECUTABLE: /home/liangyuyao/anaconda3/bin/python3
-- Using Debian Python package layout
-- Found PY_em: /home/liangyuyao/anaconda3/lib/python3.9/site-packages/em.py
-- Using empy: /home/liangyuyao/anaconda3/lib/python3.9/site-packages/em.py
-- Using CATKIN_ENABLE_TESTING: ON
-- Call enable_testing()
-- Using CATKIN_TEST_RESULTS_DIR: /home/liangyuyao/Lidar/rangenet_pp/build/rangenet_pp/test_results
-- Forcing gtest/gmock from source, though one was otherwise available.
-- Found gtest sources under '/usr/src/googletest': gtests will be built
-- Found gmock sources under '/usr/src/googletest': gmock will be built
-- Found PythonInterp: /home/liangyuyao/anaconda3/bin/python3 (found version "3.9.13")
-- Using Python nosetests: /usr/bin/nosetests3
-- catkin 0.8.10
-- BUILD_SHARED_LIBS is on
-- Using these message generators: gencpp;geneus;genlisp;gennodejs;genpy
-- [INFO] CMAKE_BUILD_TYPE：
-- [INFO] ROS1 is available!
-- Found CUDA: /usr/local/cuda (found version "11.3")
-- Caffe2: CUDA detected: 11.3
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.3
-- Found CUDNN: /usr/local/cuda/lib64/libcudnn.so
-- Found cuDNN: v8.8.0 (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn.so)
-- /usr/local/cuda/lib64/libnvrtc.so shorthash is 8aa72235
-- Automatic GPU detection failed. Building for common architectures.
-- Autodetected CUDA architecture(s): 3.5;5.0;5.2;6.0;6.1;7.0;7.5;8.0;8.6;8.6+PTX
-- Added CUDA NVCC flags for: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_86,code=compute_86
-- Found Torch: /home/liangyuyao/software/libtorch/lib/libtorch.so
-- Checking for module 'eigen3'
-- Found eigen3, version 3.3.7
-- Found Eigen: /usr/include/eigen3 (Required is at least version "3.1")
-- Eigen found (include: /usr/include/eigen3, version: 3.3.7)
-- Checking for module 'flann'
-- Found flann, version 1.9.1
-- Found FLANN: /usr/lib/x86_64-linux-gnu/libflann_cpp.so
-- The imported target "vtkParseOGLExt" references the file "/usr/bin/vtkParseOGLExt-7.1" but this file does not exist. Possible reasons include:

The file was deleted, renamed, or moved to another location.
An install or uninstall procedure did not complete successfully.
The installation package was faulty and contained "/usr/lib/cmake/vtk-7.1/VTKTargets.cmake" but not all the files it references.

-- The imported target "vtkRenderingPythonTkWidgets" references the file "/usr/lib/x86_64-linux-gnu/libvtkRenderingPythonTkWidgets.so" but this file does not exist. Possible reasons include:

The file was deleted, renamed, or moved to another location.
An install or uninstall procedure did not complete successfully.
The installation package was faulty and contained "/usr/lib/cmake/vtk-7.1/VTKTargets.cmake" but not all the files it references.

-- The imported target "vtk" references the file "/usr/bin/vtk" but this file does not exist. Possible reasons include:

The file was deleted, renamed, or moved to another location.
An install or uninstall procedure did not complete successfully.
The installation package was faulty and contained "/usr/lib/cmake/vtk-7.1/VTKTargets.cmake" but not all the files it references.

-- The imported target "pvtk" references the file "/usr/bin/pvtk" but this file does not exist. Possible reasons include:

The file was deleted, renamed, or moved to another location.
An install or uninstall procedure did not complete successfully.
The installation package was faulty and contained "/usr/lib/cmake/vtk-7.1/VTKTargets.cmake" but not all the files it references.

-- Checking for module 'libusb-1.0'
-- Found libusb-1.0, version 1.0.23
-- Found USB_10: /usr/lib/x86_64-linux-gnu/libusb-1.0.so
-- OpenNI found (include: /usr/include/ni, lib: /usr/lib/libOpenNI.so)
-- OpenNI2 found (include: /usr/include/openni2, lib: /usr/lib/libOpenNI2.so)
-- Found libusb-1.0: /usr/include
-- OpenNI found (include: /usr/include/ni, lib: /usr/lib/libOpenNI.so)
-- OpenNI2 found (include: /usr/include/openni2, lib: /usr/lib/libOpenNI2.so)
-- Found Qhull: optimized;/usr/lib/x86_64-linux-gnu/libqhull.so;debug;/usr/lib/x86_64-linux-gnu/libqhull.so
-- QHULL found (include: /usr/include, lib: optimized;/usr/lib/x86_64-linux-gnu/libqhull.so;debug;/usr/lib/x86_64-linux-gnu/libqhull.so)
-- OpenNI found (include: /usr/include/ni, lib: /usr/lib/libOpenNI.so)
-- Found PCL_COMMON: /usr/lib/x86_64-linux-gnu/libpcl_common.so
-- Found PCL_KDTREE: /usr/lib/x86_64-linux-gnu/libpcl_kdtree.so
-- Found PCL_OCTREE: /usr/lib/x86_64-linux-gnu/libpcl_octree.so
-- Found PCL_SEARCH: /usr/lib/x86_64-linux-gnu/libpcl_search.so
-- Found PCL_SAMPLE_CONSENSUS: /usr/lib/x86_64-linux-gnu/libpcl_sample_consensus.so
-- Found PCL_FILTERS: /usr/lib/x86_64-linux-gnu/libpcl_filters.so
-- Found PCL_2D: /usr/include/pcl-1.10
-- Found PCL_GEOMETRY: /usr/include/pcl-1.10
-- Found PCL_IO: /usr/lib/x86_64-linux-gnu/libpcl_io.so
-- Found PCL_FEATURES: /usr/lib/x86_64-linux-gnu/libpcl_features.so
-- Found PCL_ML: /usr/lib/x86_64-linux-gnu/libpcl_ml.so
-- Found PCL_SEGMENTATION: /usr/lib/x86_64-linux-gnu/libpcl_segmentation.so
-- Found PCL_VISUALIZATION: /usr/lib/x86_64-linux-gnu/libpcl_visualization.so
-- Found PCL_SURFACE: /usr/lib/x86_64-linux-gnu/libpcl_surface.so
-- Found PCL_REGISTRATION: /usr/lib/x86_64-linux-gnu/libpcl_registration.so
-- Found PCL_KEYPOINTS: /usr/lib/x86_64-linux-gnu/libpcl_keypoints.so
-- Found PCL_TRACKING: /usr/lib/x86_64-linux-gnu/libpcl_tracking.so
-- Found PCL_RECOGNITION: /usr/lib/x86_64-linux-gnu/libpcl_recognition.so
-- Found PCL_STEREO: /usr/lib/x86_64-linux-gnu/libpcl_stereo.so
-- Found PCL_APPS: /usr/lib/x86_64-linux-gnu/libpcl_apps.so
-- Found PCL_IN_HAND_SCANNER: /usr/include/pcl-1.10
-- Found PCL_POINT_CLOUD_EDITOR: /usr/include/pcl-1.10
-- Found PCL_OUTOFCORE: /usr/lib/x86_64-linux-gnu/libpcl_outofcore.so
-- Found PCL_PEOPLE: /usr/lib/x86_64-linux-gnu/libpcl_people.so
-- [INFO] PCL_VERSION is 1.10.0
-- [INFO] CUDA is available!
-- [INFO] CUDA Libs: /usr/local/cuda/lib64/libcudart.so
-- [INFO] CUDA Headers: /usr/local/cuda/include
-- [INFO] CUDNN is available!
-- [INFO] CUDNN_LIBRARY: /usr/local/cuda/lib64/libcudnn.so
-- [INFO] TensorRT is not installed through DEB package, should specify the environment parameter TENSORRT_DIR explicitly
-- [INFO] NVINFER: /home/liangyuyao/software/TensorRT-8.4.1.5/lib/libnvinfer.so
-- [INFO] NVPARSERS: /home/liangyuyao/software/TensorRT-8.4.1.5/lib/libnvparsers.so
-- [INFO] NVINFER_PLUGIN: /home/liangyuyao/software/TensorRT-8.4.1.5/lib/libnvinfer_plugin.so
-- [INFO] NVONNX_PARSER: /home/liangyuyao/software/TensorRT-8.4.1.5/lib/libnvonnxparser.so
-- [INFO] TensorRT is available!
-- Configuring done
-- Generating done
-- Build files have been written to: /home/liangyuyao/Lidar/rangenet_pp/build/rangenet_pp

Natsu-Akatsuki commented 1 year ago

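Or the GPU memory is insufficient.
1) The second log shows that the environment configuration is fine.
2) The first log shows that the error occurs while the engine is being optimized, after which the process exits. Please check whether your GPU has enough memory to build the engine. The documentation says roughly 3 GB, but that is probably the initial figure; once optimization has progressed far enough, usage climbs to about 5 GB.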

PS: Use watch -c nvidia-smi to check memory usage while the program is running.

yuyaoliang commented 1 year ago

Thanks for the reply. I checked the memory usage:

| 0 NVIDIA GeForce GTX 950M Off | 00000000:01:00.0 Off | N/A |
| N/A 51C P8 N/A / 200W | 1486MiB / 2048MiB | 99% Default |
| | | N/A |

The program terminated at roughly 1500 MB, and the terminal showed the following errors:

[07/29/2023-16:52:12] [E] 1: [wrapper.cpp::plainGemm::206] Error Code 1: Cublas (CUBLAS_STATUS_NOT_SUPPORTED)
[07/29/2023-16:52:12] [E] 2: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
terminate called after throwing an instance of 'std::runtime_error'
what(): failed to build tensorrt engine
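The reading above can be compared with the memory estimates given earlier in the thread (about 3 GB at first, rising to about 5 GB). The sketch below is only an illustration: the helper names are hypothetical, the estimates are treated as GiB (an assumption), and the sample line reproduces the 1486 MiB / 2048 MiB figures in the CSV layout that `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` prints.

```python
import subprocess

MIB_PER_GIB = 1024


def parse_memory_csv(line):
    """Parse one 'used, total' line (values in MiB) from nvidia-smi CSV output."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def query_gpu_memory(index=0):
    """Ask nvidia-smi for used/total memory of one GPU, in MiB.

    Requires an NVIDIA driver; it is defined here but not called below.
    """
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--id={index}",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_memory_csv(result.stdout.strip().splitlines()[0])


def shortfall_mib(total_mib, required_gib):
    """MiB missing to reach required_gib; 0 if the card is large enough."""
    return max(int(required_gib * MIB_PER_GIB) - total_mib, 0)


# Sample line using the figures reported in this thread.
used, total = parse_memory_csv("1486, 2048")
for required_gib in (3, 5):
    missing = shortfall_mib(total, required_gib)
    print(f"need ~{required_gib} GiB, card has {total} MiB, short by {missing} MiB")
```

On a machine with an NVIDIA driver, `query_gpu_memory()` could replace the sample line. With these figures the card falls short of even the lower estimate.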

Natsu-Akatsuki commented 1 year ago

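Yes, then it is most likely that the GPU simply does not have enough memory. One option is to spend money and fix it on the hardware side. On my side, I will look into whether there is a way to reduce memory usage during optimization...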
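One knob that exists for this in TensorRT 8.4 is the builder workspace limit: `IBuilderConfig::setMemoryPoolLimit` in C++, or `set_memory_pool_limit` with `trt.MemoryPoolType.WORKSPACE` in the Python API. The sketch below is not the maintainer's solution, and whether a cap would be enough to build the engine on a 2 GiB card is not established in this thread. The helper names, the 256 MiB reserve, and the 1800 MiB free-memory figure are all hypothetical, and a stand-in config object is used so the example runs without TensorRT installed.

```python
def workspace_cap_bytes(free_mib, reserve_mib=256):
    """Bytes to allow as builder workspace, leaving reserve_mib untouched."""
    return max(free_mib - reserve_mib, 0) * 1024 * 1024


def apply_workspace_cap(config, pool_type, free_mib, reserve_mib=256):
    """Set the workspace pool limit on a builder config and return the cap.

    With real TensorRT, `config` would be an IBuilderConfig and `pool_type`
    would be trt.MemoryPoolType.WORKSPACE.
    """
    cap = workspace_cap_bytes(free_mib, reserve_mib)
    config.set_memory_pool_limit(pool_type, cap)
    return cap


class FakeBuilderConfig:
    """Stand-in for IBuilderConfig that only records the requested limits."""

    def __init__(self):
        self.limits = {}

    def set_memory_pool_limit(self, pool_type, size_bytes):
        self.limits[pool_type] = size_bytes


# Hypothetical numbers: 1800 MiB free before the build, 256 MiB kept in reserve.
config = FakeBuilderConfig()
cap = apply_workspace_cap(config, "WORKSPACE", free_mib=1800)
print(f"workspace capped at {cap // (1024 * 1024)} MiB")
```

A lower workspace limit restricts which tactics the builder may try, so it can make the build fail for other reasons or produce a slower engine. Treat it as an experiment.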

yuyaoliang commented 1 year ago

OK, thanks a lot!


Natsu-Akatsuki / RangeNetTrt8

Error when running rangenet.launch #9