Natsu-Akatsuki / RangeNetTrt8

TensorRT 8, CUDA, and LibTorch implementation of RangeNet++
MIT License
44 stars, 9 forks

Error when running rangenet.launch #9

Closed: yuyaoliang closed this issue 1 year ago

yuyaoliang commented 1 year ago

My setup is Ubuntu 20.04, GTX 950M, CUDA 11.3, cuDNN 8.8.0, TensorRT 8.4.1.5, and LibTorch 1.10. The build succeeds and rosbag.launch runs without problems, but running rangenet.launch reports the following errors:

[07/28/2023-10:51:11] [W] - Subnormal FP16 values detected.
[07/28/2023-10:51:11] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
[07/28/2023-10:51:11] [W] Weights [name=node_of_582 + node_of_583.weight] had the following issues when converted to FP16:
[07/28/2023-10:51:11] [W] - Subnormal FP16 values detected.
[07/28/2023-10:51:11] [W] - Values less than smallest positive FP16 Subnormal value detected. Converting to FP16 minimum subnormalized value.
[07/28/2023-10:51:11] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
[07/28/2023-10:51:11] [W] Weights [name=node_of_585 + node_of_586.weight] had the following issues when converted to FP16:
[07/28/2023-10:51:11] [W] - Subnormal FP16 values detected.
[07/28/2023-10:51:11] [W] - Values less than smallest positive FP16 Subnormal value detected. Converting to FP16 minimum subnormalized value.
[07/28/2023-10:51:11] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
[07/28/2023-10:51:11] [W] Weights [name=node_of_585 + node_of_586.bias] had the following issues when converted to FP16:
[07/28/2023-10:51:11] [W] - Subnormal FP16 values detected.
[07/28/2023-10:51:11] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
[07/28/2023-10:51:22] [E] 1: [wrapper.cpp::plainGemm::206] Error Code 1: Cublas (CUBLAS_STATUS_NOT_SUPPORTED)
[07/28/2023-10:51:22] [E] 2: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
terminate called after throwing an instance of 'std::runtime_error'
what(): failed to build tensorrt engine
[ros1_demo-2] process has died [pid 98179, exit code -6, cmd /home/liangyuyao/Lidar/rangetnet_pp/devel/lib/rangenet_pp/ros1_demo __name:=ros1_demo __log:=/home/liangyuyao/.ros/log/5d531702-2cf1-11ee-ba9b-bdc9299ca300/ros1_demo-2.log].
log file: /home/liangyuyao/.ros/log/5d531702-2cf1-11ee-ba9b-bdc9299ca300/ros1_demo-2*.log

How can I fix this? I have already updated model_dir in my launch file accordingly.

Natsu-Akatsuki commented 1 year ago

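1) First, TensorRT generates an engine file with the .trt suffix. Your `what(): failed to build tensorrt engine` means the engine was not generated successfully.
2) Looking at your log, `Error Code 1: Cublas (CUBLAS_STATUS_NOT_SUPPORTED)` is not a problem on my side. It is quite likely that something is wrong with the environment you configured.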
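One way to make this reading of the log concrete is to split the run-on output into entries and keep only the `[E]` lines, since the `[W]` FP16 entries are warnings rather than the failure itself. The Python sketch below assumes the `[MM/DD/YYYY-HH:MM:SS] [severity]` prefix seen in the pasted log; `split_entries` and `errors_only` are hypothetical helper names for this illustration, not part of TensorRT or this repository.

```python
import re

# Matches the prefix of each log entry, e.g. "[07/28/2023-10:51:22] [E] ".
# Assumption: every entry starts with this timestamp + severity pattern.
LOG_ENTRY = re.compile(r"\[(\d{2}/\d{2}/\d{4}-\d{2}:\d{2}:\d{2})\] \[([A-Z])\] ")


def split_entries(log_text):
    """Split a run-on log into (timestamp, severity, message) tuples."""
    matches = list(LOG_ENTRY.finditer(log_text))
    entries = []
    for i, m in enumerate(matches):
        # An entry's message runs until the next prefix (or the end of the text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(log_text)
        entries.append((m.group(1), m.group(2), log_text[m.end():end].strip()))
    return entries


def errors_only(log_text):
    """Keep only entries whose severity is E (error)."""
    return [entry for entry in split_entries(log_text) if entry[1] == "E"]


# Shortened excerpt of the log from this issue, pasted as one line.
sample = (
    "[07/28/2023-10:51:11] [W] - Subnormal FP16 values detected. "
    "[07/28/2023-10:51:22] [E] 1: [wrapper.cpp::plainGemm::206] "
    "Error Code 1: Cublas (CUBLAS_STATUS_NOT_SUPPORTED) "
    "[07/28/2023-10:51:22] [E] 2: [builder.cpp::buildSerializedNetwork::636] "
    "Error Code 2: Internal Error (Assertion engine != nullptr failed. )"
)

for timestamp, _severity, message in errors_only(sample):
    print(timestamp, message)
```

On the excerpt, only the two `[E]` entries remain: the cuBLAS error and the failed engine assertion that follows it.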

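PS: I misread the question this morning. I thought the model file path could not be found, but it turns out the build died before the engine was even generated.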

Natsu-Akatsuki commented 1 year ago

If you are building with ROS1 catkin build, please post logs/rangenet_pp/build.cmake.log.

yuyaoliang commented 1 year ago

Thanks for the reply. Here is my build.cmake.log, please take a look:

Not searching for unused variables given on the command line.
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found version "11.3")
-- Using CATKIN_DEVEL_PREFIX: /home/liangyuyao/Lidar/rangenet_pp/devel/.private/rangenet_pp
-- Using CMAKE_PREFIX_PATH: /home/liangyuyao/Lidar/rangenet_pp/devel;/opt/ros/noetic
-- This workspace overlays: /home/liangyuyao/Lidar/rangenet_pp/devel;/opt/ros/noetic
-- Found PythonInterp: /home/liangyuyao/anaconda3/bin/python3 (found suitable version "3.9.13", minimum required is "3")
-- Using PYTHON_EXECUTABLE: /home/liangyuyao/anaconda3/bin/python3
-- Using Debian Python package layout
-- Found PY_em: /home/liangyuyao/anaconda3/lib/python3.9/site-packages/em.py
-- Using empy: /home/liangyuyao/anaconda3/lib/python3.9/site-packages/em.py
-- Using CATKIN_ENABLE_TESTING: ON
-- Call enable_testing()
-- Using CATKIN_TEST_RESULTS_DIR: /home/liangyuyao/Lidar/rangenet_pp/build/rangenet_pp/test_results
-- Forcing gtest/gmock from source, though one was otherwise available.
-- Found gtest sources under '/usr/src/googletest': gtests will be built
-- Found gmock sources under '/usr/src/googletest': gmock will be built
-- Found PythonInterp: /home/liangyuyao/anaconda3/bin/python3 (found version "3.9.13")
-- Using Python nosetests: /usr/bin/nosetests3
-- catkin 0.8.10
-- BUILD_SHARED_LIBS is on
-- Using these message generators: gencpp;geneus;genlisp;gennodejs;genpy
-- [INFO] CMAKE_BUILD_TYPE:
-- [INFO] ROS1 is available!
-- Found CUDA: /usr/local/cuda (found version "11.3")
-- Caffe2: CUDA detected: 11.3
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.3
-- Found CUDNN: /usr/local/cuda/lib64/libcudnn.so
-- Found cuDNN: v8.8.0 (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn.so)
-- /usr/local/cuda/lib64/libnvrtc.so shorthash is 8aa72235
-- Automatic GPU detection failed. Building for common architectures.
-- Autodetected CUDA architecture(s): 3.5;5.0;5.2;6.0;6.1;7.0;7.5;8.0;8.6;8.6+PTX
-- Added CUDA NVCC flags for: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_86,code=compute_86
-- Found Torch: /home/liangyuyao/software/libtorch/lib/libtorch.so
-- Checking for module 'eigen3'
-- Found eigen3, version 3.3.7
-- Found Eigen: /usr/include/eigen3 (Required is at least version "3.1")
-- Eigen found (include: /usr/include/eigen3, version: 3.3.7)
-- Checking for module 'flann'
-- Found flann, version 1.9.1
-- Found FLANN: /usr/lib/x86_64-linux-gnu/libflann_cpp.so
-- The imported target "vtkParseOGLExt" references the file "/usr/bin/vtkParseOGLExt-7.1" but this file does not exist. Possible reasons include:

-- The imported target "vtkRenderingPythonTkWidgets" references the file "/usr/lib/x86_64-linux-gnu/libvtkRenderingPythonTkWidgets.so" but this file does not exist. Possible reasons include:

-- The imported target "vtk" references the file "/usr/bin/vtk" but this file does not exist. Possible reasons include:

-- The imported target "pvtk" references the file "/usr/bin/pvtk" but this file does not exist. Possible reasons include:

-- Checking for module 'libusb-1.0'
-- Found libusb-1.0, version 1.0.23
-- Found USB_10: /usr/lib/x86_64-linux-gnu/libusb-1.0.so
-- OpenNI found (include: /usr/include/ni, lib: /usr/lib/libOpenNI.so)
-- OpenNI2 found (include: /usr/include/openni2, lib: /usr/lib/libOpenNI2.so)
-- Found libusb-1.0: /usr/include
-- OpenNI found (include: /usr/include/ni, lib: /usr/lib/libOpenNI.so)
-- OpenNI2 found (include: /usr/include/openni2, lib: /usr/lib/libOpenNI2.so)
-- Found Qhull: optimized;/usr/lib/x86_64-linux-gnu/libqhull.so;debug;/usr/lib/x86_64-linux-gnu/libqhull.so
-- QHULL found (include: /usr/include, lib: optimized;/usr/lib/x86_64-linux-gnu/libqhull.so;debug;/usr/lib/x86_64-linux-gnu/libqhull.so)
-- OpenNI found (include: /usr/include/ni, lib: /usr/lib/libOpenNI.so)
-- Found PCL_COMMON: /usr/lib/x86_64-linux-gnu/libpcl_common.so
-- Found PCL_KDTREE: /usr/lib/x86_64-linux-gnu/libpcl_kdtree.so
-- Found PCL_OCTREE: /usr/lib/x86_64-linux-gnu/libpcl_octree.so
-- Found PCL_SEARCH: /usr/lib/x86_64-linux-gnu/libpcl_search.so
-- Found PCL_SAMPLE_CONSENSUS: /usr/lib/x86_64-linux-gnu/libpcl_sample_consensus.so
-- Found PCL_FILTERS: /usr/lib/x86_64-linux-gnu/libpcl_filters.so
-- Found PCL_2D: /usr/include/pcl-1.10
-- Found PCL_GEOMETRY: /usr/include/pcl-1.10
-- Found PCL_IO: /usr/lib/x86_64-linux-gnu/libpcl_io.so
-- Found PCL_FEATURES: /usr/lib/x86_64-linux-gnu/libpcl_features.so
-- Found PCL_ML: /usr/lib/x86_64-linux-gnu/libpcl_ml.so
-- Found PCL_SEGMENTATION: /usr/lib/x86_64-linux-gnu/libpcl_segmentation.so
-- Found PCL_VISUALIZATION: /usr/lib/x86_64-linux-gnu/libpcl_visualization.so
-- Found PCL_SURFACE: /usr/lib/x86_64-linux-gnu/libpcl_surface.so
-- Found PCL_REGISTRATION: /usr/lib/x86_64-linux-gnu/libpcl_registration.so
-- Found PCL_KEYPOINTS: /usr/lib/x86_64-linux-gnu/libpcl_keypoints.so
-- Found PCL_TRACKING: /usr/lib/x86_64-linux-gnu/libpcl_tracking.so
-- Found PCL_RECOGNITION: /usr/lib/x86_64-linux-gnu/libpcl_recognition.so
-- Found PCL_STEREO: /usr/lib/x86_64-linux-gnu/libpcl_stereo.so
-- Found PCL_APPS: /usr/lib/x86_64-linux-gnu/libpcl_apps.so
-- Found PCL_IN_HAND_SCANNER: /usr/include/pcl-1.10
-- Found PCL_POINT_CLOUD_EDITOR: /usr/include/pcl-1.10
-- Found PCL_OUTOFCORE: /usr/lib/x86_64-linux-gnu/libpcl_outofcore.so
-- Found PCL_PEOPLE: /usr/lib/x86_64-linux-gnu/libpcl_people.so
-- [INFO] PCL_VERSION is 1.10.0
-- [INFO] CUDA is available!
-- [INFO] CUDA Libs: /usr/local/cuda/lib64/libcudart.so
-- [INFO] CUDA Headers: /usr/local/cuda/include
-- [INFO] CUDNN is available!
-- [INFO] CUDNN_LIBRARY: /usr/local/cuda/lib64/libcudnn.so
-- [INFO] TensorRT is not installed through DEB package, should specify the environment parameter TENSORRT_DIR explicitly
-- [INFO] NVINFER: /home/liangyuyao/software/TensorRT-8.4.1.5/lib/libnvinfer.so
-- [INFO] NVPARSERS: /home/liangyuyao/software/TensorRT-8.4.1.5/lib/libnvparsers.so
-- [INFO] NVINFER_PLUGIN: /home/liangyuyao/software/TensorRT-8.4.1.5/lib/libnvinfer_plugin.so
-- [INFO] NVONNX_PARSER: /home/liangyuyao/software/TensorRT-8.4.1.5/lib/libnvonnxparser.so
-- [INFO] TensorRT is available!
-- Configuring done
-- Generating done
-- Build files have been written to: /home/liangyuyao/Lidar/rangenet_pp/build/rangenet_pp

Natsu-Akatsuki commented 1 year ago

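Or there is not enough GPU memory.
1) Judging from the second log, the environment configuration is fine.
2) Judging from the first log, the error occurs during engine optimization, after which the program exits. Please check whether your GPU has enough memory to build the engine (the documentation says roughly 3 GB, but that is probably the initial figure; once optimization has progressed far enough it climbs to about 5 GB).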

PS: Use `watch -c nvidia-smi` to check GPU usage while the program is running.

yuyaoliang commented 1 year ago

Thanks for the reply. I checked the usage:

| 0 NVIDIA GeForce GTX 950M Off | 00000000:01:00.0 Off | N/A |
| N/A 51C P8 N/A / 200W | 1486MiB / 2048MiB | 99% Default |
| | | N/A |

The program terminated at around 1500 MB, and the terminal reported the following errors:

[07/29/2023-16:52:12] [E] 1: [wrapper.cpp::plainGemm::206] Error Code 1: Cublas (CUBLAS_STATUS_NOT_SUPPORTED)
[07/29/2023-16:52:12] [E] 2: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
terminate called after throwing an instance of 'std::runtime_error'
what(): failed to build tensorrt engine
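The reading above can be compared with the rough figures quoted earlier in the thread (about 3 GB at the start of engine building, rising to about 5 GB). The Python sketch below is a minimal illustration under stated assumptions: it parses the CSV form printed by `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`, and `REQUIRED_MIB`, `parse_memory_csv`, `query_gpu_memory`, and `can_fit` are hypothetical names chosen for this example, not part of the repository.

```python
import subprocess

# Assumption: ~5 GB peak during engine building, the rough figure from this thread.
REQUIRED_MIB = 5 * 1024


def parse_memory_csv(csv_text):
    """Parse lines such as '1486, 2048' into (used_mib, total_mib) pairs."""
    readings = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Run nvidia-smi and return (used_mib, total_mib) for each GPU.

    Requires an NVIDIA driver; raises if nvidia-smi is missing or fails.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)


def can_fit(total_mib, required_mib=REQUIRED_MIB):
    """Return True if the card's total memory could cover the estimated peak."""
    return total_mib >= required_mib


# The reading reported in this issue: 1486 MiB used of 2048 MiB total.
used_mib, total_mib = parse_memory_csv("1486, 2048")[0]
print(f"total={total_mib} MiB, needed~{REQUIRED_MIB} MiB, fits={can_fit(total_mib)}")
```

On a machine with an NVIDIA driver, `query_gpu_memory()` can replace the hard-coded sample string. For the 2048 MiB card in this issue the check fails even against the lower 3 GB figure (`can_fit(2048, 3 * 1024)` is also false).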

Natsu-Akatsuki commented 1 year ago

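Hmm, then most likely the GPU simply does not have enough memory. On the one hand, you could spend money to solve the hardware problem. On the other hand, I will look into whether there is a way to reduce GPU memory usage during optimization...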

yuyaoliang commented 1 year ago
OK, thanks a lot!
