Natsu-Akatsuki / RangeNetTrt8

tensorrt8 && cuda && libtorch implementation of rangenet++
MIT License
44 stars 9 forks

Data anomalies and KNN post-processing issues #15

Closed yuyaoliang closed 9 months ago

yuyaoliang commented 10 months ago

Hello, I trained the network on my own solid-state LiDAR data, with the range-image size set to 128*1024. Training, inference, and KNN post-processing all work fine in Python, but with your code I run into two problems:

1. With config->setFlag(nvinfer1::BuilderFlag::kFP16); (FP16) the predicted labels are all -1. After switching to TF32 things improve, but some regions still come out with label -1 and are rendered black (attached screenshot: 微信图片_20231114165538).

2. The result above was obtained with KNN post-processing disabled. After enabling post-processing with isPostprocess = true, inference fails with the following error:

[11/14/2023-16:21:26] [I] I/O dimensions respectively are: [ 1 5 128 1024 ]
[11/14/2023-16:21:26] [I] I/O dimensions respectively are: [ 1 1 128 1024 ]
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:111: operator(): block: [215,0,0], thread: [34,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:111: operator(): block: [215,0,0], thread: [39,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:111: operator(): block: [307,0,0], thread: [7,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:111: operator(): block: [307,0,0], thread: [12,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:111: operator(): block: [307,0,0], thread: [17,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:111: operator(): block: [307,0,0], thread: [22,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:111: operator(): block: [307,0,0], thread: [27,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:111: operator(): block: [307,0,0], thread: [37,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:111: operator(): block: [307,0,0], thread: [47,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:111: operator(): block: [307,0,0], thread: [52,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:111: operator(): block: [307,0,0], thread: [57,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:111: operator(): block: [588,0,0], thread: [60,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from memcpy_and_sync at /pytorch/c10/cuda/CUDAFunctions.h:75 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7fd5a74940db in /home/nvidia/software/libtorch/lib/libc10.so)
frame #1: <unknown function> + 0x1479d72 (0x7fd5c45d2d72 in /home/nvidia/software/libtorch/lib/libtorch_cuda_cu.so)
frame #2: <unknown function> + 0x14179b5 (0x7fd607ea29b5 in /home/nvidia/software/libtorch/lib/libtorch_cpu.so)
frame #3: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x4e (0x7fd607ea350e in /home/nvidia/software/libtorch/lib/libtorch_cpu.so)
frame #4: at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) + 0x179 (0x7fd60852ee79 in /home/nvidia/software/libtorch/lib/libtorch_cpu.so)
frame #5: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x145c (0x7fd60812457c in /home/nvidia/software/libtorch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x1f26533 (0x7fd6089b1533 in /home/nvidia/software/libtorch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x1dd8728 (0x7fd608863728 in /home/nvidia/software/libtorch/lib/libtorch_cpu.so)
frame #8: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x14a (0x7fd6083da78a in /home/nvidia/software/libtorch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x3465973 (0x7fd609ef0973 in /home/nvidia/software/libtorch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x3465e7b (0x7fd609ef0e7b in /home/nvidia/software/libtorch/lib/libtorch_cpu.so)
frame #11: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1ea (0x7fd6084577da in /home/nvidia/software/libtorch/lib/libtorch_cpu.so)
frame #12: at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x17e (0x7fd608122c1e in /home/nvidia/software/libtorch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x1fc9599 (0x7fd608a54599 in /home/nvidia/software/libtorch/lib/libtorch_cpu.so)
frame #14: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x20b (0x7fd608740f5b in /home/nvidia/software/libtorch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x4ff7 (0x7fd68efa1ff7 in /home/nvidia/Lidar/rangenet/devel/lib/libpostprocess.so)
frame #16: Postprocess::postprocessKNN(float const*, float*, float const*, float const*, float const*, int, int*) + 0x14ba (0x7fd68efa3f5a in /home/nvidia/Lidar/rangenet/devel/lib/libpostprocess.so)
frame #17: rangenet::segmentation::NetTensorRT::doInfer(pcl::PointCloud<pcl::PointXYZI> const&, int*) + 0x4f9 (0x7fd68efd13d9 in /home/nvidia/Lidar/rangenet/devel/lib/librangenet_lib.so)
frame #18: <unknown function> + 0xb24f (0x563acf88024f in /home/nvidia/Lidar/rangenet/devel/lib/rangenet_pp/ros1_demo)
frame #19: <unknown function> + 0x13402 (0x563acf888402 in /home/nvidia/Lidar/rangenet/devel/lib/rangenet_pp/ros1_demo)
frame #20: <unknown function> + 0x1546d (0x563acf88a46d in /home/nvidia/Lidar/rangenet/devel/lib/rangenet_pp/ros1_demo)
frame #21: ros::SubscriptionQueue::call() + 0x989 (0x7fd68f1cc139 in /opt/ros/noetic/lib/libroscpp.so)
frame #22: ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*) + 0x112 (0x7fd68f17a172 in /opt/ros/noetic/lib/libroscpp.so)
frame #23: ros::CallbackQueue::callAvailable(ros::WallDuration) + 0x323 (0x7fd68f17b883 in /opt/ros/noetic/lib/libroscpp.so)
frame #24: ros::SingleThreadedSpinner::spin(ros::CallbackQueue*) + 0x5df (0x7fd68f1cefcf in /opt/ros/noetic/lib/libroscpp.so)
frame #25: ros::spin() + 0x2f (0x7fd68f1b721f in /opt/ros/noetic/lib/libroscpp.so)
frame #26: <unknown function> + 0xa0c5 (0x563acf87f0c5 in /home/nvidia/Lidar/rangenet/devel/lib/rangenet_pp/ros1_demo)
frame #27: __libc_start_main + 0xf3 (0x7fd5c2b8a083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #28: <unknown function> + 0xa25e (0x563acf87f25e in /home/nvidia/Lidar/rangenet/devel/lib/rangenet_pp/ros1_demo)

[rangenet_pp-2] process has died [pid 15309, exit code -6, cmd /home/nvidia/Lidar/rangenet/devel/lib/rangenet_pp/ros1_demo __name:=rangenet_pp __log:=/home/nvidia/.ros/log/c9cdd400-82c6-11ee-ab8b-2159e06240c5/rangenet_pp-2.log].
log file: /home/nvidia/.ros/log/c9cdd400-82c6-11ee-ab8b-2159e06240c5/rangenet_pp-2*.log

Do you have any suggestions or ideas on this?

Natsu-Akatsuki commented 10 months ago

Hi, some hyperparameters are hard-coded, and I'm not sure whether you changed them, for example: https://github.com/Natsu-Akatsuki/RangeNetTrt8/blob/67a7bfbf11258ecf1a47d58e691d3930fce20c6c/src/utils/postprocess.cpp#L49
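
For reference, a minimal sketch of the idea behind this hint (not the repository's actual code; the constant names IMG_H / IMG_W are made up): the post-processing stage wraps raw buffers into tensors with hard-coded range-image dimensions, and those literals have to match the model actually being deployed (64 x 2048 for the stock KITTI model, 128 x 1024 for the model in this issue).

#include <torch/torch.h>

// Hypothetical constants; the stock model uses 64 x 2048,
// the model in this issue uses 128 x 1024.
constexpr int64_t IMG_H = 128;
constexpr int64_t IMG_W = 1024;

// Same pattern as the snippet quoted later in this thread, with the
// dimensions lifted out of the literals.
at::Tensor wrapLabelImage(float *label_img, const torch::Device &device) {
  return torch::from_blob(label_img, {1, 1, IMG_H, IMG_W})
      .to(device)
      .toType(torch::kFloat32);
}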

Natsu-Akatsuki commented 10 months ago

Regarding the first problem, did you switch the setting according to your TensorRT version? For example: https://github.com/Natsu-Akatsuki/RangeNetTrt8/blob/67a7bfbf11258ecf1a47d58e691d3930fce20c6c/src/network/netTensorRT.cpp#L287
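
As a side note, one way such a version switch can be expressed at compile time (a sketch only; the constant name kFirstFp32Layer is made up, and the 236 vs. 235 indices are taken from the comments later in this thread):

#include <NvInferVersion.h>

// Pick the index of the first layer that must stay FP32, depending on the
// TensorRT version the code is built against (values quoted in this thread).
#if NV_TENSORRT_MAJOR == 8 && NV_TENSORRT_MINOR >= 4
constexpr int kFirstFp32Layer = 235;
#else
constexpr int kFirstFp32Layer = 236;
#endif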

yuyaoliang commented 9 months ago

> Hi, some hyperparameters are hard-coded, and I'm not sure whether you changed them, for example:
>
> https://github.com/Natsu-Akatsuki/RangeNetTrt8/blob/67a7bfbf11258ecf1a47d58e691d3930fce20c6c/src/utils/postprocess.cpp#L49

I made the corresponding changes here, but I don't understand why the shape is {1, 1, 64, 2048} rather than {1, 5, 64, 2048}:

  const at::Tensor range_img_torch =
      torch::from_blob((float *)range_img, {1, 1, 128, 1024}).to(device_);
  at::Tensor range_torch =
      torch::from_blob((float *)range_arr, {point_num}).to(device_);
  const at::Tensor label_img_torch =
      torch::from_blob((float *)label_img, {1, 1, 128, 1024})
          .to(device_)
          .toType(torch::kFloat32);
  const at::Tensor px_torch =
      torch::from_blob((float *)pxs, {point_num}).to(device_);
  const at::Tensor pys_torch =
      torch::from_blob((float *)pys, {point_num}).to(device_);
  int img_width = 1024;

yuyaoliang commented 9 months ago

> Regarding the first problem, did you switch the setting according to your TensorRT version? For example:
>
> https://github.com/Natsu-Akatsuki/RangeNetTrt8/blob/67a7bfbf11258ecf1a47d58e691d3930fce20c6c/src/network/netTensorRT.cpp#L287

I made the corresponding change here as well, but some points still end up with label -1:

  config->setFlag(nvinfer1::BuilderFlag::kTF32);
  config->setMaxWorkspaceSize(5UL << 30);
  config->setFlag(BuilderFlag::kPREFER_PRECISION_CONSTRAINTS);

  auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(
      builder->createNetworkV2(explicitBatch));
  assert(network != nullptr);

  // generate a parser to get weights from onnx file
  auto parser = std::unique_ptr<nvonnxparser::IParser>(
      nvonnxparser::createParser(*network, _gLogger));

  parser->parseFromFile(onnx_path.c_str(),
                        static_cast<int>(Logger::Severity::kWARNING));

  auto ktop_layer = network->addTopK(*network->getOutput(0),
                                     nvinfer1::TopKOperation::kMAX, 1, 2);
  assert(ktop_layer != nullptr);
  ktop_layer->setName("topKLayer");
  // Replace the original network output with the output of the addTopK layer
  network->unmarkOutput(*network->getOutput(0));
  std::cout << network->getNbOutputs() << std::endl;
  network->markOutput(*ktop_layer->getOutput(1));

  // Prevent data overflow when these layers are converted to 16-bit:
  // layers 1-236 keep FP16 weights,
  // layers 236-237 are set to FP32,
  // layer 237 is set to INT32.
  // With TensorRT 8.2, start should be set to 236;
  // with TensorRT 8.4, start should be set to 235.
  int start = 177;
  int end = 179;
  for (int i = start; i <= end; i++) {
    auto layer = network->getLayer(i);
    std::string layerName = layer->getName();
    layer->setPrecision(nvinfer1::DataType::kFLOAT);
  }
  auto layer = network->getLayer(180);
  // note: the TopK layer's index output is an integer type
  std::string layerName = layer->getName();
  layer->setPrecision(nvinfer1::DataType::kINT32);

Natsu-Akatsuki commented 9 months ago

I see you set layers 177-179 there; why did you pick those? For the original model I found the right layers by binary search; the reasoning and steps are in https://github.com/Natsu-Akatsuki/RangeNetTrt8/issues/8. For your new model you will probably have to experiment yourself.
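
As a starting point for that search, a small sketch (not repository code) that dumps every layer's index, type, and name right after the ONNX file is parsed, so the candidate layers near the output can be located for a new model before bisecting:

#include <NvInfer.h>
#include <iostream>

// Print index, type id, and name of every layer in the parsed network.
// `network` is assumed to be the nvinfer1::INetworkDefinition built in
// netTensorRT.cpp after parser->parseFromFile(...).
void dumpLayers(const nvinfer1::INetworkDefinition &network) {
  for (int i = 0; i < network.getNbLayers(); ++i) {
    const nvinfer1::ILayer *layer = network.getLayer(i);
    std::cout << "layer " << i
              << " type=" << static_cast<int>(layer->getType())
              << " name=" << layer->getName() << std::endl;
  }
}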

Natsu-Akatsuki commented 9 months ago

> Hi, some hyperparameters are hard-coded, and I'm not sure whether you changed them, for example: https://github.com/Natsu-Akatsuki/RangeNetTrt8/blob/67a7bfbf11258ecf1a47d58e691d3930fce20c6c/src/utils/postprocess.cpp#L49
>
> I made the corresponding changes here, but I don't understand why the shape is {1, 1, 64, 2048} rather than {1, 5, 64, 2048}:

>   const at::Tensor range_img_torch =
>       torch::from_blob((float *)range_img, {1, 1, 128, 1024}).to(device_);
>   at::Tensor range_torch =
>       torch::from_blob((float *)range_arr, {point_num}).to(device_);
>   const at::Tensor label_img_torch =
>       torch::from_blob((float *)label_img, {1, 1, 128, 1024})
>           .to(device_)
>           .toType(torch::kFloat32);
>   const at::Tensor px_torch =
>       torch::from_blob((float *)pxs, {point_num}).to(device_);
>   const at::Tensor pys_torch =
>       torch::from_blob((float *)pys, {point_num}).to(device_);
>   int img_width = 1024;

This is already the post-processing stage. At that stage the range image is the label image; it is no longer the feature image. Also, the final output of this TensorRT model should likewise be [1, 1, 128, 1024] (given your range-image size), so where does the 5 come from? For the second dimension, RangeNet just uses 1.
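
One quick way to confirm these shapes on the deployed engine (a sketch using the pre-8.5 binding API, not code from this repository): print every binding after deserialization; the input should be the 5-channel [1, 5, H, W] image and the TopK output the [1, 1, H, W] label image.

#include <NvInfer.h>
#include <iostream>

// List each engine binding with its direction and dimensions.
void dumpBindings(const nvinfer1::ICudaEngine &engine) {
  for (int i = 0; i < engine.getNbBindings(); ++i) {
    const nvinfer1::Dims dims = engine.getBindingDimensions(i);
    std::cout << (engine.bindingIsInput(i) ? "input  " : "output ")
              << engine.getBindingName(i) << " [";
    for (int d = 0; d < dims.nbDims; ++d) {
      std::cout << dims.d[d] << (d + 1 < dims.nbDims ? " " : "");
    }
    std::cout << "]" << std::endl;
  }
}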

Natsu-Akatsuki commented 9 months ago

My guess is that this is exactly an out-of-bounds access, i.e. something does not match the inputs I configured; where exactly they differ is something you will have to check.
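
A defensive check along those lines (a sketch only, not repository code; variable names follow the snippet quoted above): validate the projection indices against the range-image size before they are used for gather/scatter in the KNN post-processing, so a mismatch shows up as a readable error instead of a device-side assert.

#include <torch/torch.h>
#include <stdexcept>
#include <string>

// Sanity-check the projection indices before gather / scatter. px_torch and
// pys_torch are the per-point image coordinates, as in the snippet above.
void checkProjectionIndices(const at::Tensor &px_torch,
                            const at::Tensor &pys_torch,
                            float img_width, float img_height) {
  const float px_min = px_torch.min().item<float>();
  const float px_max = px_torch.max().item<float>();
  const float py_min = pys_torch.min().item<float>();
  const float py_max = pys_torch.max().item<float>();
  if (px_min < 0 || px_max >= img_width || py_min < 0 || py_max >= img_height) {
    throw std::runtime_error(
        "projection indices out of bounds: px in [" + std::to_string(px_min) +
        ", " + std::to_string(px_max) + "], py in [" + std::to_string(py_min) +
        ", " + std::to_string(py_max) + "]");
  }
}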

yuyaoliang commented 9 months ago

> I see you set layers 177-179 there; why did you pick those? For the original model I found the right layers by binary search; the reasoning and steps are in https://github.com/Natsu-Akatsuki/RangeNetTrt8/issues/8. For your new model you will probably have to experiment yourself.

Setting layers 177-179 is probably redundant, since I already call config->setFlag(nvinfer1::BuilderFlag::kTF32); at the very beginning. That way everything except the final INT32 layer runs in kTF32, right? But even in that case some points still end up with class -1.

Natsu-Akatsuki commented 9 months ago

Right. To clarify, the comment below was replying to this part: "With config->setFlag(nvinfer1::BuilderFlag::kFP16); the predicted labels are all -1."

> I see you set layers 177-179 there; why did you pick those? For the original model I found the right layers by binary search; the reasoning and steps are in https://github.com/Natsu-Akatsuki/RangeNetTrt8/issues/8. For your new model you will probably have to experiment yourself.

Natsu-Akatsuki commented 9 months ago

As for the remaining partial -1 labels, you can refer to https://github.com/Natsu-Akatsuki/RangeNetTrt8/blob/67a7bfbf11258ecf1a47d58e691d3930fce20c6c/src/network/netTensorRT.cpp#L69 and first try visualizing the range image to see what the image-representation point cloud looks like, in order to determine whether the problem lies in the model or in the 2D -> 3D reprojection.
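
For the visualization step, a quick sketch (assuming OpenCV is available, which is common in ROS workspaces; range_img stands for the H*W float buffer fed to the network): dump one channel of the projected image to a PNG so it is easy to see whether the 2D representation already looks wrong before reprojection.

#include <opencv2/opencv.hpp>
#include <string>

// Write the projected range channel to a grayscale PNG for inspection.
void saveRangeImage(const float *range_img, int height, int width,
                    const std::string &path) {
  // Wrap the raw buffer without copying, then rescale to 0-255 for viewing.
  cv::Mat img(height, width, CV_32FC1, const_cast<float *>(range_img));
  cv::Mat vis;
  cv::normalize(img, vis, 0, 255, cv::NORM_MINMAX, CV_8UC1);
  cv::imwrite(path, vis);
}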