PostProcessCuda too slow when update code to commit 4e8e4f3

orekides commented 2 years ago

The TIME: doPostprocessCuda value rises to around 80000 ms when I use the latest code (commit 4e8e4f3); before (commit db037d2) it was around 5 ms. The number of bounding boxes is also far too large.

The GPU info:
GPU : Orin
Capbility: 8.7
Global memory: 30622MB
Const memory: 64KB
SM in a block: 48KB
warp size: 32
threads in a block: 1024
block dim: (1024,1024,64)
grid dim: (2147483647,65535,65535)

The run time detail:
<<<<<<<<<<<
load file: ../data/000000.bin
find points num: 125635
find pillar_num: 9539
TIME: generateVoxels: 0.97344 ms.
TIME: generateFeatures: 1.00912 ms.
TIME: doinfer: 57.9808 ms.
TIME: doPostprocessCuda: 57716.1 ms.
TIME: pointpillar: 57777.3 ms.
Bndbox objs: 3061

OPPOA113 commented 2 years ago

I'm seeing the same problem. The thresh is too small. An output of Bndbox objs: 3061 is so many objects that something must be wrong. Resetting thresh fixes it.
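As a rough illustration of why the score threshold matters for post-processing time, the sketch below filters candidate boxes by score and estimates how many pairwise overlap checks a naive NMS would then perform in the worst case. The Box struct and both function names are hypothetical and are not taken from the repository; the actual CUDA post-processing may be organized differently.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical box record; field names are illustrative only.
struct Box {
    float x, y, z, w, l, h, rot;
    float score;
};

// Keep only the boxes whose score reaches score_thresh.
std::vector<Box> filterByScore(const std::vector<Box>& boxes, float score_thresh) {
    std::vector<Box> kept;
    for (const Box& b : boxes) {
        if (b.score >= score_thresh) {
            kept.push_back(b);
        }
    }
    return kept;
}

// Worst-case number of pairwise overlap checks a naive NMS performs on n boxes.
std::size_t worstCaseNmsPairs(std::size_t n) {
    return n < 2 ? 0 : n * (n - 1) / 2;
}
```

With 3061 surviving boxes, the worst case is already about 4.7 million pair checks, so a lower threshold that lets through tens of thousands of boxes grows the cost quadratically. This only describes the threshold effect; it does not by itself explain why so many boxes score highly in the first place.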

orekides commented 2 years ago

@OPPOA113 problem solved, thanks. But I wonder what causes the difference.

PeterJaq commented 2 years ago

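I'm seeing the same problem. The thresh is too small. An output of Bndbox objs: 3061 is so many objects that something must be wrong. Resetting thresh fixes it.

Hello, I tried setting
const float score_thresh = 0.5; const float nms_thresh = 0.5;

but it still runs very slowly. Do you have recommended values?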

OPPOA113 commented 2 years ago

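Set it to 0.6 and you'll find it runs fast. But after visualizing the output, I found quite a few wrong detections. The results differ a lot from the output of the PyTorch version in the openpcdet project.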

orekides commented 2 years ago

@OPPOA113 have you checked the old version (db037d2)? Does it also differ from openpcdet?

PeterJaq commented 2 years ago

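Set it to 0.6 and you'll find it runs fast. But after visualizing the output, I found quite a few wrong detections. The results differ a lot from the output of the PyTorch version in the openpcdet project.

Thanks, I tried it and that is indeed the case!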

OPPOA113 commented 2 years ago

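Set it to 0.6 and you'll find it runs fast. But after visualizing the output, I found quite a few wrong detections. The results differ a lot from the output of the PyTorch version in the openpcdet project.

Thanks, I tried it and that is indeed the case!

Any ideas? I've been stuck on this for a while.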

orekides commented 2 years ago

I have tried this model on some robosense point clouds, and the result is weird. Maybe I should try working with openpcdet (:

OPPOA113 commented 2 years ago

@OPPOA113 have you checked the old version (db037d2)? Does it also differ from openpcdet?

I haven't tried it on the earlier version.

OPPOA113 commented 2 years ago

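I have tried this model on some robosense point clouds, and the result is weird. Maybe I should try working with openpcdet (:

The openpcdet results are normal, but the visualization under cuda-pointpillar is abnormal. Moreover, with thresh set below 0.6, the number of predicted bndbox reaches tens of thousands. So there should be a problem with the model or the logic.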

orekides commented 2 years ago

@OPPOA113 I wonder whether the developer ever tested it and compared with openpcdet<^ ^>.

OPPOA113 commented 2 years ago

@OPPOA113 I wonder whether the developer ever tested it and compared with openpcdet<^ ^>.

@byte-deve Haha, waiting for the expert's reply..

OPPOA113 commented 2 years ago

@OPPOA113 I wonder whether the developer ever tested it and compared with openpcdet<^ ^>.

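I spent another long while on this and still haven't found where the problem is. The visualized result looks like this, while the same model displays correctly in openpcdet. @byte-deve do you have any suggestions?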

byte-deve commented 2 years ago

Hi @OPPOA113,

Are you on Jetson Xavier fp16 mode? I'm trying to reproduce your issue. It's suspected to be caused by newly introduced output format kHWC8. Under validation.

BR Tony

OPPOA113 commented 2 years ago

Hi @OPPOA113,

Are you on Jetson Xavier fp16 mode? I'm trying to reproduce your issue. It's suspected to be caused by newly introduced output format kHWC8. Under validation.

BR Tony

Not Jetson. It's an x86 Ubuntu system with fp16 precision. I'd like to ask, what does the 8 in kHWC8 refer to? Thanks.

byte-deve commented 2 years ago

It is one kind of tensor format; you can refer to the link below: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#data-format-desc

x86 uses fp32 by default, so the issue might be related to another incremental update. You can use the previous commit db037d2 for now; we will fix the mentioned issue shortly.
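To make the "8" in kHWC8 concrete: in TensorRT's data format description, kHWC8 is a channels-last layout in which the channel dimension is padded up to a multiple of 8. The following is a minimal sketch of that indexing; the helper names paddedChannels and hwc8Offset are made up for illustration and are not part of TensorRT or of this repository.

```cpp
#include <cstddef>

// Channel count rounded up to the next multiple of 8, as the HWC8 layout pads it.
std::size_t paddedChannels(std::size_t c) {
    return (c + 7) / 8 * 8;
}

// Linear element offset of coordinate (n, c, h, w) in a buffer laid out as
// [N][H][W][paddedChannels(C)], i.e. channels-last with padded channels.
std::size_t hwc8Offset(std::size_t n, std::size_t c, std::size_t h, std::size_t w,
                       std::size_t C, std::size_t H, std::size_t W) {
    const std::size_t cp = paddedChannels(C);
    return ((n * H + h) * W + w) * cp + c;
}
```

For example, a tensor with 18 channels occupies 24 channel slots per spatial position in this layout, so code that reads the output as if it were a densely packed NCHW or NHWC buffer would pick up the wrong elements.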

OPPOA113 commented 2 years ago

It is one kind of tensor format; you can refer to the link below: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#data-format-desc

x86 uses fp32 by default, so the issue might be related to another incremental update. You can use the previous commit db037d2 for now; we will fix the mentioned issue shortly.

OK, thanks. On x86 I added the definition and switched to fp16 mode. OK, I'll try the earlier commit again.

OPPOA113 commented 2 years ago

It is one kind of tensor format; you can refer to the link below: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#data-format-desc

x86 uses fp32 by default, so the issue might be related to another incremental update. You can use the previous commit db037d2 for now; we will fix the mentioned issue shortly.

OK, thanks. On x86 I added the definition and switched to fp16 mode. OK, I'll try the earlier commit again.

Also, I tried fp32 on x86 as well. The same problem occurs.

byte-deve commented 2 years ago

Hi @OPPOA113 @orekides @PeterJaq, thanks for your comments.

The root cause is a mistaken dense shape here. Please try the later fix commit.

BTW, we're planning to add metrics for the KITTI and Waymo datasets. You can keep watching this repo.

BR Tony

byte-deve commented 2 years ago

Fix merged.

OPPOA113 commented 2 years ago

Fix merged.

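OK. Many thanks. I'm testing now, and the visualization displays normally. But I see quite a few false detections; I'll test in detail and report back...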

OPPOA113 commented 2 years ago

Fix merged.

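OK. Many thanks. I'm testing now, and the visualization displays normally. But I see quite a few false detections; I'll test in detail and report back...

@byte-deve Test feedback:
1. The output displays normally.
2. With the same thresh and nms_thresh, the torch version outputs about 500 more objects than the cpp version. In the figure below, the left is torch and the right is cpp:
3. Running the cpp demo multiple times gives essentially consistent output; the object count differs by about 20:
4. cpp visualization:
5. torch visualization: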
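One way to quantify the gap between the torch and cpp outputs, beyond comparing raw object counts, is to match detections from the two runs by box center and count how many have a close counterpart. The sketch below uses a simple greedy one-to-one matching; the Center struct, the countMatched function, and the distance tolerance are all assumptions for illustration and do not come from either project.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical detection center; only the position is compared here.
struct Center {
    float x, y, z;
};

// Greedy one-to-one matching: for each center in a, claim the nearest
// still-unused center in b that lies within max_dist, and count the matches.
std::size_t countMatched(const std::vector<Center>& a,
                         const std::vector<Center>& b,
                         float max_dist) {
    std::vector<bool> used(b.size(), false);
    std::size_t matched = 0;
    for (const Center& p : a) {
        std::size_t best = b.size();  // sentinel meaning "no match found"
        float best_d = max_dist;
        for (std::size_t j = 0; j < b.size(); ++j) {
            if (used[j]) continue;
            const float dx = p.x - b[j].x;
            const float dy = p.y - b[j].y;
            const float dz = p.z - b[j].z;
            const float d = std::sqrt(dx * dx + dy * dy + dz * dz);
            if (d <= best_d) {
                best_d = d;
                best = j;
            }
        }
        if (best != b.size()) {
            used[best] = true;
            ++matched;
        }
    }
    return matched;
}
```

Boxes left unmatched on either side are the ones worth inspecting in the visualizations. Greedy matching is order-dependent, so this is a quick diagnostic rather than a proper evaluation metric.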

orekides commented 2 years ago

I have tried this model on some robosense point clouds, and the result is weird.

After the fix commit, it works (^ _ ^). The performance is much the same as @OPPOA113 has reported. Many thanks. I also look forward to the metrics for public datasets.

NVIDIA-AI-IOT / CUDA-PointPillars

PostProcessCuda too slow when update code to commit 4e8e4f3 #43