intermediate sparse tensor not align

NVIDIA-AI-IOT / Lidar_AI_Solution

A project demonstrating Lidar-related AI solutions, including three GPU-accelerated Lidar/camera DL networks (PointPillars, CenterPoint, BEVFusion) and the related libs (cuPCL, 3D SparseConvolution, YUV2RGB, cuOSD).


intermediate sparse tensor not align #178

Open HuangVictorAuto opened 1 year ago

HuangVictorAuto commented 1 year ago

@hopef, some background: during deployment testing of the VoxelNeXt model, I am trying to align the PyTorch result (using Yan Yan's spconv) with the deployment result (NVIDIA spconv). I found that the intermediate results from the SCN network cannot be strictly aligned, especially the sparse tensor indices. Is this a known behavior, or is it a bug? Thanks!

VoxelNeXt: the input_conv/conv1 indices are the same and the value difference is acceptable. After conv2, conv3, conv4, and conv5 (the downsample layers and so on), the indices are not the same, and the values differ more because the indices differ. conv6 has the same indices and close values.

CenterPoint: I tested the CenterPoint model again and found the same pattern. The input_conv/conv1 indices and values are the same or close. After conv2, conv3, and conv4, the indices are not the same, and the values differ more because the indices differ. conv_out has the same indices and close values.

What is strange to me is the major difference between PyTorch and libspconv: the conv1 result has x indices in the range 45-2014, so the conv2 result should have x indices of around 45/2 to 2014/2. Why does libspconv produce values around 5?
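The expectation above can be checked with a little index arithmetic. The following is a minimal sketch, assuming a regular (non-submanifold) sparse convolution with kernel size 3, stride 2, padding 1, and dilation 1 along the x axis. Stride 2 is mentioned later in this thread; the other parameters are assumptions for illustration only, and the function names are hypothetical. It ignores clipping by the output spatial size.

```python
def outputs_touched_by(i, kernel=3, stride=2, padding=1, dilation=1):
    """Output coordinates (one axis) that an active input at `i` contributes to.

    An input i feeds output o through kernel offset k when
    i == o * stride - padding + k * dilation.
    """
    touched = []
    for k in range(kernel):
        num = i + padding - k * dilation
        if num >= 0 and num % stride == 0:
            touched.append(num // stride)
    return sorted(touched)


def active_output_range(in_min, in_max, **conv):
    """Lowest and highest output coordinate reachable from inputs in_min..in_max."""
    lowest = outputs_touched_by(in_min, **conv)[0]
    highest = outputs_touched_by(in_max, **conv)[-1]
    return lowest, highest


print(active_output_range(45, 2014))  # (22, 1007)
```

Under these assumed parameters, inputs spanning 45 to 2014 can only activate outputs from 22 to 1007, so an x index around 5 would not be explained by the convolution geometry alone.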

hopef commented 1 year ago

Did you check the final output differences? The code for checking the final result is provided in compare.py.
You can still use compare.py to compare the results, because the indices are sorted before comparison.
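For readers following along, an order-independent comparison of two sparse tensors might look like the sketch below. This is not the code from compare.py; it is a plain-Python illustration, and the function name, the tuple layout of the indices, and the sample data are all hypothetical. It reports indices present on only one side separately from the feature difference on shared indices, which helps tell an ordering mismatch from a genuinely different set of active voxels.

```python
def compare_sparse(idx_a, feat_a, idx_b, feat_b):
    """Compare two sparse tensors regardless of row order.

    idx_*  : list of index tuples, e.g. (batch, z, y, x)
    feat_* : list of per-voxel feature lists, same length as idx_*
    Returns (indices only in A, indices only in B, max abs diff on shared indices).
    """
    map_a = {tuple(i): f for i, f in zip(idx_a, feat_a)}
    map_b = {tuple(i): f for i, f in zip(idx_b, feat_b)}
    only_a = sorted(set(map_a) - set(map_b))   # sorted() on tuples is lexicographic
    only_b = sorted(set(map_b) - set(map_a))
    shared = sorted(set(map_a) & set(map_b))
    max_diff = max(
        (abs(x - y) for key in shared for x, y in zip(map_a[key], map_b[key])),
        default=0.0,
    )
    return only_a, only_b, max_diff


# Made-up data: same voxels in a different order, plus one extra voxel in B.
idx_a = [(0, 3, 1, 2), (0, 1, 1, 1)]
feat_a = [[1.0, 2.0], [0.5, 0.25]]
idx_b = [(0, 1, 1, 1), (0, 3, 1, 2), (0, 4, 0, 0)]
feat_b = [[0.5, 0.25], [1.0, 2.001], [9.0, 9.0]]

only_a, only_b, max_diff = compare_sparse(idx_a, feat_a, idx_b, feat_b)
print(only_a, only_b, max_diff)
```

If `only_a` and `only_b` are both empty, the two tensors have the same active set and only the value tolerance remains to be judged.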

HuangVictorAuto commented 1 year ago

Thanks for the feedback.

I did check the final output differences for the CenterPoint model, and they are OK. This matches my comparison above: the conv_out indices are the same and the conv_out values are close. For the VoxelNeXt model, however, I need not only the final conv_out output but also the intermediate sparse conv outputs. The results above show misaligned voxel indices, especially after the downsample sparse conv layers.
In my comparison, I also lexsort the indices and values first and then compare.

So I hope you can check why the indices differ so much after the sparse conv downsample. The pictures above show the main difference.

For a downsample with stride 2, the starting index should be around 45/2, but the libspconv op's starting index is around 5?

thanks!

hopef commented 1 year ago

Could you upload the onnx file here? Thanks.

HuangVictorAuto commented 1 year ago

scn_sch_onnx.zip

scn_centerpoint_all.onnx is the SCN part of the CenterPoint model. scn_voxelnxet_all.onnx is the SCN part of the VoxelNeXt model. sch_voxelnext_all.onnx is the sparse head part of the VoxelNeXt model; please take a look at this ONNX file as well. I also found an issue with submanifold 2D spconv: https://github.com/NVIDIA-AI-IOT/Lidar_AI_Solution/issues/123

HuangVictorAuto commented 1 year ago

@hopef, sorry to bother you. Are there any updates on this issue?

hopef commented 1 year ago

Could you give me a simple reproduction of the data and code? My guess is that you may be encountering an internal tensor reuse mechanism. The tensor you got may be reused in other locations.
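To illustrate the kind of reuse mechanism meant here, the following is a toy sketch and not libspconv's actual memory planner. The function `plan_buffers`, the `keep_alive` parameter, and the layer graph are hypothetical. It shows how a planner that recycles a buffer after its last consumer has run can hand an intermediate tensor's storage to a later layer, unless that tensor is explicitly kept alive.

```python
def plan_buffers(ops, keep_alive=()):
    """Assign a buffer id to each op output, recycling buffers when possible.

    ops        : list of (output_name, [input_names]) in execution order
    keep_alive : tensor names whose buffers must never be recycled
    """
    # Record the last step at which each tensor is read.
    last_use = {}
    for step, (_, inputs) in enumerate(ops):
        for name in inputs:
            last_use[name] = step

    free, assignment, next_id = [], {}, 0
    for step, (out, inputs) in enumerate(ops):
        if free:
            assignment[out] = free.pop()
        else:
            assignment[out] = next_id
            next_id += 1
        # Release inputs that are no longer needed after this step.
        for name in dict.fromkeys(inputs):
            if (name in assignment and last_use[name] == step
                    and name not in keep_alive):
                free.append(assignment[name])
    return assignment


ops = [("conv1", ["input"]), ("conv2", ["conv1"]),
       ("conv3", ["conv2"]), ("conv4", ["conv3"])]

naive = plan_buffers(ops)
safe = plan_buffers(ops, keep_alive={"conv2"})
print(naive)  # conv2 and conv4 share a buffer, so conv2 is overwritten
print(safe)   # conv2 keeps a buffer of its own
```

In the naive plan, reading conv2 after the whole network has run returns conv4's data, which would look like an intermediate result that does not match the reference.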

hopef commented 1 year ago

I submitted the fixed version, libspconv 1.1.1. The root cause is the different reuse schemes between the rulebook and the tensor. Thanks for your effort.

HuangVictorAuto commented 1 year ago

Thanks for the update. I have checked the result again.
After the first downsample spconv, the indices and values are still not the same. This time the result changed and differs from version 1.1.0, but it is still not aligned with PyTorch:

hopef commented 1 year ago

Can you give me a simple reproduction program? It passes on my test case.

HuangVictorAuto commented 1 year ago

Hi, I tried to come up with a simple sample to demonstrate the bug, but it is very hard; most cases pass. From what I tested, if the tensor is the only output, the result is aligned between the PyTorch model and the engine model. But if there are multiple outputs, the intermediate result is different. Here is an example: the same ONNX model but with multiple outputs. The result for intermediate output 18 is different.

scn_2_3.zip

hopef commented 1 year ago

Hi @HuangVictorAuto, I found the root cause of this bug and pushed a new spconv-1.1.2 to fix it. The main reason is that the reuse rules for indices follow the rulebook, unlike those for features.

HuangVictorAuto commented 1 year ago

@hopef, thanks for the fast update. I tested the above sample. This time the result changes to the following and is still not aligned. Have you tested the two ONNX files I prepared for you? With the same input, is output 18 the same?

hopef commented 1 year ago

Hi, can you add me on WeChat? My WeChat id is: woshixiwanga

JackyChan97 commented 12 months ago

@HuangVictorAuto Hi, I have the same problem. Were you able to solve it?

HuangVictorAuto commented 12 months ago

@JackyChan97, it is not solved here. I went back to the original spconv solution.

JackyChan97 commented 12 months ago

Thank you for your reply.
