Open HuangVictorAuto opened 1 year ago
Thanks for the feedback.
I did check the final output differences for the centerpoint model. It is OK, and matches my comparison above: the conv_out indices are the same and the conv_out values are close. But for the voxelnext model, I need not only the final output conv_out but also the intermediate sparse conv outputs. The result above shows voxel indices that are not aligned, especially after the downsampling sparseconv layers.
During my comparison, I also lexsort the indices and values first and then compare them.
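For reference, the lexsort-then-compare step described above can be sketched roughly as follows. This is only a minimal illustration, assuming the indices are an (N, 4) integer array (batch, z, y, x) and the features an (N, C) float array; the helper names `sort_sparse` and `compare_sparse` and the tolerance are hypothetical, not part of spconv or libspconv.

```python
import numpy as np

def sort_sparse(indices, features):
    """Put the rows of a sparse tensor into a canonical order.

    indices:  (N, D) integer array of voxel coordinates (assumed layout).
    features: (N, C) float array, row i belongs to indices[i].
    """
    # np.lexsort treats the LAST key as the primary one, so the columns are
    # reversed to make column 0 the most significant sort key.
    order = np.lexsort(indices.T[::-1])
    return indices[order], features[order]

def compare_sparse(idx_a, feat_a, idx_b, feat_b, atol=1e-2):
    """Return (same_indices, close_values) for two sparse tensors."""
    idx_a, feat_a = sort_sparse(idx_a, feat_a)
    idx_b, feat_b = sort_sparse(idx_b, feat_b)
    same_indices = bool(idx_a.shape == idx_b.shape and np.array_equal(idx_a, idx_b))
    # Values are only comparable row by row when the index sets match.
    close_values = bool(same_indices and np.allclose(feat_a, feat_b, atol=atol))
    return same_indices, close_values
```

Sorting both sides first removes any dependence on the order in which each backend emits its voxels, so a remaining mismatch means the index sets themselves differ.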
So I hope you can check why the indices differ so much after the sparseconv downsample. The pictures above show the main difference.
For a stride-2 downsample, the starting index should be around 45/2, but the starting index from the libspconv op is around 5?
thanks!
Could you upload the onnx file here? Thanks.
scn_sch_onnx.zip scn_centerpoint_all.onnx is the scn part of the centerpoint model. scn_voxelnxet_all.onnx is the scn part of the voxelnext model. sch_voxelnext_all.onnx is the sparse head part of the voxelnext model; please take a look at this onnx as well. I also found an issue with submanifold 2d spconv here: https://github.com/NVIDIA-AI-IOT/Lidar_AI_Solution/issues/123
@hopef , sorry to bother, any updates on this issue?
Could you give me a simple reproduction of the data and code? My guess is that you may be encountering an internal tensor reuse mechanism. The tensor you got may be reused in other locations.
I submitted the fixed version, libspconv1.1.1. The root cause is the different reuse schemes between the rulebook and the tensor. Thanks for your effort.
Thanks for your update, I have checked the result again.
After the first downsample spconv, the indices and values are still not the same. This time it changed to this, which is different from version 1.1.0 but still not aligned with pytorch:
Can you give me a simple reproduction program? Because it passes on my test case.
Hi, I tried to come up with an easy sample to demonstrate the bug, but it is very hard. It passes in most cases. From what I tested, if the output is the only output, then the result is aligned between the pytorch model and the engine model. But if there are multiple outputs, then the intermediate result is different. Here is an example: the same ONNX model but with multiple outputs. The result for intermediate output 18 is different.
Hi, @HuangVictorAuto I found the root cause of this bug. And I pushed a new spconv-1.1.2 to fix this bug. The main reason is that the reuse rules for indices follow the rulebook, unlike features.
@hopef , thanks for the fast update. I tested the above sample. This time it changes to the following, still not aligned. Have you tested the two onnx files I prepared for you above? With the same input, is output 18 the same?
Hi, can you add me on WeChat? My WeChat id is: woshixiwanga
@HuangVictorAuto Hi, I got the same problem. Were you able to solve it?
@JackyChan97 , not solved here. I went back to the original spconv solution.
thank you for your reply
@hopef , the background is: during the deployment test for the voxelnext model, I want to align the pytorch result (with yanyan spconv) and the deployment result (nvidia spconv). I found that the intermediate results from the scn net cannot be strictly aligned, especially the sparse tensor indices. I want to know whether this is known to you or whether it is a bug. Thanks!
Voxelnext: for input_conv/conv1, the indices are the same and the value difference is acceptable. After conv2, conv3, conv4, conv5 (with the downsample layers etc.), the indices are not the same, and the values differ more because the indices are not the same. conv6 has the same indices and close values.
Centerpoint: I have tested the centerpoint model again and found the same: for input_conv/conv1, the indices and values are the same or close. After conv2, conv3, conv4, the indices are not the same, and the values differ more because the indices are not the same. conv_out shows the same indices and close values.
This is what is strange to me, the major difference between pytorch and libspconv: the conv1 result index x ranges over 45-2014.
The conv2 result index x should then be around 45/2-2014/2, but libspconv has values around 5?
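The expected range can be checked with a small 1-D calculation. This is only a sketch under assumptions not stated in the thread: that conv2 is a regular (non-submanifold) sparse conv with kernel size 3, stride 2 and padding 1 along x, and that the upper bound of the output grid is ignored. The function names are hypothetical.

```python
def output_coords_1d(x, kernel=3, stride=2, padding=1):
    """Output positions that an active input position x contributes to.

    Output o reads input positions o*stride - padding + k for k in [0, kernel),
    so x contributes to o whenever (x + padding - k) is a non-negative
    multiple of stride.
    """
    outs = set()
    for k in range(kernel):
        num = x + padding - k
        if num >= 0 and num % stride == 0:
            outs.add(num // stride)
    return sorted(outs)

def output_range_1d(x_min, x_max, **conv_params):
    """Smallest and largest output index produced by inputs in [x_min, x_max]."""
    lo = min(output_coords_1d(x_min, **conv_params))
    hi = max(output_coords_1d(x_max, **conv_params))
    return lo, hi

# Input x range reported for conv1 in this issue: 45 to 2014.
print(output_range_1d(45, 2014))  # (22, 1007)
```

Under these assumptions the conv2 x indices should start at 22, which is why a starting value around 5 looks wrong.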