zen-star closed this issue 1 year ago
Thanks for your interest!
Best Regards, Haiyang
Thanks for your quick and clear response! One more question: have you tried any position encoding other than the FC-BN-RELU-FC used in the code?
We have not tried other positional encodings. The current approach does introduce some redundant computation, which can be optimized (hopefully from 5 ms to 0 ms at inference). However, in my humble opinion, positional embedding may not be very important for point clouds, which already contain rich location information.
I like your work very much! I hope the code for nuScenes, and even for indoor datasets (e.g., SUN RGB-D), can be released soon!
Thanks for your interest! We are working on the nuScenes code and will do our best to release it as soon as possible. :)
Best, Haiyang
I'm confused about the hybrid factors.
Given a base window shape [12, 12, 32] and a hybrid factor [2, 2, 1], the hybrid window shape is [24, 24, 32]. The [12, 12, 32] window is used for the non-shifted partition, while the [24, 24, 32] window is used for the shifted partition with shifts [6, 6, 0]. So it seems that the only difference between the non-hybrid (Swin/SST-like) version and the hybrid version is the window shape used when shifting.
In the paper, hybrid window partition is described as improving efficiency, but I can't find a detailed explanation. Is the efficiency related to the redundant padding voxel tokens, which are fewer with a larger window shape? And why not try large windows for both partitions, e.g., window1=(24,24) & window2=(24,24), in Table 5?
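To make the arithmetic concrete, here is a minimal sketch of how I understand the two partitions. The function names `hybrid_window_shape` and `window_index` are hypothetical (not taken from the repository), and adding the shift to the voxel coordinate before the integer division is my assumption; the actual code may apply the shift differently.

```python
from typing import Sequence, Tuple


def hybrid_window_shape(base_shape: Sequence[int],
                        hybrid_factor: Sequence[int]) -> Tuple[int, ...]:
    """Scale the base window shape per axis by the hybrid factor."""
    return tuple(b * f for b, f in zip(base_shape, hybrid_factor))


def window_index(coord: Sequence[int],
                 window_shape: Sequence[int],
                 shift: Sequence[int] = (0, 0, 0)) -> Tuple[int, ...]:
    """Return the per-axis index of the window containing a voxel.

    Assumption: the shift is added to the coordinate before dividing
    by the window size.
    """
    return tuple((c + s) // w for c, s, w in zip(coord, shift, window_shape))


base_shape = (12, 12, 32)      # non-shifted partition
hybrid_factor = (2, 2, 1)
shifts = (6, 6, 0)

shifted_shape = hybrid_window_shape(base_shape, hybrid_factor)  # (24, 24, 32)

voxel = (13, 5, 10)            # an arbitrary example voxel coordinate
print(window_index(voxel, base_shape))             # non-shifted window
print(window_index(voxel, shifted_shape, shifts))  # shifted hybrid window
```

With a hybrid factor of [1, 1, 1], the same sketch reduces to the Swin/SST-like case, where both partitions use [12, 12, 32].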
Looking forward to your reply. ^^