Haiyang-W / DSVT

[CVPR2023] Official Implementation of "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets"
https://arxiv.org/abs/2301.06051
Apache License 2.0
361 stars 28 forks

What is the principle of hybrid factors? #12

Closed: zen-star closed this issue 1 year ago

zen-star commented 1 year ago

I'm confused about the hybrid factors.

Given a base window shape of [12, 12, 32], the hybrid window shape is [24, 24, 32] with a hybrid factor of [2, 2, 1]. The [12, 12, 32] window is used for the non-shifted partition, while [24, 24, 32] is used for the shifted partition with shifts of [6, 6, 0]. It seems that the only difference between the non-hybrid (Swin/SST-like) and hybrid versions is the window shape used when shifting.
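To make the numbers above concrete, here is a minimal sketch of how I read the relationship between the base window, the hybrid factor, and the shifted window. The helper names `hybrid_window_shape` and `window_index` are hypothetical and not taken from the DSVT code; the axis order and the "add the shift, then floor-divide" convention are also my assumptions, so the repository may do this differently.

```python
# Hypothetical helpers for illustration only; not the DSVT implementation.

def hybrid_window_shape(base_shape, hybrid_factor):
    """Scale each axis of the base window by the matching hybrid factor."""
    return [b * f for b, f in zip(base_shape, hybrid_factor)]


def window_index(voxel_coord, window_shape, shift=(0, 0, 0)):
    """Return the window a voxel falls into.

    Assumption: the shift is added to the coordinate before floor division,
    and the coordinate uses the same axis order as the window shape.
    """
    return tuple((c + s) // w for c, s, w in zip(voxel_coord, shift, window_shape))


base = [12, 12, 32]   # window used for the non-shifted partition
factor = [2, 2, 1]    # hybrid factor
shift = [6, 6, 0]     # shifts used with the hybrid window

hybrid = hybrid_window_shape(base, factor)
print(hybrid)  # [24, 24, 32]

voxel = (13, 5, 3)  # an arbitrary example voxel coordinate
print(window_index(voxel, base))           # non-shifted partition, base window
print(window_index(voxel, hybrid, shift))  # shifted partition, hybrid window
```

Under these assumptions, the example voxel is assigned to window (1, 0, 0) in the non-shifted partition and to window (0, 0, 0) in the shifted hybrid partition, so only the shifted pass sees the larger window.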

In the paper, the hybrid window partition is described as improving efficiency, but I couldn't find a detailed explanation. Is the efficiency related to the redundant padding voxel tokens, which are fewer with a larger window shape? And why not try large windows for both partitions, e.g., window1=(24,24) & window2=(24,24) in Table 5?

Looking forward to your reply. ^^

Haiyang-W commented 1 year ago

Thanks for your interest!

Best Regards, Haiyang

zen-star commented 1 year ago

Thanks for your quick and clear response! One more question: have you tried any types of position encoding other than the FC-BN-ReLU-FC used in the code?

Haiyang-W commented 1 year ago

We have not tried other positional encodings. The current approach does introduce some redundant computation, which can be optimized (hopefully from 5 ms to 0 ms at inference). However, in my humble opinion, the positional embedding may not be very important for point clouds, which already contain rich location information.

zen-star commented 1 year ago

I like your work very much! I hope the code for nuScenes and even indoor datasets (e.g., SUN RGB-D) can be released soon!

Haiyang-W commented 1 year ago

Thanks for your interest! We are working on the nuScenes code and will do our best to release it as soon as possible. :)

Best, Haiyang