What is the principle of hybrid factors?

Haiyang-W / DSVT

[CVPR2023] Official Implementation of "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets"

https://arxiv.org/abs/2301.06051

Apache License 2.0

What is the principle of hybrid factors? #12

Closed: zen-star closed this issue 1 year ago

zen-star commented 1 year ago

I'm confused about the hybrid factors.

Given a base window shape [12, 12, 32], the hybrid one is [24, 24, 32] with the hybrid factor [2, 2, 1]. [12, 12, 32] is used for the non-shifted partition, while [24, 24, 32] is used for the shifted partition with shifts [6, 6, 0]. It seems that the only difference between the non-hybrid (Swin/SST-like) and hybrid versions is the window shape when shifting.
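To make the numbers above concrete, here is a minimal sketch of how the two partitions could assign a voxel to a window. The helper names `window_shape` and `window_index` are hypothetical and not taken from the DSVT code, and adding the shift before the integer division is an assumption about the convention.

```python
def window_shape(base, hybrid_factor, shifted):
    """Return the window shape for the non-shifted or the shifted partition."""
    if not shifted:
        return list(base)
    # Hybrid partition: the shifted pass uses a window enlarged by the factor.
    return [b * f for b, f in zip(base, hybrid_factor)]


def window_index(coord, shape, shift):
    """Map a voxel coordinate (x, y, z) to the index of the window containing it.

    Assumption: the shift is added to the coordinate before integer division.
    """
    return tuple((c + s) // w for c, s, w in zip(coord, shift, shape))


base = [12, 12, 32]
hybrid_factor = [2, 2, 1]

plain = window_shape(base, hybrid_factor, shifted=False)   # [12, 12, 32]
hybrid = window_shape(base, hybrid_factor, shifted=True)   # [24, 24, 32]

voxel = (30, 7, 3)  # arbitrary example coordinate
print(window_index(voxel, plain, (0, 0, 0)))   # (2, 0, 0)
print(window_index(voxel, hybrid, (6, 6, 0)))  # (1, 0, 0)
```

With a hybrid factor of [1, 1, 1] the shifted pass would fall back to the base shape, which corresponds to the non-hybrid setting described in the question.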

In the paper, the hybrid window partition is said to improve efficiency, but I can't find a detailed explanation. Is the efficiency gain related to the redundant padding voxel tokens, which are fewer with a larger window shape? And why not try large windows for both partitions, e.g., window1=(24,24) & window2=(24,24) in Table 5?

Look forward to your reply.^^

Haiyang-W commented 1 year ago

Thanks for your interest!

You are right. The only difference is the window shape, which aims to reduce the padding cost and deliver a good performance-efficiency trade-off (see Table 5). However, we observe that attention itself is quite efficient, so a small amount of padding doesn't significantly slow down inference. What does increase latency is frequent calls to the attention function, because each call adds overhead (e.g., memory access). The non-hybrid variant is actually not bad either.
"Increasing the window sizes will reduce the number of sets and lead to the drop of self-attention’s computation cost, but also destroy the detection performance for small objects." A larger window shape will hurt performance and won't give you much of a speed boost (see Table 7).

Best Regards, Haiyang
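The padding trade-off discussed above can be illustrated with a small counting sketch. It assumes each non-empty window is split into fixed-size sets and that the last set of a window is filled up with padded slots. The function `padding_stats`, the set size of 36, and the voxel counts are all illustrative assumptions, not values measured from DSVT.

```python
import math


def padding_stats(voxels_per_window, set_size):
    """Return (number of sets, number of padded token slots) for one partition."""
    num_sets = 0
    padded = 0
    for n in voxels_per_window:
        if n == 0:
            continue  # empty windows produce no sets
        sets = math.ceil(n / set_size)
        num_sets += sets
        padded += sets * set_size - n  # slots filled only to reach set_size
    return num_sets, padded


# Hypothetical voxel counts in four neighbouring base windows ...
small = [10, 25, 40, 5]
# ... and the same voxels merged into one window that is twice as large per side.
large = [sum(small)]

print(padding_stats(small, set_size=36))  # (5, 100)
print(padding_stats(large, set_size=36))  # (3, 28)
```

In this toy case the larger window needs fewer sets and fewer padded slots, which matches the direction of the argument in the reply. It says nothing about accuracy, which the reply notes can drop for small objects.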

zen-star commented 1 year ago

Thanks for your quick and clear response! One more question: have you tried other types of position encoding besides the FC-BN-ReLU-FC used in the code?

Haiyang-W commented 1 year ago

We have not tried other positional encodings. The current approach does introduce some redundant computation, which can be optimized (hopefully 5ms -> 0ms in inference). However, in my humble opinion, position embedding may not be very important for point clouds, which already contain rich location information.
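For readers unfamiliar with the FC-BN-ReLU-FC pattern mentioned in the question, a minimal PyTorch sketch might look like the following. The class name `PosEmbedSketch`, the input dimension, and the feature width are assumptions chosen for illustration; this is not the repository's actual module.

```python
import torch
from torch import nn


class PosEmbedSketch(nn.Module):
    """Learned position embedding: Linear -> BatchNorm -> ReLU -> Linear."""

    def __init__(self, in_dim: int, feat_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, feat_dim),
            nn.BatchNorm1d(feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (num_voxels, in_dim) float tensor of voxel positions
        return self.mlp(coords)


# Hypothetical usage: 8 voxels with 3-D positions mapped to 16-D embeddings.
pos_embed = PosEmbedSketch(in_dim=3, feat_dim=16)
coords = torch.rand(8, 3)
embedding = pos_embed(coords)
print(embedding.shape)  # torch.Size([8, 16])
```

The embedding would typically be added to the voxel features before attention. Whether that step can be cached or removed is the optimization the reply alludes to.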

zen-star commented 1 year ago

I like your work very much! I hope the code for nuScenes and even indoor datasets (e.g., SUN RGB-D) can be released soon!

Haiyang-W commented 1 year ago

Thanks for your interest! We are working on the code of nuScenes and trying our best to release the code as soon as possible. :)

Best, Haiyang