Closed by hly2990 1 year ago
VoxFormer-S uses only one image to predict the depth map, which will be less accurate than the one generated in VoxFormer-T, where 5 images are used (the current frame and 4 previous frames). A more accurate depth map means more accurate voxel queries, which in turn affects model performance. This does not depend on the size of the model, only on the accuracy of the 3D occupancy grid (voxel queries).
The difference lies in stage-2 rather than stage-1. In VoxFormer-S, the voxel queries interact only with the current frame, while in VoxFormer-T they interact with the current and previous frames.
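To illustrate why the two variants can have the same model size, here is a minimal NumPy sketch. It is not VoxFormer's code: it swaps in plain dense single-head cross-attention for the deformable attention used in the paper, and the class name `CrossAttention`, the tensor sizes, and the random features are all hypothetical. The frame count of five follows the comment earlier in this thread. The point is only that the same weights can attend over one frame or several, so adding frames changes the inputs, not the parameter count.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class CrossAttention:
    """Dense single-head cross-attention (a stand-in, NOT deformable attention)."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.dim = dim
        self.w_q = rng.normal(0.0, scale, (dim, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))

    def num_params(self):
        # Parameter count is fixed by dim, independent of how many frames are fed in.
        return sum(w.size for w in (self.w_q, self.w_k, self.w_v))

    def __call__(self, queries, frame_feats):
        # frame_feats: list of (num_pixels, dim) arrays, one per frame.
        feats = np.concatenate(frame_feats, axis=0)
        q = queries @ self.w_q
        k = feats @ self.w_k
        v = feats @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(self.dim), axis=-1)
        return attn @ v, attn


# Hypothetical sizes, chosen only to keep the example small.
dim, num_queries, num_pixels = 16, 8, 32
rng = np.random.default_rng(1)
queries = rng.normal(size=(num_queries, dim))      # voxel queries from stage-1
frames = [rng.normal(size=(num_pixels, dim)) for _ in range(5)]  # frames[0] is the current frame

layer = CrossAttention(dim)
out_s, attn_s = layer(queries, frames[:1])  # "S"-style: current frame only
out_t, attn_t = layer(queries, frames)      # "T"-style: current + 4 previous frames

print("parameters:", layer.num_params())
print("output shapes:", out_s.shape, out_t.shape)
print("attention shapes:", attn_s.shape, attn_t.shape)
```

Both calls use the same `layer`, so the parameter count and the output shape are identical. Only the attention map gets wider in the multi-frame case, because each query can draw on features from more frames.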
Hi~ I'd like to know the difference between VoxFormer-S and VoxFormer-T. Their model sizes look identical. What is the purpose of using sequential frames?