Why FlashOcc could achieve such outstanding performance with only conv2d ops?

Yzichen / FlashOCC

Apache License 2.0


Why FlashOcc could achieve such outstanding performance with only conv2d ops? #7

Open QingXIA233 opened 10 months ago

QingXIA233 commented 10 months ago

Hello, @Yzichen, brilliant idea and nice work! With all due respect, I wonder why FlashOcc can achieve the amazing results shown in your paper with only conv2d ops. A couple of months ago, I trained BevdetOcc2D, which also uses a 2D img_bev_encoder_backbone and img_bev_encoder_neck (z is collapsed). The only difference from FlashOcc is that the features are reshaped to 3D before they enter the head, so the head still uses Conv3d. When I read your code, I expected to see some extraordinary operation or design in the head, but the only difference is that it simply replaces conv3d with conv2d and adds some reshape ops to get the final predictions. I can't help wondering why this kind of architecture works so well, or whether there is some novel design that I didn't notice. Please provide some hints. Thanks a lot.

Yzichen commented 10 months ago

Compared with BEVDet-OCC, we indeed just replace conv3d with conv2d and then introduce a channel2height operation to obtain 3D occupancy prediction results. No other tricks.
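To make the channel2height idea concrete, here is a minimal sketch of the reshape it refers to. It is not the repository's code: NumPy stands in for the framework tensors, and the function name `channel_to_height`, the grid size, the 16 height bins, and the 18 classes are all assumptions chosen for illustration.

```python
import numpy as np

def channel_to_height(bev_logits, num_z, num_classes):
    """Reinterpret (B, num_z * num_classes, H, W) BEV logits as
    (B, H, W, num_z, num_classes) voxel logits. Hypothetical helper."""
    b, c, h, w = bev_logits.shape
    assert c == num_z * num_classes, "channel axis must hold num_z * num_classes values"
    # Move the channel axis last: (B, H, W, num_z * num_classes).
    x = bev_logits.transpose(0, 2, 3, 1)
    # Split the channel axis into height bins and per-class scores.
    return x.reshape(b, h, w, num_z, num_classes)

# Assumed sizes: 16 height bins, 18 semantic classes, a small 8 x 8 BEV grid.
num_z, num_classes = 16, 18
bev = np.random.rand(2, num_z * num_classes, 8, 8).astype(np.float32)

occ = channel_to_height(bev, num_z, num_classes)
labels = occ.argmax(axis=-1)  # per-voxel class index

print(occ.shape)     # (2, 8, 8, 16, 18)
print(labels.shape)  # (2, 8, 8, 16)
```

The point of the sketch is that no 3D operator is involved: the height dimension lives in the channel axis of a 2D feature map until the final reshape.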

QingXIA233 commented 10 months ago

> Compared with BEVDet-OCC, we indeed just replace conv3d with conv2d and then introduce a channel2height operation to obtain 3D occupancy prediction results. No other tricks.

@Yzichen I got your point. However, two concerns remain:

1. I understand that replacing conv3d with conv2d ops makes the model run faster, but I don't get why it would achieve better results. This is why I asked about tricks and designs.
2. Conv3d usually takes longer to converge than its conv2d counterpart. I wonder whether you trained the original BEVDetOCC model for more epochs to make sure it is fully trained.

The first point is the main mystery that draws my attention, and I really hope you can enlighten me a little. Thanks a lot.
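The speed part of the first concern can be illustrated with a rough multiply-accumulate (MAC) count for a single 3x3(x3) convolution layer. This is a back-of-the-envelope sketch, not a measurement: the 200 x 200 x 16 grid and the channel width of 64 are assumed values, and `conv_macs` is a hypothetical helper.

```python
def conv_macs(c_in, c_out, kernel_volume, out_positions):
    """MACs for one dense convolution layer with 'same' padding and stride 1."""
    return c_in * c_out * kernel_volume * out_positions

# Assumed sizes, for illustration only.
H, W, Z = 200, 200, 16   # BEV grid and number of height bins
C = 64                   # channel width, kept equal for both variants

macs_3d = conv_macs(C, C, 3 * 3 * 3, H * W * Z)  # 3x3x3 kernel over every voxel
macs_2d = conv_macs(C, C, 3 * 3, H * W)          # 3x3 kernel over every BEV cell

print(macs_3d // macs_2d)  # 48
```

At equal channel width the ratio is 3 * Z, so the 2D layer has room for wider channels while staying cheaper. This says nothing about accuracy, which is the open question in the thread.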

Yzichen commented 10 months ago

To be honest, I don't have a definitive answer. Our original guess was that using conv2d would better benefit from the pre-trained detection weights, but later we found that conv2d performs slightly better than conv3d even without the detection weights.
Sorry, we lack the computational resources to train more epochs.

drilistbox commented 8 months ago

We thought about this problem for several days. The baseline methods (i.e., BEVDetOCC, UniOCC, and FBOCC) are based on voxel-level features, so they learn from a large number of non-object regions. As a result, 1) the easy non-object regions can dominate the training process and lead to degenerate models, and 2) the imbalanced samples make learning harder for the models. These two issues offset the otherwise excellent performance of 3D conv.
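A toy calculation shows how easy non-object voxels can dominate an unweighted cross-entropy loss. All numbers here are made up for illustration (95% free voxels predicted confidently, 5% occupied voxels predicted poorly), and `loss_share` is a hypothetical helper, not something from the FlashOcc code.

```python
import math

def loss_share(fractions, true_class_probs):
    """Share of the mean cross-entropy contributed by each group of voxels.

    fractions: group -> fraction of all voxels in that group
    true_class_probs: group -> probability the model assigns to the correct class
    """
    contrib = {g: fractions[g] * -math.log(true_class_probs[g]) for g in fractions}
    total = sum(contrib.values())
    return {g: c / total for g, c in contrib.items()}

# Illustrative values only: free space is abundant and easy, objects are rare and hard.
fractions = {"free": 0.95, "occupied": 0.05}
true_class_probs = {"free": 0.9, "occupied": 0.3}

share = loss_share(fractions, true_class_probs)
for group, s in share.items():
    print(f"{group}: {s:.2f} of the total loss")
```

Even though each free voxel has a small loss, their sheer number gives them the larger share of the gradient signal under these assumed values, which matches point 1) above.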

LinuxCup commented 1 month ago

@drilistbox Hello, may I ask a question? Why not use lidar directly for the Occ task, given that most methods use images? Point clouds can use sparse conv to avoid the above problems. I am new to this field. Thanks!

drilistbox commented 1 month ago

@LinuxCup

1. A camera-based solution is cheaper and more stable than a lidar-based one.
2. A camera-based solution can construct occupancy voxels for small or thin (vimineous) objects, which may be missed by the lidar points.
3. On NVIDIA GPUs, sparse conv can also be applied to 3D-conv-based methods (e.g., FBOCC, BEVDetOCC, RenderOcc). But many other edge chips do not support sparse conv.

LinuxCup commented 1 month ago

@drilistbox Thank you for your reply! I will read the papers you mentioned above. Thanks.