Open QingXIA233 opened 10 months ago
Compared with BEVDet-OCC, we indeed just replace conv3d with conv2d and then introduce a channel2height operation to obtain 3D occupancy prediction results. No other tricks.
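For reference, here is a minimal sketch of what the channel-to-height step amounts to (module and parameter names are illustrative assumptions, not the actual code in this repo; the real pipeline also runs a 2D BEV encoder before the head):

```python
import torch
import torch.nn as nn

class Channel2HeightHead(nn.Module):
    """Illustrative FlashOcc-style head: a plain Conv2d predicts
    z_size * num_classes logits per BEV cell, and the channel axis is then
    reshaped back into a height axis to recover per-voxel occupancy logits."""

    def __init__(self, in_channels=256, num_classes=18, z_size=16):
        super().__init__()
        self.num_classes = num_classes
        self.z_size = z_size
        self.head = nn.Conv2d(in_channels, z_size * num_classes, kernel_size=1)

    def forward(self, bev_feat):
        # bev_feat: (B, C, H, W) BEV features with the height axis collapsed.
        logits = self.head(bev_feat)                       # (B, Dz*K, H, W)
        b, _, h, w = logits.shape
        logits = logits.view(b, self.z_size, self.num_classes, h, w)
        # (B, K, Dz, H, W): per-voxel class logits without any Conv3d.
        return logits.permute(0, 2, 1, 3, 4).contiguous()
```

So all of the 3D structure comes from interpreting channels as height bins at the very end; everything upstream stays 2D.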
@Yzichen I got your point. However, two concerns remain: 1. I understand that replacing conv3d with conv2d ops makes the model run faster, but I don't get why it also achieves better results. This is the reason I asked about the tricks and designs. 2. Conv3d usually takes longer to converge than its conv2d counterpart. I wonder whether you trained the original BEVDetOCC model with more epochs to make sure it was fully trained. The first point above is the main mystery that draws my attention; I really hope you could enlighten me a little bit. Thanks a lot.
We thought about this problem for several days. Since the baseline methods (i.e., BEVDetOCC, UniOCC and FBOCC) operate on voxel-level features, they have to learn a large number of non-object regions; as a result, 1) the easy non-object regions can dominate the training process and lead to degenerated models, and 2) the imbalanced samples increase the learning difficulty for the models. These two issues offset the excellent performance of 3D-Conv.
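To put rough numbers on this (purely illustrative assumptions, not measured dataset statistics):

```python
# Back-of-the-envelope illustration of the imbalance a voxel-level head faces.
# The 200 x 200 x 16 grid is the common Occ3D-nuScenes setting; the free-space
# fraction below is an assumption for illustration, not a measured statistic.
voxels = 200 * 200 * 16            # cells a Conv3d (voxel-level) head must classify
assumed_free_fraction = 0.9        # assumption: most voxels are easy free space
occupied = voxels * (1 - assumed_free_fraction)
print(f"{voxels} voxels, ~{occupied:.0f} occupied -> ~{voxels / occupied:.0f}:1 imbalance")
```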
@drilistbox Hello, can I ask a question: why not use LiDAR directly for the Occ task? Most methods use images. Point clouds can use sparse conv to avoid the problems above. I am a newcomer to this field. Thanks!
@LinuxCup
@drilistbox Thank you for your reply! I will read the papers you mentioned above. Thanks.
Hello, @Yzichen, brilliant idea and nice work! With all due respect, I wonder why FlashOcc can achieve the amazing results shown in your paper with only conv2d ops. A couple of months ago, I trained BevdetOcc2D, which also uses a 2D img_bev_encoder_backbone and img_bev_encoder_neck (z is collapsed); the only difference from FlashOcc is that before the features enter the head, they are reshaped to 3D, so the head still uses Conv3d. When I read your code, I expected to see some extraordinary operation or design in the head part, but the difference is simply that it replaces the conv3d with conv2d and adds some reshape ops to get the final predictions. I can't help wondering why this kind of architecture works so well, or maybe there is some novel design that I didn't notice. Please provide some hints. Thank you a lot.
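For concreteness, a minimal sketch of the variant described above (all names and shapes are illustrative assumptions, not the actual BEVDet-OCC/FlashOcc code): the collapsed BEV channels are re-expanded into a height axis and the classifier stays Conv3d, in contrast to the Conv2d-then-reshape head sketched earlier in the thread.

```python
import torch
import torch.nn as nn

class ReshapeThenConv3dHead(nn.Module):
    """Illustrative 'BevdetOcc2D'-style head: BEV features with z collapsed
    are viewed as a thin 3D volume before a Conv3d classifier."""

    def __init__(self, in_channels=256, z_size=16, num_classes=18):
        super().__init__()
        assert in_channels % z_size == 0
        self.z_size = z_size
        self.head = nn.Conv3d(in_channels // z_size, num_classes, kernel_size=1)

    def forward(self, bev_feat):
        # bev_feat: (B, C, H, W) -> (B, C/Dz, Dz, H, W) -> per-voxel class logits.
        b, c, h, w = bev_feat.shape
        voxel_feat = bev_feat.view(b, c // self.z_size, self.z_size, h, w)
        return self.head(voxel_feat)  # (B, num_classes, Dz, H, W)
```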