XuyangBai / TransFusion

[PyTorch] Official implementation of CVPR2022 paper "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers". https://arxiv.org/abs/2203.11496
Apache License 2.0

About GPU memory usage #10

Closed: Fan-Yixuan closed this issue 1 year ago

Fan-Yixuan commented 2 years ago

Thanks for your great work! I am trying to reimplement your work with the new version (v1.0.0) of mmdet3d. My environment:

sys.platform: linux
Python: 3.8.11 (default, Aug  3 2021, 15:09:35) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.1, V11.1.74
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.9.1
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.10.1
OpenCV: 4.5.3
MMCV: 1.5.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.23.0
MMSegmentation: 0.24.0
MMDetection3D: 1.0.0rc1+c7cde78

I have dealt with the coordinate system refactoring problem and also the img_fields issue, but with one sample per 24GB RTX 3090 GPU I can only train with up to 50 query proposals. Using the default config (nuScenes, LiDAR and camera, R50-FPN, SECOND LiDAR backbone, 200 queries) runs into CUDA OOM.

Noting your practice in https://github.com/XuyangBai/TransFusion/issues/6#issuecomment-1111812026, I am asking for help here. I also couldn't find where you use spconv; I hope you can provide more details. Thanks a lot.

XuyangBai commented 2 years ago

Hi @Fan-Yixuan, thanks for your interest in our work. I have trained TransFusion on 8 3090 GPUs and it fit into memory, so I am not sure what is happening in your environment. But you could try spconv 1.2 to reduce the memory usage. spconv is used in SparseEncoder. mmdet3d includes spconv in its repo under mmdet3d/ops/spconv, but it is an old version. To use another version of spconv, I installed it following the instruction here and replaced https://github.com/XuyangBai/TransFusion/blob/53370467c1b88f163cbe7b7300a1f588a6761e35/mmdet3d/ops/spconv/__init__.py#L14-L20

by something like

from spconv import SparseConv2d, ...

Fan-Yixuan commented 2 years ago

Thanks a lot for your help. I'm using the latest spconv 2.1.21 and can now train with 200 queries at one sample per 3090 using ~22GB memory, although 2 samples per GPU is still not achievable. I will keep exploring to better solve this problem!
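
A minimal sketch of the import swap discussed above, for anyone trying the same thing. It assumes that spconv 2.x exposes its PyTorch layers under `spconv.pytorch`, that spconv 1.x exposes them at the top level of `spconv`, and that the bundled copy lives at `mmdet3d.ops.spconv` (the path named in this thread). The helper name `first_importable` and the preference order are hypothetical, not part of either library.

```python
import importlib


def first_importable(candidates):
    """Return (name, module) for the first module name that imports cleanly."""
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            # Not installed (or a parent package is missing): try the next one.
            continue
    raise ImportError(f"none of {list(candidates)} could be imported")


# Assumed preference order: external spconv 2.x, external spconv 1.x,
# then the older copy bundled with mmdet3d.
SPCONV_CANDIDATES = ("spconv.pytorch", "spconv", "mmdet3d.ops.spconv")
```

Calling `first_importable(SPCONV_CANDIDATES)` inside `mmdet3d/ops/spconv/__init__.py` would be one way to pick the backend; the layer names exported by each version still need to be checked by hand, since they are not guaranteed to match.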

Fan-Yixuan commented 2 years ago

@XuyangBai Hi dear author, I would like to ask whether it is intentional that TransFusion's prediction heads contain no branches for attribute prediction (moving, stopped, parked vehicle, etc.). I'm not familiar with this task (nuScenes). Why does it work like this instead of reducing AAE by adding such branches?

XuyangBai commented 2 years ago

I basically follow mmdet3d and obtain the attribute prediction using some post-processing rules; check the code here: https://github.com/XuyangBai/TransFusion/blob/53370467c1b88f163cbe7b7300a1f588a6761e35/mmdet3d/datasets/nuscenes_dataset.py#L319
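
For readers unfamiliar with this kind of rule, here is a hedged sketch of speed-based attribute assignment. The attribute strings are nuScenes attribute names, but the per-class defaults, the 0.2 m/s threshold, and the function name `assign_attribute` are assumptions for illustration; the linked file is the authority on what the repository actually does.

```python
import math

# Assumed fallback attribute per class; verify against the linked file.
DEFAULT_ATTRIBUTE = {
    'car': 'vehicle.parked',
    'pedestrian': 'pedestrian.moving',
    'trailer': 'vehicle.parked',
    'truck': 'vehicle.parked',
    'bus': 'vehicle.moving',
    'motorcycle': 'cycle.without_rider',
    'construction_vehicle': 'vehicle.parked',
    'bicycle': 'cycle.without_rider',
    'barrier': '',
    'traffic_cone': '',
}

MOVING_SPEED_THRESHOLD = 0.2  # m/s, assumed value


def assign_attribute(class_name, vx, vy):
    """Pick an attribute from the predicted class and BEV velocity only."""
    speed = math.hypot(vx, vy)
    if speed > MOVING_SPEED_THRESHOLD:
        if class_name in ('car', 'construction_vehicle', 'bus', 'truck', 'trailer'):
            return 'vehicle.moving'
        if class_name in ('bicycle', 'motorcycle'):
            return 'cycle.with_rider'
        return DEFAULT_ATTRIBUTE[class_name]
    # Slow or static objects: special-case a few classes, else use the default.
    if class_name == 'pedestrian':
        return 'pedestrian.standing'
    if class_name == 'bus':
        return 'vehicle.stopped'
    return DEFAULT_ATTRIBUTE[class_name]
```

The point of the sketch is that no learned attribute branch is involved: the attribute is derived from outputs the head already produces.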

Fan-Yixuan commented 2 years ago

Yes I noticed, but it seems strange to directly use the default attribute, is there any official statement as to why this is done?

XuyangBai commented 2 years ago

Ah sorry, I just use it as the de facto approach and never thought carefully about this issue.

Fan-Yixuan commented 2 years ago

Ok, since mmdet3d implements it like this, it should have its own reason, haha.

Fan-Yixuan commented 2 years ago

@XuyangBai Hi dear author, I finished training transfusion_nusc_voxel_L and got a val set performance of 64.63mAP/69.99NDS. The earlier GPU memory problem has been solved: it was caused by the images not being resized to 448*800 due to a version issue. However, training after adding the cameras ran into problems: mATE, mASE, mAOE, and mAVE all increase as training proceeds. Do you have any suggestions for possible causes? In particular, it seems that GlobalRotScaleTrans and RandomFlip3D could cause a mismatch between LiDAR and camera?
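
To put the memory fix mentioned above into numbers, here is a small back-of-the-envelope sketch. It assumes the native nuScenes camera resolution of 900x1600 and six camera views per sample; the claim that image-branch activation memory scales roughly with pixel count is a rule of thumb, not a measurement from this thread.

```python
NATIVE_HW = (900, 1600)   # nuScenes camera resolution (height, width)
RESIZED_HW = (448, 800)   # size mentioned in the thread
NUM_CAMERAS = 6           # camera views per nuScenes sample


def pixels_per_sample(hw, num_views=NUM_CAMERAS):
    """Total number of input pixels across all camera views of one sample."""
    h, w = hw
    return h * w * num_views


ratio = pixels_per_sample(NATIVE_HW) / pixels_per_sample(RESIZED_HW)
print(f"native input has {ratio:.2f}x the pixels of the resized input")
```

A roughly fourfold difference in input pixels is consistent with the unresized pipeline running out of memory where the resized one fits.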

XuyangBai commented 2 years ago

Did you use the newest code? There were some bugs when changing the shape of image features, leading to a mismatch between the two modalities, which are fixed in https://github.com/XuyangBai/TransFusion/commit/8977b2b9ed74526b1ac4dd5ce9b60c195dc47056 and https://github.com/XuyangBai/TransFusion/commit/5187414d90d9a216f1e07520bf637a79f664132f

XuyangBai commented 2 years ago

GlobalRotScaleTrans and RandomFlip3D will not break the matching between LiDAR and camera because every time we project the object queries(and initial prediction) from 3d space onto the image plane, we first do the inverse transformation, which will convert the augmented 3d positions back to the original coordinate. See the following code: https://github.com/XuyangBai/TransFusion/blob/5187414d90d9a216f1e07520bf637a79f664132f/mmdet3d/models/dense_heads/transfusion_head.py#L944-L948

BTW, I just realized that this might be the problem: here I assume that BS=1 means evaluation time, so I skip apply_3d_transformation for fast inference. If you use samples_per_gpu=1 for training, you should remove this logic and always call apply_3d_transformation.
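
The inverse-transformation idea described above can be sketched in a few lines. This is an illustration only: the order of operations (flip, rotate about z, scale, translate) and the function names are assumptions, and the repository's `apply_3d_transformation` should be consulted for the real behavior.

```python
import math


def augment(point, flip, angle, scale, trans):
    """Assumed forward augmentation: flip y, rotate about z, scale, translate."""
    x, y, z = point
    if flip:
        y = -y
    c, s = math.cos(angle), math.sin(angle)
    x, y = c * x - s * y, s * x + c * y
    x, y, z = x * scale, y * scale, z * scale
    return (x + trans[0], y + trans[1], z + trans[2])


def invert_augment(point, flip, angle, scale, trans):
    """Undo `augment` by applying the inverse steps in reverse order."""
    x = (point[0] - trans[0]) / scale
    y = (point[1] - trans[1]) / scale
    z = (point[2] - trans[2]) / scale
    c, s = math.cos(-angle), math.sin(-angle)
    x, y = c * x - s * y, s * x + c * y
    if flip:
        y = -y
    return (x, y, z)
```

Only after `invert_augment` are the query positions back in the original LiDAR frame, where the unmodified LiDAR-to-camera calibration is valid for projecting onto the image plane.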

Fan-Yixuan commented 2 years ago

I'm using the latest version of the code, and I'm using 2 samples per GPU. I have another question: RandomFlip3D's parent class, RandomFlip, doesn't support flipping a list of images. Will that matter?

XuyangBai commented 2 years ago

Yes, it might be the reason. If the flip in img_field is set to True but the image is not flipped, then the consistency between LiDAR and the image is broken. You can check the preprocessing classes to figure out how it works in my implementation; I do not remember exactly where I did the conversion from a list of images to an ndarray.

Fan-Yixuan commented 2 years ago

My concern is that maybe https://github.com/open-mmlab/mmdetection/blob/master/mmdet/datasets/pipelines/transforms.py#L465-L469 should be changed to a loop over the list of images like you did in MyResize etc., but I don't understand why you and https://github.com/XuyangBai/TransFusion/issues/7#issuecomment-1118447999 are able to get the correct training results based on the current code.

XuyangBai commented 2 years ago

@Fan-Yixuan I find that mmcv.imflip does work for a list of images; see the following example:

[Screenshot 2022-05-12 8:35:15 AM]

Fan-Yixuan commented 2 years ago

Thanks for the explanation, that's true, but I still can't seem to solve my problem. The strangest thing I found is the change of loss_bbox during training, as shown in the figure. The orange line is the result of LiDAR only, and the red line is the result of LC. Do you have any suggestions? Thanks a lot.

[Screenshot 2022-05-12 17:03:15]

Also, I found that I had missed the changes in train.py, i.e. I didn't freeze the LiDAR branch. Combined with the figure above, I now think this is likely the reason.

XuyangBai commented 2 years ago

It is really weird that the bbox loss starts to increase at some point; the curve before 10k looks normal. I am not sure of the reason, but maybe you can first verify the projection of object queries onto the image through some visualization? If the LiDAR and image are not aligned well, the image features attached to the object queries will be wrong. BTW, you mentioned that mATE, mASE, and mAOE are all increasing, so how about mAP?

Fan-Yixuan commented 2 years ago

For the first three epochs after adding the camera, mAP is 62.49, 58.86, 59.66. I feel that the loss starts to increase probably because the learning rate becomes larger (I use 4*3090 with 2 samples per GPU, so I run two forward/backward passes before each parameter update to make the batch size equal 16; the learning rate thus reaches its maximum at around 40k iters).

Do you think this is normal if the LiDAR branch is not frozen?
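
The two-passes-per-update setup described above is gradient accumulation. A minimal pure-Python sketch, using a toy scalar model rather than TransFusion, shows the batch-size arithmetic and why averaging micro-batch gradients matches one large batch. All function names here are hypothetical.

```python
def effective_batch_size(num_gpus, samples_per_gpu, accum_steps):
    """Samples contributing to one optimizer step."""
    return num_gpus * samples_per_gpu * accum_steps


def grad_mse(w, xs, ys):
    """Gradient w.r.t. scalar w of 0.5 * mean((w * x - y) ** 2)."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, accum_steps):
    """Average the gradients of equal-sized micro-batches before updating."""
    n = len(xs)
    assert n % accum_steps == 0, "micro-batches must have equal size"
    size = n // accum_steps
    total = 0.0
    for i in range(accum_steps):
        lo, hi = i * size, (i + 1) * size
        # Scale each micro-batch gradient so the sum is a mean, not a sum.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return total
```

With 4 GPUs, 2 samples per GPU, and 2 accumulation steps, `effective_batch_size(4, 2, 2)` gives the batch size of 16 mentioned above. Layers with batch statistics such as BatchNorm still see only the micro-batch, so the equivalence is exact only for the gradient of a mean loss.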

XuyangBai commented 2 years ago

The learning rate should not be the reason. I have also tried a batch size of 8*1.

Yes, I freeze the LiDAR branch during training of TransFusion as it is already well trained in the first stage. If you would like to jointly optimize the LiDAR branch and the fusion component, maybe they should be optimized with different learning rates.
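
One way to try the "different learning rates" suggestion is mmcv's `paramwise_cfg` with `custom_keys`, which the default optimizer constructor in mmcv 1.x supports. The fragment below is a sketch under assumptions: the module prefixes follow mmdet3d's MVXTwoStageDetector naming, and the `weight_decay` value and the 0.1 multiplier are placeholders, not values tested in this thread.

```python
# Config fragment: lower learning rate for the pretrained LiDAR branch.
optimizer = dict(
    type='AdamW',
    lr=1e-4,                 # base lr mentioned in the thread
    weight_decay=0.01,       # assumed value
    paramwise_cfg=dict(
        custom_keys={
            # Assumed module prefixes; 0.1 is a hypothetical multiplier.
            'pts_voxel_encoder': dict(lr_mult=0.1),
            'pts_middle_encoder': dict(lr_mult=0.1),
            'pts_backbone': dict(lr_mult=0.1),
            'pts_neck': dict(lr_mult=0.1),
        }
    ),
)


def effective_lr(param_name, cfg=optimizer):
    """Simplified illustration of how a substring key scales the base lr."""
    for key, opts in cfg['paramwise_cfg']['custom_keys'].items():
        if key in param_name:
            return cfg['lr'] * opts.get('lr_mult', 1.0)
    return cfg['lr']
```

`effective_lr` only mimics the matching rule for illustration; in a real run the optimizer constructor applies the multipliers when it builds the parameter groups.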

Fan-Yixuan commented 2 years ago

Hi, sorry for the late reply. I made two changes. The first follows your changes in the dataset definition file, but from what I understand this shouldn't have a real impact.

[Screenshot 2022-05-14 10:34:48]

The second is to freeze the weights of the LiDAR branch. Now I can get 66.75mAP/71.03NDS on the nuScenes validation set. So I think the previous problem was caused by using too large a learning rate for the already well-trained LiDAR branch.
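
For reference, freezing by parameter-name prefix can be sketched as below. This is a generic illustration, not the code in the repository's train.py: `freeze_by_prefix` and the prefix names are hypothetical, and `_ToyModel` is a stand-in that only mimics the `named_parameters()` interface of a `torch.nn.Module` so the sketch runs without PyTorch.

```python
from types import SimpleNamespace


def freeze_by_prefix(model, prefixes):
    """Disable gradients for every parameter whose name starts with a prefix."""
    frozen = []
    for name, param in model.named_parameters():
        if name.startswith(tuple(prefixes)):
            param.requires_grad = False
            frozen.append(name)
    return frozen


class _ToyModel:
    """Stand-in exposing named_parameters() like torch.nn.Module does."""

    def __init__(self):
        self._params = {
            'pts_backbone.conv.weight': SimpleNamespace(requires_grad=True),
            'img_backbone.conv.weight': SimpleNamespace(requires_grad=True),
            'pts_bbox_head.fc.weight': SimpleNamespace(requires_grad=True),
        }

    def named_parameters(self):
        return self._params.items()


toy = _ToyModel()
frozen_names = freeze_by_prefix(toy, ['pts_backbone'])
```

With a real PyTorch model, BatchNorm layers inside the frozen modules usually also need to be put in eval mode, otherwise their running statistics keep changing even though the weights do not.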

XuyangBai commented 2 years ago

Yes, the order of images does not matter much, but freezing the backbone does.

Fan-Yixuan commented 2 years ago

Ok, thank you for your patience and your excellent work. I'll close this issue.

nmll commented 2 years ago

@Fan-Yixuan Hello! Could you tell me the max learning rate in your training step of the first stage and second stage separately?

Fan-Yixuan commented 2 years ago

Hi, my experiment follows the code given by the author https://github.com/XuyangBai/TransFusion/blob/53370467c1b88f163cbe7b7300a1f588a6761e35/configs/transfusion_nusc_voxel_L.py#L244-L250 https://github.com/XuyangBai/TransFusion/blob/53370467c1b88f163cbe7b7300a1f588a6761e35/configs/transfusion_nusc_voxel_LC.py#L246-L252

nmll commented 2 years ago

OK! Thanks!

nmll commented 2 years ago

Hello @XuyangBai! May I ask: apply_3d_transformation is only used when projecting 3D queries to 2D, but it is not used when adding the BEV LiDAR feature and the BEV image feature for image-guided query initialization. Will this cause a mismatch between the LiDAR and image modalities due to RandomFlip3D and GlobalRotScaleTrans?

XuyangBai commented 2 years ago

Hi @nmll That's a very good question that I didn't realize previously. Intuitively, the point clouds should also be transformed using the inversion of data augmentation when projecting image features onto the BEV plane (or equivalently, I should perform a similar rotation and flip to images, which is somewhat complicated). However, the network still works under the current settings. My guess is that the network is able to 1) leverage the contextual relationship (between image features and LiDAR features) to associate the two sets of features and thus perform the projection, and 2) ignore the geometry relationship brought by the position encodings of image features and LiDAR features.

Furthermore, I have run another experiment that removes RandomFlip and GlobalRotScaleTrans during training to see whether forcing the two modalities to be consistent further improves the results. In this case, the network can also leverage the geometric relationship to build the association. The observation is that the training loss decreases more rapidly than in the previous setting. The blue curve in the following figure is the one without RandomFlip & GlobalRotScaleTrans, while the gray curve is the original one. However, the final mAP and NDS are similar. So I assume that removing these two augmentations increases the convergence speed, but the final performance might already be saturated (although the heatmap_loss could be further reduced, the object queries selected by the heatmap already have good locations, so the improvement in final mAP and NDS is not remarkable).

[Screenshot 2022-06-01 9:27:06 PM] [Screenshot 2022-06-01 9:27:14 PM]

I will remove the RandomFlip and GlobalRotScaleTrans in the config files, which is more reasonable and gives better convergence speed. Thanks a lot for pointing out that issue.

Best, Xuyang

heming7 commented 2 years ago

Hello @Fan-Yixuan, can you tell me what you did to solve the version issue (the images not being resized to 448*800)? I am facing the same problem now.

Fan-Yixuan commented 2 years ago

Hi @heming7, you need to make sure that results['img_fields'] is ['img'] and type(results['img']) is list before this code: https://github.com/XuyangBai/TransFusion/blob/8977b2b9ed74526b1ac4dd5ce9b60c195dc47056/mmdet3d/datasets/pipelines/loading.py#L187-L190
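
The two conditions above can be turned into a small debugging check to drop into the pipeline just before the linked lines. The function name `check_multiview_results` is hypothetical; only the keys `img_fields` and `img` come from this thread.

```python
def check_multiview_results(results):
    """Return a list of problems with the multi-view image entries, if any."""
    problems = []
    if results.get('img_fields') != ['img']:
        problems.append(
            f"img_fields should be ['img'], got {results.get('img_fields')!r}")
    if not isinstance(results.get('img'), list):
        problems.append(
            f"img should be a list of images, got {type(results.get('img')).__name__}")
    return problems
```

An empty return value means both conditions hold; otherwise the messages say which one failed, which is usually enough to spot a pipeline step that did not register the images under `img_fields`.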

heminghuang7 commented 2 years ago

Hello Yixuan,

Thank you for the suggestion. I checked the code and I think the author has pushed a commit that fixes this. In the end I managed to run it by reducing the value of samples_per_gpu. Anyway, thank you so much for the help!

yinjunbo commented 1 year ago

Hi @Fan-Yixuan, could you please share the torch/cuda/mmdet3d/spconv environment you used to reproduce the nusc val performance (64.63mAP and 69.99NDS)? It seems that you used 8*3090 with batch size 2 and lr 1e-4?

Fan-Yixuan commented 1 year ago

Hi, my env: https://github.com/XuyangBai/TransFusion/issues/10#issue-1226416264, my spconv: 2.1.21, my total batch size: 16, lr: 1e-4

yinjunbo commented 1 year ago

@Fan-Yixuan, thanks for your quick reply! I'll have another try. Btw, could you please share your training log so I can check my problem accordingly (email: yinjunbocn@gmail.com)?

Fan-Yixuan commented 1 year ago

@yinjunbo Sure. For LiDAR-only training, the first 15 epochs: 20220505_225100.log; the last 5 epochs (fade strategy): 20220508_101828.log

yinjunbo commented 1 year ago

Thank you very much! I find that my training loss is obviously larger than yours. Did you try to train a model before the coordinate system refactoring?

Fan-Yixuan commented 1 year ago

@yinjunbo Sorry, I didn't save the training logs from before modifying the coordinate system, but if the coordinates are not aligned, it should work very poorly.

yinjunbo commented 1 year ago

I totally agree with you. Since my reproduced performance is only slightly lower (~2 points) than yours, this cannot be caused by the coordinate system. I'll continue to look for the problem. Thanks!

BoomSky0416 commented 1 year ago

@Fan-Yixuan Hello, I am trying to reproduce transfusion in mmdet3d-1.1.0. But I got the wrong result in training lidar-camera fusion stage. Could you please share your training log for this stage, thanks! (email: shoutian@umich.edu)