XuyangBai / TransFusion

[PyTorch] Official implementation of CVPR2022 paper "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers". https://arxiv.org/abs/2203.11496
Apache License 2.0
619 stars 76 forks

training time #15

Open zzm-hl opened 2 years ago

zzm-hl commented 2 years ago

Hello, I am trying to train TransFusion-L in the same way. The devices are 4 * A100 with batch size 4, but the training log below shows an ETA of about 20 days, and the GPU usage is full. Is this normal?

```
2022-05-11 10:00:33,899 - mmdet - INFO - workflow: [('train', 1)], max: 20 epochs
2022-05-11 10:00:33,910 - mmdet - INFO - Checkpoints will be saved to /public/home/u212040344/TransFusion/work_dirs/transfusion_nusc_voxel_L by HardDiskBackend.
2022-05-11 10:01:03,697 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
2022-05-11 10:09:59,947 - mmdet - INFO - Epoch [1][50/8007]  lr: 1.000e-04, eta: 20 days, 21:02:54, time: 11.267, data_time: 0.303, memory: 10071, loss_heatmap: 232.7426, layer_-1_loss_cls: 5.0217, layer_-1_loss_bbox: 13.3883, matched_ious: 0.0015, loss: 251.1526, grad_norm: 1449.8635
2022-05-11 10:19:10,967 - mmdet - INFO - Epoch [1][100/8007] lr: 1.000e-04, eta: 20 days, 15:25:22, time: 11.021, data_time: 0.086, memory: 10276, loss_heatmap: 2.8675, layer_-1_loss_cls: 3.2865, layer_-1_loss_bbox: 5.6319, matched_ious: 0.0365, loss: 11.7859, grad_norm: 12.8393
2022-05-11 10:28:29,165 - mmdet - INFO - Epoch [1][150/8007] lr: 1.000e-04, eta: 20 days, 15:33:13, time: 11.164, data_time: 0.093, memory: 10276, loss_heatmap: 2.4009, layer_-1_loss_cls: 2.3035, layer_-1_loss_bbox: 3.6210, matched_ious: 0.0744, loss: 8.3254, grad_norm: 12.3999
2022-05-11 10:37:30,199 - mmdet - INFO - Epoch [1][200/8007] lr: 1.000e-04, eta: 20 days, 11:44:18, time: 10.821, data_time: 0.084, memory: 10276, loss_heatmap: 2.3020, layer_-1_loss_cls: 1.7072, layer_-1_loss_bbox: 3.6516, matched_ious: 0.0924, loss: 7.6607, grad_norm: 9.5290
2022-05-11 10:46:29,546 - mmdet - INFO - Epoch [1][250/8007] lr: 1.000e-04, eta: 20 days, 9:04:41, time: 10.786, data_time: 0.083, memory: 10276, loss_heatmap: 2.2058, layer_-1_loss_cls: 1.3065, layer_-1_loss_bbox: 3.4434, matched_ious: 0.1036, loss: 6.9556, grad_norm: 9.0192
2022-05-11 10:55:13,639 - mmdet - INFO - Epoch [1][300/8007] lr: 1.000e-04, eta: 20 days, 5:00:17, time: 10.482, data_time: 0.066, memory: 10276, loss_heatmap: 2.1473, layer_-1_loss_cls: 0.9710, layer_-1_loss_bbox: 3.2847, matched_ious: 0.1172, loss: 6.4030, grad_norm: 6.2690
2022-05-11 11:04:02,933 - mmdet - INFO - Epoch [1][350/8007] lr: 1.001e-04, eta: 20 days, 2:42:54, time: 10.586, data_time: 0.071, memory: 10329, loss_heatmap: 2.0265, layer_-1_loss_cls: 0.7707, layer_-1_loss_bbox: 2.8693, matched_ious: 0.1377, loss: 5.6664, grad_norm: 5.8471
2022-05-11 11:12:45,745 - mmdet - INFO - Epoch [1][400/8007] lr: 1.001e-04, eta: 20 days, 0:14:04, time: 10.455, data_time: 0.092, memory: 10329, loss_heatmap: 1.9567, layer_-1_loss_cls: 0.6361, layer_-1_loss_bbox: 2.7135, matched_ious: 0.1523, loss: 5.3063, grad_norm: 5.1551
2022-05-11 11:21:30,595 - mmdet - INFO - Epoch [1][450/8007] lr: 1.001e-04, eta: 19 days, 22:28:56, time: 10.498, data_time: 0.078, memory: 10329, loss_heatmap: 1.8753, layer_-1_loss_cls: 0.5564, layer_-1_loss_bbox: 2.6068, matched_ious: 0.1629, loss: 5.0386, grad_norm: 4.8415
2022-05-11 11:30:04,937 - mmdet - INFO - Epoch [1][500/8007] lr: 1.001e-04, eta: 19 days, 20:06:52, time: 10.287, data_time: 0.058, memory: 10329, loss_heatmap: 1.8338, layer_-1_loss_cls: 0.5122, layer_-1_loss_bbox: 2.5010, matched_ious: 0.1706, loss: 4.8469, grad_norm: 4.7645
2022-05-11 11:38:49,815 - mmdet - INFO - Epoch [1][550/8007] lr: 1.002e-04, eta: 19 days, 18:59:54, time: 10.497, data_time: 0.074, memory: 10329, loss_heatmap: 1.7888, layer_-1_loss_cls: 0.4798, layer_-1_loss_bbox: 2.4147, matched_ious: 0.1812, loss: 4.6832, grad_norm: 4.6580
2022-05-11 11:47:49,617 - mmdet - INFO - Epoch [1][600/8007] lr: 1.002e-04, eta: 19 days, 19:08:50, time: 10.796, data_time: 0.093, memory: 10329, loss_heatmap: 1.7230, layer_-1_loss_cls: 0.4455, layer_-1_loss_bbox: 2.2504, matched_ious: 0.1924, loss: 4.4189, grad_norm: 4.5364
2022-05-11 11:57:11,837 - mmdet - INFO - Epoch [1][650/8007] lr: 1.002e-04, eta: 19 days, 20:46:54, time: 11.245, data_time: 0.150, memory: 10329, loss_heatmap: 1.6958, layer_-1_loss_cls: 0.4242, layer_-1_loss_bbox: 2.2423, matched_ious: 0.1984, loss: 4.3624, grad_norm: 4.5513
2022-05-11 12:06:23,662 - mmdet - INFO - Epoch [1][700/8007] lr: 1.003e-04, eta: 19 days, 21:30:08, time: 11.037, data_time: 0.117, memory: 10329, loss_heatmap: 1.6594, layer_-1_loss_cls: 0.4082, layer_-1_loss_bbox: 2.1203, matched_ious: 0.2068, loss: 4.1879, grad_norm: 4.7048
2022-05-11 12:15:22,443 - mmdet - INFO - Epoch [1][750/8007] lr: 1.003e-04, eta: 19 days, 21:20:11, time: 10.776, data_time: 0.098, memory: 10329, loss_heatmap: 1.6039, layer_-1_loss_cls: 0.3894, layer_-1_loss_bbox: 2.1366, matched_ious: 0.2088, loss: 4.1299, grad_norm: 4.5718
2022-05-11 12:24:00,234 - mmdet - INFO - Epoch [1][800/8007] lr: 1.003e-04, eta: 19 days, 20:00:22, time: 10.354, data_time: 0.084, memory: 10329, loss_heatmap: 1.5675, layer_-1_loss_cls: 0.3778, layer_-1_loss_bbox: 2.0370, matched_ious: 0.2192, loss: 3.9823, grad_norm: 4.4354
2022-05-11 12:32:37,978 - mmdet - INFO - Epoch [1][850/8007] lr: 1.004e-04, eta: 19 days, 18:49:11, time: 10.356, data_time: 0.084, memory: 10329, loss_heatmap: 1.5402, layer_-1_loss_cls: 0.3653, layer_-1_loss_bbox: 2.0204, matched_ious: 0.2260, loss: 3.9259, grad_norm: 4.2611
2022-05-11 12:41:08,520 - mmdet - INFO - Epoch [1][900/8007] lr: 1.004e-04, eta: 19 days, 17:23:30, time: 10.211, data_time: 0.097, memory: 10329, loss_heatmap: 1.5248, layer_-1_loss_cls: 0.3564, layer_-1_loss_bbox
```

```
------------------------------------------------------------
sys.platform: linux
Python: 3.7.13 (default, Mar 29 2022, 02:18:16) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA A100-SXM4-40GB
CUDA_HOME: /public/home/u212040344/usr/local/cuda-11.1
NVCC: Build cuda_11.1.TC455_06.29069683_0
GCC: gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
PyTorch: 1.8.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.9.0
OpenCV: 4.5.5
MMCV: 1.3.18
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.11.0
MMDetection3D: 0.12.0+5337046
```

[image attachment]

XuyangBai commented 2 years ago

Hi, it is not normal. The training time for 8 V100 GPUs is about 2 days. One speculation is the I/O speed: could you first check the read/write speed of your disk? (But it is also weird, because in that case the GPU usage should be very low.)
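(A minimal way to do that check, for reference; the paths below are placeholders and should point at the filesystem that actually holds the nuScenes data.)

```bash
# Sequential write speed into the data directory (oflag=direct bypasses the page
# cache; not every filesystem supports it, in which case drop the flag).
dd if=/dev/zero of=/path/to/data/io_test.tmp bs=1M count=2048 oflag=direct status=progress
rm /path/to/data/io_test.tmp

# Sequential read speed of an existing large file, e.g. one of the nuScenes info
# pickles (add iflag=direct to avoid measuring the page cache instead of the disk).
dd if=/path/to/data/nuscenes_infos_train.pkl of=/dev/null bs=1M iflag=direct status=progress
```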

zzm-hl commented 2 years ago

Hi, it is not normal. The training time for 8 V100 GPUs is about 2 days. One speculation is the I/O speed: could you first check the read/write speed of your disk? (But it is also weird, because in that case the GPU usage should be very low.)

Thanks for your reply. I noticed that GPU utilization is always at 100% and does not jump between high and low. I wonder if this has anything to do with my run script setup; here is my run script on the cluster. Could you help me check whether the settings are correct? In the config I set samples_per_gpu = 4 and workers_per_gpu = 8. No other code changes have been made. This question puzzles me. Looking forward to your reply!

```bash
#!/bin/bash
#SBATCH -J transfusion
#SBATCH -p gpu
#SBATCH -N 1
#SBATCH --ntasks-per-node=32
#SBATCH --gres=gpu:4
#SBATCH --time=48:00:00

cd $SLURM_SUBMIT_DIR
source /public/software/profile.d/apps_anaconda3-2021.05.sh
source /public/software/profile.d/compiler_cmake-compiler-3.20.1.sh
source /public/software/profile.d/compiler_gcc-7.3.1.sh
source /public/home/u212040344/cuda-11.1.sh

conda activate open-mmlab
cd TransFusion
export PYTHONPATH="${PYTHONPATH}:TransFusion"
export OMP_NUM_THREADS=1

python -m torch.distributed.launch --nproc_per_node=4 --master_port=29502 \
    ./tools/train.py /public/home/u212040344/TransFusion/configs/transfusion_nusc_voxel_L.py \
    --launcher pytorch
```

Or is it related to the MMCV/mmdet version?
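(For comparison, a minimal sketch of the same launch through the `tools/dist_train.sh` helper that mmdetection3d-based repos normally ship; whether this fork includes it is an assumption, and the Slurm directives are just the ones from the script above.)

```bash
#!/bin/bash
#SBATCH -J transfusion
#SBATCH -p gpu
#SBATCH -N 1
#SBATCH --gres=gpu:4
#SBATCH --time=48:00:00

cd $SLURM_SUBMIT_DIR/TransFusion
conda activate open-mmlab

# dist_train.sh wraps torch.distributed.launch; in the OpenMMLab scripts the
# rendezvous port can be overridden via the PORT environment variable.
PORT=29502 bash ./tools/dist_train.sh configs/transfusion_nusc_voxel_L.py 4
```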

XuyangBai commented 2 years ago

Could you try increasing OMP_NUM_THREADS?
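(A minimal sketch of what that could look like in the Slurm script above; the value 8 is an assumption based on the 32 tasks / 4 GPUs requested there, so check the cores actually visible inside the job first.)

```bash
# Inside the Slurm job, check how many CPU cores the cgroup actually grants:
nproc

# Give each of the 4 training processes a share of those cores instead of 1:
export OMP_NUM_THREADS=8   # e.g. 32 visible cores / 4 GPUs

python -m torch.distributed.launch --nproc_per_node=4 --master_port=29502 \
    ./tools/train.py configs/transfusion_nusc_voxel_L.py --launcher pytorch
```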

zzm-hl commented 2 years ago

Could you try increasing OMP_NUM_THREADS?

I have tried setting OMP_NUM_THREADS=32, but it seems even slower...

zzm-hl commented 2 years ago

Could you try increasing OMP_NUM_THREADS?

I tried training on 1 GPU without changing samples_per_gpu or workers_per_gpu, to rule out communication problems between the GPUs, but the speed is still very slow, as shown below. Dividing its remaining time by 4 gives roughly the same ETA as the 4 GPUs running in parallel.

```
2022-05-11 19:45:57,995 - mmdet - INFO - Epoch [1][50/32025]  lr: 1.000e-04, eta: 119 days, 9:53:56, time: 16.109, data_time: 0.766, memory: 9886, loss_heatmap: 241.0956, layer_-1_loss_cls: 5.0219, layer_-1_loss_bbox: 13.4131, matched_ious: 0.0017, loss: 259.5305, grad_norm: 1624.4290
2022-05-11 19:56:16,605 - mmdet - INFO - Epoch [1][100/32025] lr: 1.000e-04, eta: 105 days, 13:25:47, time: 12.374, data_time: 0.278, memory: 9991, loss_heatmap: 3.0213, layer_-1_loss_cls: 3.6241, layer_-1_loss_bbox: 6.1702, matched_ious: 0.0309, loss: 12.8155, grad_norm: 17.8926
2022-05-11 20:05:29,993 - mmdet - INFO - Epoch [1][150/32025] lr: 1.000e-04, eta: 97 days, 16:58:10, time: 11.066, data_time: 0.172, memory: 9991, loss_heatmap: 2.6378, layer_-1_loss_cls: 2.5989, layer_-1_loss_bbox: 3.9740, matched_ious: 0.0569, loss: 9.2107, grad_norm: 21.1770
2022-05-11 20:14:28,252 - mmdet - INFO - Epoch [1][200/32025] lr: 1.000e-04, eta: 93 days, 5:17:57, time: 10.766, data_time: 0.135, memory: 10047, loss_heatmap: 2.5806, layer_-1_loss_cls: 1.9272, layer_-1_loss_bbox: 3.7113, matched_ious: 0.0633, loss: 8.2192, grad_norm: 19.9510
2022-05-11 20:23:04,763 - mmdet - INFO - Epoch [1][250/32025] lr: 1.000e-04, eta: 89 days, 21:10:15, time: 10.331, data_time: 0.125, memory: 10047, loss_heatmap: 2.4182, layer_-1_loss_cls: 1.5215, layer_-1_loss_bbox: 3.7239, matched_ious: 0.0781, loss: 7.6636, grad_norm: 8.7352
2022-05-11 20:32:08,322 - mmdet - INFO - Epoch [1][300/32025] lr: 1.000e-04, eta: 88 days, 7:41:22, time: 10.870, data_time: 0.219, memory: 10047, loss_heatmap: 2.3998, layer_-1_loss_cls: 1.1457, layer_-1_loss_bbox: 3.6249, matched_ious: 0.0914, loss: 7.1704, grad_norm: 9.3277
```
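(A side note on the "GPU usage is full" observation: nvidia-smi's utilization figure only says that some kernel was active during the sampling window, so it can read 100% even when a handful of slow kernels are serializing the step. A standard way to watch this while training, using only stock nvidia-smi query fields:)

```bash
# Log per-GPU utilization, memory activity, power draw and SM clock every 2 s.
# Low power draw at "100%" GPU-Util usually points to a few slow kernels rather
# than a genuinely saturated GPU.
nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,power.draw,clocks.sm \
           --format=csv -l 2
```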

nmll commented 2 years ago

@zzm-hl Have you solved this problem? I am training the first stage on 8 * 3090 with bs=2/GPU and it takes about 4 days, which is too much time.

Part of my training log:

```
import DCN failed                  (x8, one per process)
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
2022-05-17 22:18:41,947 - mmdet - INFO - Environment info:

sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.1.TC455_06.29069683_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.7.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.8.0
OpenCV: 4.5.5
MMCV: 1.3.0
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.11.0
MMDetection3D: 0.11.0+

/data/workspace/liyao/.conda/envs/lym3dpy38/lib/python3.8/site-packages/mmdet/apis/train.py:95: UserWarning: config is now expected to have a runner section, please set runner in your config. warnings.warn(      (x8, one per process)
2022-05-17 22:19:45,495 - mmdet - INFO - Start running, host: liyao@linke5, work_dir: /data/workspace/liyao/TransFusion/work_dirs/transfusion_nusc_voxel_L
2022-05-17 22:19:45,496 - mmdet - INFO - workflow: [('train', 1), ('val', 1)], max: 20 epochs
2022-05-17 22:20:08,208 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
```

zzm-hl commented 2 years ago

When I install it with mmcv=1.3.0, mmdet=2.10.0, mmdet3d=0.11.0, it reports "CUDA error: no kernel image is available for execution on the device", and I don't know how to solve it. So I tried building it with another version combination (mmcv=1.3.18, mmdet=2.11.0, mmdet3d=0.12.0). That works, but it then needs about 20 days on 4 A100 GPUs with batch size 4, which has confused me for days...


zzm-hl commented 2 years ago

Up to now I have not found a useful solution. I use the school cluster; I have tested the I/O speed and it is not slow (the cluster is well configured), so I have not found the cause. Maybe I could try MMCV=1.3.0, but I haven't been able to install it yet due to "CUDA error: no kernel image is available for execution on the device".
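(On that error: it typically means the mmcv-full CUDA ops were not compiled for the A100's sm_80 architecture. A hedged sketch of rebuilding mmcv-full from source for that architecture; v1.3.0 is simply the version mentioned in this thread, and the version that actually matches the repo's mmdet/mmdet3d requirements should be taken from its install guide.)

```bash
# Build mmcv-full from source against the local PyTorch / CUDA 11.1 install,
# explicitly targeting the A100 (compute capability 8.0).
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.3.0                  # version from this thread; adjust as needed
export TORCH_CUDA_ARCH_LIST="8.0"    # compile the CUDA ops for sm_80
MMCV_WITH_OPS=1 pip install -e . -v

# Quick sanity check that the compiled ops load on the GPU node:
python -c "from mmcv.ops import nms; print('mmcv-full CUDA ops OK')"
```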