Hiusam opened this issue 1 year ago (status: Open)
A few questions and comments:
Did you run the config with the same batch size, learning rate, and schedule that we suggest? Deviating from our recipe will certainly change the behavior of the losses during training (as is expected).
Yes, occasionally we do encounter high losses during training. This happens because an image might be out of distribution or have extreme annotations -- something 3D suffers from more than 2D. For this reason, we provide checks and skip gradient updates in these cases. Given you use the recipe we provide, though, the model should have trained successfully.
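A minimal sketch of this kind of loss-based skipping, assuming a standard PyTorch training loop in which the model returns a dict of losses; the function name, window size, and tolerance below are illustrative, not the repository's actual implementation:

```python
import math
from collections import deque

import torch


def train_with_loss_skipping(model, optimizer, data_loader,
                             window=100, tolerance=4.0):
    """Skip optimizer steps whose loss is non-finite or abnormally large.

    `window` and `tolerance` are illustrative: the current loss is compared
    against the median of the last `window` finite losses, and the update is
    skipped if it exceeds `tolerance` times that median.
    """
    recent_losses = deque(maxlen=window)
    model.train()

    for iteration, batch in enumerate(data_loader):
        loss_dict = model(batch)          # model returns a dict of losses
        loss = sum(loss_dict.values())

        # Skip non-finite losses outright (NaN/Inf would corrupt the weights).
        if not math.isfinite(loss.item()):
            optimizer.zero_grad(set_to_none=True)
            print(f"iter {iteration}: non-finite loss, skipping update")
            continue

        # Skip losses far above the recent median (out-of-distribution images
        # or extreme annotations).
        if len(recent_losses) == window:
            median = sorted(recent_losses)[window // 2]
            if loss.item() > tolerance * median:
                optimizer.zero_grad(set_to_none=True)
                print(f"iter {iteration}: loss {loss.item():.2f} >> median "
                      f"{median:.2f}, skipping update")
                continue

        recent_losses.append(loss.item())

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
```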
`!! Restarting training at 51028 iters. Exploding loss 2% of iters !!`

Maybe I should keep training and hope that, after some restarting, the training will complete? :(
- Large losses during training
And to confirm: you ran with the same batch size? You should certainly keep training the model. We skip updates when losses are large to make training robust. The training should complete.
- Gradient clip
Gradient clipping is another way to protect your model from large losses (and thus large gradients). We chose to skip the updates; gradient clipping instead clips the gradients themselves. Skipping updates when losses are large is certainly less aggressive than gradient clipping, which is why we prefer it (see the sketch after this list).
- Clear the dataset
@Hiusam this is not a dataset issue; there is nothing in the dataset to clear. 3D detection is simply much, much harder than 2D detection. For instance, there are scenes with very distant objects (e.g. objects as far as 200m away), in which case a wrong depth prediction in metric space will produce a large loss and thus large gradients. The solution is not to "clear" the dataset in any way, but to robustify training, which we do.
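For comparison, a minimal sketch of the gradient-clipping alternative discussed above, using PyTorch's built-in utility; the function name and `max_norm` value are illustrative, not something the repository ships:

```python
import torch


def step_with_grad_clipping(model, optimizer, loss, max_norm=1.0):
    """One optimizer step with gradient-norm clipping.

    Unlike skipping the update, this always applies a step, but rescales the
    gradients whenever their global L2 norm exceeds `max_norm`.
    """
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
```

The trade-off is that clipping alters every step with large gradients, whereas skipping leaves the model untouched on the rare bad batches.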
Hi @gkioxari, nice work! I also encountered the same issue. Do you have an estimate of how many retries it usually needs? My experiments seem to have been restarted many times; e.g., the estimated training time is 21 hours, yet after two days training is still restarting. I also use the same Base_Omni3D_out config without any changes. Your suggestions would be very helpful!
I encountered the same problem. Without modifying the code, the training loss explodes during training in both indoor and outdoor scenes. I have tried resuming the experiments from the saved checkpoint, but it does not help: the loss soon explodes again.
Hi, I ran your code with the Base_Omni3D_out config and encountered exploding losses after iteration 43k.
I also found that scaling the batch size up to 160 made the model even more likely to hit `Skipping gradient update due to higher than normal loss`. Is this a normal phenomenon? I ran the code with 8 A100 GPUs. My environment is:
Thank you.