facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

Detectron2 should free memory after training. #4007

Open ExtReMLapin opened 2 years ago

ExtReMLapin commented 2 years ago

Hello,

Long story short: Detectron2 isn't freeing the VRAM that torch allocated after training, and on a system running multiple instances this can lead to incorrect reports of "no more memory available".

Instructions To Reproduce the Issue:

See the attached code; it was trained with faster_rcnn_R_50_FPN_3x, but the behavior should be the same with any other model from the model zoo.

solotrainer.zip

VRAM USAGE

  1. Before training: 0.6 GB
  2. During training: 4.8 GB
  3. After training finished (without closing the Python process): 4.8 GB
  4. After manually calling torch.cuda.empty_cache(): 1.4 GB (forcing Python garbage collection alone doesn't do anything; see the measurement sketch after this list)
  5. Closing the process: 0.6 GB
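
For reference, a minimal sketch of how these numbers can be measured from inside the process (this is not the attached solotrainer.zip; the dataset name and solver settings are placeholders for illustration):

```python
import torch
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

def reserved_gb():
    # memory_reserved() is what the caching allocator currently holds from CUDA,
    # i.e. roughly what nvidia-smi reports as used by this process.
    return torch.cuda.memory_reserved() / 1024**3

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("my_dataset_train",)  # placeholder: a registered dataset
cfg.SOLVER.MAX_ITER = 300                   # short run, illustration only

print(f"before training:     {reserved_gb():.1f} GB reserved")
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
print(f"after training:      {reserved_gb():.1f} GB reserved")  # stays high

torch.cuda.empty_cache()
print(f"after empty_cache(): {reserved_gb():.1f} GB reserved")  # drops
```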

Expected behavior:

From what I understand, torch uses a process-wide memory allocator instead of a system-wide one (for obvious reasons).

The thing is, from what I understand, doing this takes the risk of having multiple independent torch memory allocators on one system.


Process A1 takes 6 GB of VRAM but doesn't automatically release it because "it could need it later", and the same goes for process A2 running another torch project.

Process B1 then tries to allocate 3 GB of VRAM, but CUDA refuses because it has no "free VRAM pages" left to distribute, while in reality A1 and A2 are just reserving VRAM for nothing.


This is why, since GPUs don't yet have a proper memory management system and torch isn't a system-wide allocator, you should probably free memory as soon as you can, using either torch.cuda.empty_cache() or any other torch call that manually frees pages/memory blocks.
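
To make the allocator behavior concrete, here is a small self-contained illustration of allocated vs. reserved memory, independent of detectron2 (the tensor size is arbitrary):

```python
import torch

def stats():
    alloc = torch.cuda.memory_allocated() / 1024**2   # memory held by live tensors
    reserv = torch.cuda.memory_reserved() / 1024**2   # memory held by the caching allocator
    print(f"allocated: {alloc:.0f} MiB, reserved: {reserv:.0f} MiB")

x = torch.empty(1024, 1024, 256, device="cuda")  # ~1 GiB of float32
stats()                    # allocated ~1024 MiB, reserved ~1024 MiB

del x                      # the tensor is gone, but the block stays cached
stats()                    # allocated ~0 MiB, reserved still ~1024 MiB

torch.cuda.empty_cache()   # return unused cached blocks to the driver
stats()                    # reserved drops; other processes can now use that VRAM
```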

I would have expected you to hook the DefaultTrainer's garbage collection into torch.cuda.empty_cache() or a more specific memory "liberator" function.

You can hook garbage collection using ffi.gc (not directly on the DefaultTrainer class, obviously, but on the specific variable that, by garbage-collection cascade effect, will be collected), as sketched below.
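
ffi.gc is a cffi construct; a rough pure-Python analogue of the same idea would be weakref.finalize, which runs a callback once the target object has been collected. A sketch of the suggestion only (not existing detectron2 behavior; cfg is assumed to be built elsewhere):

```python
import weakref
import torch
from detectron2.engine import DefaultTrainer

def build_trainer(cfg):
    trainer = DefaultTrainer(cfg)
    # Once `trainer` (and, through the collection cascade, its model and
    # optimizer) has been garbage collected, hand the cached CUDA blocks
    # back to the driver.
    weakref.finalize(trainer, torch.cuda.empty_cache)
    return trainer
```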

Environment:

Tested on W10 x64 + GTX 1080 detectron2 @ ef2c3abbd36d4093a604f874243037691f634c2f

Also tested on Ubuntu with two Tesla V100S 32 GB GPUs.

W10 env :

----------------------  -----------------------------------------------------------------------------------------------------------------
sys.platform            win32
Python                  3.8.6 (tags/v3.8.6:db45529, Sep 23 2020, 15:52:53) [MSC v.1927 64 bit (AMD64)]
numpy                   1.22.2
detectron2              0.6 @e:\CENSORED\detectron2\detectron2
detectron2._C           not built correctly: DLL load failed while importing _C: The specified procedure could not be found.
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.10.2+cu113 @C:\Users\xxx\AppData\Roaming\Python\Python38\site-packages\torch
PyTorch debug build     False
GPU available           Yes
GPU 0                   NVIDIA GeForce GTX 1080 (arch=6.1)
Driver version          511.23
CUDA_HOME               C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6
Pillow                  8.2.0
torchvision             0.11.3+cu113 @C:\Users\xxx\AppData\Roaming\Python\Python38\site-packages\torchvision
torchvision arch flags  C:\Users\xxx\AppData\Roaming\Python\Python38\site-packages\torchvision\_C.pyd; cannot find cuobjdump
fvcore                  0.1.5.post20211023
iopath                  0.1.9
cv2                     4.5.2
----------------------  -----------------------------------------------------------------------------------------------------------------
PyTorch built with:
  - C++ Version: 199711
  - MSVC 192829337
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 2019
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.4
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=C:/w/b/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/w/b/windows/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON,

Please keep in mind that I'm new to this GPU computing/AI subset of computer science and I may have misunderstood something.

ppwwyyxx commented 2 years ago

The behavior is expected, and as you noted, it is how pytorch works.

One addition we could make is to let the trainer call empty_cache() at the end of training - this sounds reasonable to me. However, users can easily do this themselves if they need to, so this doesn't sound like something that has to be added to our trainer.
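
For example, a minimal sketch of doing it on the user side with a training hook (cfg is assumed to be configured with the usual model/dataset/solver settings):

```python
import torch
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer, HookBase

class EmptyCudaCacheHook(HookBase):
    def after_train(self):
        # Return cached-but-unallocated CUDA blocks to the driver once
        # training has finished.
        torch.cuda.empty_cache()

cfg = get_cfg()  # plus the usual model/dataset/solver settings
trainer = DefaultTrainer(cfg)
trainer.register_hooks([EmptyCudaCacheHook()])
trainer.resume_or_load(resume=False)
trainer.train()
```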

I can't think of any other reasonable solutions that can be added in detectron2 to change the memory allocation behavior. In particular I don't think we can "hook the garbage collector", whatever that means. Please tell us if you have any concrete suggestions.

ExtReMLapin commented 2 years ago

Thank you for your answer. After a short investigation into how the PyTorch CUDA memory manager works and how the Python GC works, here is what I found:

While implementing __del__ with a manual torch.cuda.empty_cache() call inside sounds inelegant, maybe calling it manually right after training finishes is the best way to handle this problem?
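
A minimal sketch of that option, assuming trainer is the DefaultTrainer built earlier in the script:

```python
import gc
import torch

trainer.train()

# Drop the references keeping the model/optimizer tensors alive so their
# blocks become cached-but-unallocated...
del trainer
gc.collect()

# ...then hand the cached blocks back to the CUDA driver so other processes
# can allocate them.
torch.cuda.empty_cache()
```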