Open ExtReMLapin opened 2 years ago
The behavior is expected, and as you noted, it is how pytorch works.
One addition we could make is to have the trainer call empty_cache() at the end of training - this sounds reasonable to me. However, users can easily do this themselves if they need to, so this doesn't sound like something that has to be added to our trainer.
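For example, a user can do this at the end of their own training script (a minimal sketch; `cfg` is assumed to be a detectron2 config prepared as usual):

```python
import torch
from detectron2.engine import DefaultTrainer

# `cfg` is assumed to be a detectron2 config prepared as usual (get_cfg(), model zoo, ...).
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

# Drop references to the trainer (and the model it holds), then ask PyTorch to
# return its cached, unused CUDA blocks to the driver so other processes can use them.
del trainer
torch.cuda.empty_cache()
```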
I can't think of any other reasonable solutions that can be added in detectron2 to change the memory allocation behavior. In particular I don't think we can "hook the garbage collector", whatever that means. Please tell us if you have any concrete suggestions.
Thank you for your answer. After a short investigation into how the PyTorch CUDA memory manager works and how the Python GC works, here is what I found:
PyObject_GC_Track(PyObject *op) can be used to track an object and hook its garbage collection. However, as far as I understand, detectron2 is a pure Python project, so this is not a reasonable approach.
PyTorch has torch.cuda.empty_cache(), torch.cuda.caching_allocator_alloc() and torch.cuda.caching_allocator_delete(). The last two functions are the ones that could be interesting, but torch.cuda.caching_allocator_alloc() isn't used at all, since detectron2 allocates through torch.as_tensor() instead.
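For reference, here is roughly how those two allocator functions can be used directly (a minimal sketch assuming a CUDA device is available; this is independent of detectron2):

```python
import torch

# Allocate 1 MiB directly from the CUDA caching allocator on device 0.
ptr = torch.cuda.caching_allocator_alloc(1024 * 1024, device=0)
print(torch.cuda.memory_reserved(0))  # the allocator now holds at least this block

# Free the block back to the allocator, then release the cached memory to the driver.
torch.cuda.caching_allocator_delete(ptr)
torch.cuda.empty_cache()
print(torch.cuda.memory_reserved(0))  # reserved memory shrinks back
```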
While implementing __del__ that manually calls torch.cuda.empty_cache() sounds inelegant, maybe calling it manually right after training finishes is the best way to handle this problem?
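If hooking the collection of the trainer object is preferred over a __del__ method, one pure-Python way to sketch it (hypothetical, not something detectron2 currently does) is weakref.finalize:

```python
import weakref
import torch
from detectron2.engine import DefaultTrainer

def release_cuda_cache():
    # Return PyTorch's cached (but unused) CUDA blocks to the driver.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

# `cfg` is assumed to be a detectron2 config prepared as usual (get_cfg(), model zoo, ...).
trainer = DefaultTrainer(cfg)
# Run release_cuda_cache() once the trainer object is garbage-collected.
# Caveat: the finalizer fires while the trainer's attributes (model, optimizer)
# are still alive, so it may only release part of the cache at that point.
weakref.finalize(trainer, release_cuda_cache)

trainer.train()
del trainer
```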
Hello,
Long story short: Detectron isn't freeing the VRAM that torch allocated after training, and on a system running multiple instances this can lead to an incorrect "no more memory available" report.
Instructions To Reproduce the Issue:
See the attached code (solotrainer.zip). It was trained with faster_rcnn_R_50_FPN_3x, but the behavior should be the same with any other model from the model zoo.
VRAM usage:
After torch.cuda.empty_cache(): 1.4 GB (forcing garbage collection doesn't do anything)

Expected behavior:
From what I understood, torch uses a process-wide memory allocation system instead of a system-wide one (for obvious reasons). The thing is, from what I understand, doing this risks ending up with multiple torch memory allocators on one system: process A1 takes 6 GB of VRAM but doesn't automatically release it because "it could need it later", and the same goes for process A2 running another torch project. Process B1 then tries to allocate 3 GB of VRAM, but CUDA won't allow it because it has no free VRAM pages left to distribute, while in reality A1 and A2 are just reserving VRAM for nothing.
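This behavior can be observed with PyTorch's own counters (a minimal sketch; the numbers are hypothetical and depend on the model):

```python
import gc
import torch

# Run after training has finished and the trainer/model objects have been deleted.
gc.collect()
print(torch.cuda.memory_allocated() / 2**30)  # live tensors: close to 0 GB
print(torch.cuda.memory_reserved() / 2**30)   # cached blocks: can still be several GB,
                                              # which other processes see as occupied

torch.cuda.empty_cache()
print(torch.cuda.memory_reserved() / 2**30)   # the cache has been returned to the driver
```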
This is why, since GPUs don't yet have a proper memory management system and torch isn't a system-wide allocator, you should probably free memory as soon as you can, using either torch.cuda.empty_cache() or any other torch function call that can manually free pages/memory blocks.

I would have expected you to plug the DefaultTrainer garbage collection into torch.cuda.empty_cache() or a more specific memory "liberator" function. You can hook garbage collection using ffi.gc (not directly on the DefaultTrainer class, obviously, but on the specific variable that, by garbage-collection cascade effect, will be collected).

Environment:
Tested on W10 x64 + GTX 1080, detectron2 @ ef2c3abbd36d4093a604f874243037691f634c2f
Also tested on Ubuntu with two Tesla V100S 32 GB
W10 env:
Please understand that I'm new to this GPU computing / AI subset of computer science and I may have misunderstood something.