mondrasovic closed this issue 2 years ago
Have you checked what DTYPE in the config file is? There's a check at the start of the train_net.py code that looks at whether it is float16 or float32. If set to float16, it will use mixed precision to keep operations within the float16 range. If you do everything in float32, it will take a lot more memory.
That's my guess; it could be caused by something else, but it is worth checking.
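For intuition, here is a back-of-the-envelope sketch (with made-up tensor sizes, not the model's real ones) of why float16 roughly halves the memory needed for activations:

```python
# Illustrative only: made-up element counts, not the model's actual tensors.
BYTES_PER_ELEMENT = {"float16": 2, "float32": 4}

def activation_bytes(num_elements: int, dtype: str) -> int:
    """Memory needed to store num_elements activation values in dtype."""
    return num_elements * BYTES_PER_ELEMENT[dtype]

# e.g. a batch of 6 frames with 3 x 800 x 1333 input-sized activations
n = 6 * 3 * 800 * 1333
fp32 = activation_bytes(n, "float32")
fp16 = activation_bytes(n, "float16")
assert fp16 * 2 == fp32  # float16 stores the same values in half the space
print(f"float32: {fp32 / 2**20:.1f} MiB, float16: {fp16 / 2**20:.1f} MiB")
```

Note that mixed precision typically keeps a float32 master copy of the weights, so the savings come mostly from activations and intermediate buffers rather than from the parameters themselves.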
> Have you checked what DTYPE in the config file is? There's a check at the start of the train_net.py code that sees if it is float16 or 32. If set to 16 it will use mixed precision to keep operations within a float16 range. If you do everything with float32 then it will take a lot more memory.
No, this is certainly not the case; I have explicitly paid attention to it. Moreover, as I said, only one variable changes: the number of video clips per batch. Everything else remains constant so that I can clearly see the effect. Anyway, thanks a lot.
I have explicitly checked the amount of memory allocated for both "versions" before and after backward, and there really is a difference. I employed the pytorch_memlab package to accomplish this.
```
BEFORE BACKWARD

Version: fresh init
Total Tensors: 118041908  Used Memory: 450.38M
The allocated memory on cuda:0: 7.14G
Memory differs due to the matrix alignment or invisible gradient buffer tensors

Version: loaded from checkpoint
Total Tensors: 118041938  Used Memory: 450.38M
The allocated memory on cuda:0: 6.61G
Memory differs due to the matrix alignment or invisible gradient buffer tensors

---------------------------------------------------------------------------------
AFTER BACKWARD

Version: fresh init
Total Tensors: 118041907  Used Memory: 450.38M
The allocated memory on cuda:0: 456.13M
Memory differs due to the matrix alignment or invisible gradient buffer tensors

Version: loaded from checkpoint
Total Tensors: 118041937  Used Memory: 450.38M
The allocated memory on cuda:0: 457.76M
Memory differs due to the matrix alignment or invisible gradient buffer tensors
```
The difference can be consistently replicated on Google Colab, on my local machine as well as on the university server. But I simply do not know what may be causing it, since it is just "pure loading of weights".
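To put a number on the gap in the report above, here is a tiny helper (assuming pytorch_memlab's size suffixes denote binary units) that converts the printed sizes back to bytes:

```python
# Suffix-to-bytes map; assumes binary (1024-based) units in the report above.
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30}

def to_bytes(size: str) -> float:
    """Convert a size string like '7.14G' or '456.13M' to bytes."""
    return float(size[:-1]) * UNITS[size[-1]]

# Allocated memory before backward, as reported above:
fresh = to_bytes("7.14G")       # fresh init
checkpoint = to_bytes("6.61G")  # loaded from checkpoint
gap_mib = (fresh - checkpoint) / 2**20
print(f"fresh init allocates {gap_mib:.0f} MiB more before backward")
```

So the fresh-init run holds roughly half a gigabyte more before backward, even though the tensor counts differ by only 30.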
I have just "figured it out". The problem is that a batch has variable memory requirements even though the batch size remains constant: for example, the number of detection proposals changes from batch to batch. Once in a while, the training process may produce a batch that simply does not fit in your memory. I saw this happen right in front of my eyes while debugging the training algorithm and trying to understand some of its parts a little more deeply. So, all in all, my concern regarding variable memory requirements was justified.

My conclusive recommendation is to keep a little reserve in GPU memory, to avoid crashing the training after dozens of hours just because of a few extra bytes. It would be nice to estimate an upper bound on the possible memory requirements; that sounds like a task for me to ponder as part of digging into this architecture.
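A sketch of that headroom idea (the numbers are simulated; in a real run one would read per-iteration peaks from `torch.cuda.max_memory_allocated()` and reset them with `torch.cuda.reset_peak_memory_stats()`):

```python
import random

def required_budget(observed_peak: int, headroom: float = 0.10) -> int:
    """Observed peak usage plus a safety margin, per the recommendation above."""
    return int(observed_peak * (1 + headroom))

# Simulate per-iteration peaks that vary with the number of detection proposals
# (made-up values around ~7 GiB, just to illustrate the variability).
random.seed(0)
peaks = [int(7 * 2**30 * random.uniform(0.9, 1.1)) for _ in range(100)]

budget = required_budget(max(peaks))
assert budget > max(peaks)
print(f"observed peak: {max(peaks) / 2**30:.2f} GiB, "
      f"recommended budget: {budget / 2**30:.2f} GiB")
```

The 10% margin is an arbitrary choice; the point is simply that the budget must cover the worst observed batch plus some slack, not the average one.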
Hi @mondrasovic, thanks for your work; it really helps, especially those of us with limited computing resources (such as me). Just a quick follow-up: I have 2 RTX 2080Ti GPUs, and `RuntimeError: CUDA error: out of memory` happens all the time if I use the original configs provided by the author. If I would like to train and reproduce the results on MOT17 on 2 2080Ti GPUs, what are the proper parameters? I tried a smaller VIDEO_CLIPS_PER_BATCH (4), a smaller BASE_LR (0.005, reduced proportionally according to batch size), and a reduced NUM_WORKERS (2). I also used the "crowdhuman_train_fbox" and "crowdhuman_val_fbox" splits that you provided (thanks for that). What else should I do? Should I add "MOT17" to TRAIN after "crowdhuman_train_fbox", "crowdhuman_val_fbox"? Thanks.
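For what it's worth, the "reduce proportionally" step above follows the linear scaling rule. A sketch (the reference values of 16 clips at BASE_LR 0.02 are hypothetical, chosen only so the numbers line up with the ones mentioned above):

```python
def scale_lr(base_lr: float, base_clips: int, new_clips: int) -> float:
    """Scale the learning rate linearly with VIDEO_CLIPS_PER_BATCH."""
    return base_lr * new_clips / base_clips

# Hypothetical reference config: 16 clips per batch at BASE_LR 0.02.
# Dropping to 4 clips per batch then gives:
print(scale_lr(0.02, base_clips=16, new_clips=4))  # -> 0.005
```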
Hi.
I have just answered this partially in another issue.
But to focus more on the hardware aspect here: I had the very same hardware setup as you, two Nvidia GeForce RTX 2080Ti cards. And did I manage to reproduce the results during one year of experiments as part of my Ph.D.? No.
So I encourage you to think twice before trying to achieve the same thing, unless you can come up with some cunning plan to circumvent the obstacle of limited hardware. Or you could just assume that I made mistakes and try it anyway, which is a perfectly reasonable assumption. If that is the case, then good luck, and certainly keep me posted regarding your progress.
Hi @mondrasovic, thanks for your reply. I saw your reply in that issue, and I will follow it for further updates. Thanks.
Did you train it with 1 GPU?
Two GPUs
I have noticed that the memory requirements for the model change depending on whether the training starts from a freshly initialized model or a model initialized from a checkpoint.
I am training the model on an NVidia RTX 2080Ti GPU, which provides 11 GB of memory. To start the training without running into a

RuntimeError: CUDA error: out of memory

exception, I need to set the number of video clips per batch to 3. Since 2 random frames are sampled per clip, this produces an effective batch size of 6.

However, if I restart the training from a previously stored checkpoint, the memory consumption decreases to such an extent that I can add one more video clip per batch (4 clips, an effective batch size of 8) without crashing due to insufficient memory capacity.
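For concreteness, the two settings can be sketched like this (VIDEO_CLIPS_PER_BATCH is the name used elsewhere in this thread; the frames-per-clip name is my own placeholder, and the actual config keys may differ):

```python
# Hypothetical config values; only VIDEO_CLIPS_PER_BATCH is a name confirmed
# in this thread, and the real config keys may differ.
VIDEO_CLIPS_PER_BATCH = 3   # fresh init: at most 3 clips fit into 11 GB
FRAMES_PER_CLIP = 2         # 2 random frames are sampled per clip

batch_size = VIDEO_CLIPS_PER_BATCH * FRAMES_PER_CLIP
assert batch_size == 6  # effective batch size with a freshly initialized model

# When restarting from a checkpoint, one more clip fits:
assert (VIDEO_CLIPS_PER_BATCH + 1) * FRAMES_PER_CLIP == 8
```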
This does not seem to influence the model performance after training.
I have tried explicitly calling the garbage collector and emptying the CUDA cache, but to no avail.
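The cleanup in question is presumably the standard pair of calls (guarded here so the sketch also runs without torch installed or without a GPU):

```python
import gc

gc.collect()  # force a collection of unreachable Python objects

try:
    import torch
    if torch.cuda.is_available():
        # Release cached blocks held by PyTorch's caching allocator back to
        # the driver; this does not free tensors that are still referenced.
        torch.cuda.empty_cache()
except ImportError:
    pass  # torch not available in this illustrative environment
```

That neither call helps would be consistent with the extra memory being held by live allocations (e.g. gradient buffers) rather than by garbage or the allocator cache, though that is only a conjecture.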
My question is: what do you think might be causing this sort of memory leak? I have been working with this architecture for some time, and yet I haven't found a reasonable explanation so far.
At this point, my pipeline involves two separate configurations. First, I run the training for 100 iterations, save the checkpoint, and halt the training; then I restart it with a different configuration that allows a bigger batch size and let it train as required. This is cumbersome as well as highly unprofessional, and I would like to understand the underlying cause.
Thank you for your input.