amazon-science / siam-mot

SiamMOT: Siamese Multi-Object Tracking

Variable memory requirements #35

Closed mondrasovic closed 2 years ago

mondrasovic commented 2 years ago

I have noticed that the memory requirements for the model change depending on whether the training starts from a freshly initialized model or a model initialized from a checkpoint.

I am training the model on an NVIDIA RTX 2080Ti GPU, which provides 11 GB of memory. To start training without running into a RuntimeError: CUDA error: out of memory exception, I need to set the number of video clips per batch to 3. More specifically, in terms of configuration settings:

SOLVER:
  VIDEO_CLIPS_PER_BATCH: 3

This yields a batch size of 6, since there are 2 random frames per clip, as given by the configuration below:

VIDEO:
  RANDOM_FRAMES_PER_CLIP: 2

However, if I restart the training from a previously stored checkpoint, the memory consumption decreases to such an extent that I can add one more video clip per batch without crashing due to insufficient memory. More concretely, my configuration then allows the following:

SOLVER:
  VIDEO_CLIPS_PER_BATCH: 4

This does not seem to influence the model performance after training.

I have tried explicitly calling the garbage collector and emptying the CUDA cache using

import gc
import torch

gc.collect()              # force a Python garbage-collection pass
torch.cuda.empty_cache()  # release unused cached blocks held by the CUDA caching allocator

but to no avail.

My question is: what do you think might be causing this sort of memory leak? I have been working with this architecture for some time, yet I haven't found a reasonable explanation so far.

At this point, my pipeline involves two separate configurations. First, I run the training for 100 iterations, save the checkpoint, halt the training, and then restart it with a different configuration allowing a bigger batch size, and let it train as required. It is pretty cumbersome as well as highly unprofessional. I would like to understand the underlying cause.

Thank you for your input.

L-Ramos commented 2 years ago

Have you checked what DTYPE is set to in the config file? There's a check at the start of train_net.py that looks at whether it is float16 or float32. If it is set to float16, training uses mixed precision to keep operations within float16 range. If you do everything in float32, it will take a lot more memory.

That's just my guess, and it could be caused by something else, but it is worth checking.
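For illustration, this is roughly what a float16 DTYPE buys you in terms of memory with a generic torch.cuda.amp training step; it is only a sketch of the general mechanism, not necessarily how train_net.py wires it up internally, and the model and data below are placeholders:

# Generic mixed-precision training step with torch.cuda.amp; illustrative only,
# not necessarily the exact AMP backend used by train_net.py.
import torch

model = torch.nn.Linear(1024, 1024).cuda()    # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()          # rescales the loss to avoid float16 underflow

inputs = torch.randn(8, 1024, device="cuda")
targets = torch.randn(8, 1024, device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():               # eligible ops run in float16, shrinking activation memory
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
scaler.scale(loss).backward()                 # backward pass on the scaled loss
scaler.step(optimizer)                        # unscales gradients, then steps the optimizer
scaler.update()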

mondrasovic commented 2 years ago

Have you checked what DTYPE is set to in the config file? There's a check at the start of train_net.py that looks at whether it is float16 or float32. If it is set to float16, training uses mixed precision to keep operations within float16 range. If you do everything in float32, it will take a lot more memory.

No, this is certainly not the case; I have explicitly paid attention to it. Moreover, as I said, only one variable changes, namely the number of video clips per batch. Everything else remains constant so that I can clearly see the effect. Anyway, thanks a lot.

I have explicitly checked the amount of memory allocated in both "versions" before and after the backward pass, and there really is a difference. I employed the pytorch_memlab package to accomplish this (a rough sketch of the measurement follows the report below).

BEFORE BACKWARD
Version: fresh init
  Total Tensors: 118041908  Used Memory: 450.38M
  The allocated memory on cuda:0: 7.14G
  Memory differs due to the matrix alignment or invisible gradient buffer tensors

Version: loaded from checkpoint
  Total Tensors: 118041938  Used Memory: 450.38M
  The allocated memory on cuda:0: 6.61G
  Memory differs due to the matrix alignment or invisible gradient buffer tensors

---------------------------------------------------------------------------------
AFTER BACKWARD
Version: fresh init
  Total Tensors: 118041907  Used Memory: 450.38M
  The allocated memory on cuda:0: 456.13M
  Memory differs due to the matrix alignment or invisible gradient buffer tensors

Version: loaded from checkpoint
  Total Tensors: 118041937  Used Memory: 450.38M
  The allocated memory on cuda:0: 457.76M
  Memory differs due to the matrix alignment or invisible gradient buffer tensors

The difference can be consistently reproduced on Google Colab, on my local machine, and on the university server. But I simply do not know what may be causing it, since it is just "pure loading of weights".
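For completeness, the snapshots above were obtained with pytorch_memlab's MemReporter, roughly as in the sketch below; the tiny model is just a placeholder, not the actual SiamMOT training loop:

# Rough sketch of the measurement with pytorch_memlab's MemReporter;
# the small linear model below is a placeholder for the actual tracker.
import torch
from pytorch_memlab import MemReporter

model = torch.nn.Linear(512, 512).cuda()
reporter = MemReporter(model)

x = torch.randn(4, 512, device="cuda")
loss = model(x).sum()

reporter.report()    # "BEFORE BACKWARD" snapshot
loss.backward()
reporter.report()    # "AFTER BACKWARD" snapshot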

mondrasovic commented 2 years ago

I have just "figured it out". The problem is that a batch has variable memory requirements even though the batch size remains constant; for example, the number of detection proposals changes from batch to batch. Once in a while, the training process may therefore produce a batch that simply does not fit into your GPU memory. I have seen this happen right in front of my eyes while debugging the training algorithm and trying to understand some parts of it a little deeper. So, all in all, my concern regarding variable memory requirements was justified.

My conclusive recommendation is to keep a little reserve in GPU memory, to avoid crashing the training after dozens of hours just because of a few extra bytes. It would also be nice to estimate an upper bound on the possible memory requirements; that sounds like a task for me to ponder as part of digging deeper into this architecture.
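In the meantime, a cheap way to put a number on that reserve is to track the per-batch peak allocation during training. A minimal sketch follows; data_loader and train_one_batch are placeholders standing in for whatever the real training loop provides:

# Minimal sketch for recording the per-batch peak GPU memory; data_loader and
# train_one_batch are placeholders for the actual training loop.
import torch

running_max = 0
for i, batch in enumerate(data_loader):
    torch.cuda.reset_peak_memory_stats()            # reset the peak counter for this batch
    train_one_batch(batch)                          # placeholder: forward + backward + optimizer step
    batch_peak = torch.cuda.max_memory_allocated()  # highest allocation seen during this batch
    running_max = max(running_max, batch_peak)
    if i % 100 == 0:
        print(f"iter {i}: batch peak {batch_peak / 2**30:.2f} GiB, "
              f"running max {running_max / 2**30:.2f} GiB")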

Leo63963 commented 2 years ago

Hi @mondrasovic Thanks for your work, it really helps, especially those of us with limited computing resources (such as me). Just a quick follow-up: I have two 2080Ti GPUs, and RuntimeError: CUDA error: out of memory happens all the time if I use the original configs provided by the author. May I kindly ask: if I would like to train and reproduce the results on MOT17 on two 2080Ti GPUs, what are the proper parameters? I tried a smaller VIDEO_CLIPS_PER_BATCH (4), a smaller BASE_LR (0.005, reduced proportionally to the batch size), and a reduced NUM_WORKERS (2). I also used the "crowdhuman_train_fbox" and "crowdhuman_val_fbox" annotations that you provided (thanks for that). What else should I do? Should I add "MOT17" to TRAIN after "crowdhuman_train_fbox", "crowdhuman_val_fbox"? Thanks.

mondrasovic commented 2 years ago

Hi.

I have just answered this partially in another issue.

But to focus more on the hardware aspect here: I had the very same hardware setup as you, two NVIDIA GeForce RTX 2080Ti GPUs. And have I managed to reproduce the results during one year of experiments as part of my Ph.D.? No.

So, I encourage you to think twice before trying to achieve the same thing, unless you think you can come up with some cunning plan to circumvent the obstacle of limited hardware. Or you could just assume that I made mistakes and try it anyway, which is a perfectly reasonable assumption. If that is the case, then good luck, and certainly keep me posted regarding your progress.

Leo63963 commented 2 years ago

Hi @mondrasovic Thanks for your reply. I saw your answer in that issue, and I will follow it there for further updates. Thanks.

noreenanwar commented 2 years ago

Did you train it with 1 GPU?

Leo63963 commented 2 years ago

Two GPUs