TylerYep / torchinfo

View model summaries in PyTorch!

Memory estimation inconsistent with actual GPU memory utilization #149

Open rodrigovimieiro opened 2 years ago

rodrigovimieiro commented 2 years ago

Describe the bug
Memory estimation is inconsistent with actual GPU memory utilization.

To Reproduce

Expected behavior
When forwarding an image of size (1, 1, 4096, 3328) in testing mode, i.e., with model.eval() enabled, the reported GPU memory is approximately 15 GB:

[Screenshot from 2022-07-01 10-44-52: actual GPU memory usage of roughly 15 GB]

However, torchinfo.summary reports about 50 GB, even though eval mode is passed as an argument:

summary(model, input_size=(1, 1, 4096, 3328), mode='eval', device=device)

[Screenshot from 2022-07-01 10-46-48: torchinfo summary reporting an estimated total size of roughly 50 GB]
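
For reference, a minimal sketch of how the actual usage could be measured programmatically, assuming model and device are the same objects passed to summary (the measurement calls are standard PyTorch; the model itself is not shown in this issue):

import torch

model.eval()
torch.cuda.reset_peak_memory_stats(device)

with torch.no_grad():
    x = torch.rand(1, 1, 4096, 3328, device=device)
    _ = model(x)

# Peak memory held by tensors during the eval forward pass, in GiB
print(torch.cuda.max_memory_allocated(device) / 1024**3)

Note that nvidia-smi will typically read somewhat higher than this number, since it also counts the CUDA context and memory held by PyTorch's caching allocator.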

devrimcavusoglu commented 1 year ago

@rodrigovimieiro I'm not a maintainer or a contributor, but have you tried passing model.cuda() as the input?

rodrigovimieiro commented 1 year ago

@devrimcavusoglu I don't have enough GPU memory for the model. That's why I was trying to estimate it.

mert-kurttutan commented 1 year ago

@rodrigovimieiro Hi, the memory actually allocated while running summary differs from the estimated total size for several reasons. First, summary runs the forward pass under torch.no_grad() (see the forward_pass function in torchinfo.py), so it does not calculate or store gradient values (I think intermediate output values are still stored, but I am not sure).

I think if you try one step of training, it will give something reasonably close to 50 GB. (Or you can try the same thing with a smaller input size.)
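
A rough sketch of such a check, with a placeholder loss and a hypothetical SGD optimizer standing in for the real training setup (which is not part of this issue):

import torch

model.train()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # hypothetical optimizer, for illustration only
torch.cuda.reset_peak_memory_stats(device)

x = torch.rand(1, 1, 4096, 3328, device=device)
out = model(x)
loss = out.mean()   # placeholder loss, just so there is something to backpropagate
loss.backward()
optimizer.step()

# Peak memory in GiB; with gradients and activations kept for backward,
# this should land much closer to the summary estimate
print(torch.cuda.max_memory_allocated(device) / 1024**3)

With the (1, 1, 4096, 3328) input this will likely hit OOM on a 15 GB card, so a smaller input size may be needed to run the comparison at all.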

Another issue is that predictions based on analytical results can have errors as high as 30% relative to actual GPU memory utilization.

Regarding this, you can take a look at this paper here. It also gives a formal description of memory utilization during the forward/backward pass.
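
For a sense of where such analytical numbers come from, here is a minimal sketch of the byte counting involved, assuming float32 tensors and counting each layer output once for the forward pass and once more for the backward pass:

import math

def tensor_mb(shape, bytes_per_elem=4):
    # Size of a single float32 tensor of the given shape, in MB
    return math.prod(shape) * bytes_per_elem / 1e6

# e.g. the first Conv2d output in the summary further below: [3, 96, 1206, 333]
fwd = tensor_mb((3, 96, 1206, 333))
print(fwd, "MB forward only")
print(2 * fwd, "MB if the activation is also counted for backward")

Summing terms like these over all layer outputs is, roughly, how the forward/backward pass size in the summary is produced; what such an estimate cannot see is allocator behavior (buffers being freed and reused, kernel workspace memory), which is one source of the discrepancy.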

I also tried your case with the following input choice:

batch_size = 3
data_shape = (1, 1206, 333)
random_data = torch.rand((batch_size, *data_shape)).to('cuda')

and obtained the following results, which seem close (at least for this choice of parameters). I had larger discrepancies in other cases, but they were not as significant as in your example.

====================================================================================================
Layer (type (var_name):depth-idx)                  Output Shape              Param #
====================================================================================================
UNet2 (UNet2)                                      [3, 1, 1206, 333]         --
├─DoubleConv (inc): 1-1                            [3, 96, 1206, 333]        --
│    └─Sequential (double_conv): 2-1               [3, 96, 1206, 333]        --
│    │    └─Conv2d (0): 3-1                        [3, 96, 1206, 333]        960
│    │    └─BatchNorm2d (1): 3-2                   [3, 96, 1206, 333]        192
│    │    └─ReLU (5): 3-3                          [3, 96, 1206, 333]        --
│    │    └─Conv2d (3): 3-4                        [3, 96, 1206, 333]        83,040
│    │    └─BatchNorm2d (4): 3-5                   [3, 96, 1206, 333]        192
│    │    └─ReLU (5): 3-6                          [3, 96, 1206, 333]        --
├─Down (down1): 1-2                                [3, 192, 603, 166]        --
│    └─Sequential (maxpool_conv): 2-2              [3, 192, 603, 166]        --
│    │    └─MaxPool2d (0): 3-7                     [3, 96, 603, 166]         --
│    │    └─DoubleConv (1): 3-8                    [3, 192, 603, 166]        --
│    │    │    └─Sequential (double_conv): 4-1     [3, 192, 603, 166]        --
│    │    │    │    └─Conv2d (0): 5-1              [3, 192, 603, 166]        166,080
│    │    │    │    └─BatchNorm2d (1): 5-2         [3, 192, 603, 166]        384
│    │    │    │    └─ReLU (5): 5-3                [3, 192, 603, 166]        --
│    │    │    │    └─Conv2d (3): 5-4              [3, 192, 603, 166]        331,968
│    │    │    │    └─BatchNorm2d (4): 5-5         [3, 192, 603, 166]        384
│    │    │    │    └─ReLU (5): 5-6                [3, 192, 603, 166]        --
├─Down (down2): 1-3                                [3, 384, 301, 83]         --
│    └─Sequential (maxpool_conv): 2-3              [3, 384, 301, 83]         --
│    │    └─MaxPool2d (0): 3-9                     [3, 192, 301, 83]         --
│    │    └─DoubleConv (1): 3-10                   [3, 384, 301, 83]         --
│    │    │    └─Sequential (double_conv): 4-2     [3, 384, 301, 83]         --
│    │    │    │    └─Conv2d (0): 5-7              [3, 384, 301, 83]         663,936
│    │    │    │    └─BatchNorm2d (1): 5-8         [3, 384, 301, 83]         768
│    │    │    │    └─ReLU (5): 5-9                [3, 384, 301, 83]         --
│    │    │    │    └─Conv2d (3): 5-10             [3, 384, 301, 83]         1,327,488
│    │    │    │    └─BatchNorm2d (4): 5-11        [3, 384, 301, 83]         768
│    │    │    │    └─ReLU (5): 5-12               [3, 384, 301, 83]         --
├─Up (up1): 1-4                                    [3, 192, 603, 166]        --
│    └─ConvTranspose2d (up): 2-4                   [3, 192, 602, 166]        295,104
│    └─DoubleConv (conv): 2-5                      [3, 192, 603, 166]        --
│    │    └─Sequential (double_conv): 3-11         [3, 192, 603, 166]        --
│    │    │    └─Conv2d (0): 4-3                   [3, 192, 603, 166]        663,744
│    │    │    └─BatchNorm2d (1): 4-4              [3, 192, 603, 166]        384
│    │    │    └─ReLU (5): 4-5                     [3, 192, 603, 166]        --
│    │    │    └─Conv2d (3): 4-6                   [3, 192, 603, 166]        331,968
│    │    │    └─BatchNorm2d (4): 4-7              [3, 192, 603, 166]        384
│    │    │    └─ReLU (5): 4-8                     [3, 192, 603, 166]        --
├─Up (up2): 1-5                                    [3, 96, 1206, 333]        --
│    └─ConvTranspose2d (up): 2-6                   [3, 96, 1206, 332]        73,824
│    └─DoubleConv (conv): 2-7                      [3, 96, 1206, 333]        --
│    │    └─Sequential (double_conv): 3-12         [3, 96, 1206, 333]        --
│    │    │    └─Conv2d (0): 4-9                   [3, 96, 1206, 333]        165,984
│    │    │    └─BatchNorm2d (1): 4-10             [3, 96, 1206, 333]        192
│    │    │    └─ReLU (5): 4-11                    [3, 96, 1206, 333]        --
│    │    │    └─Conv2d (3): 4-12                  [3, 96, 1206, 333]        83,040
│    │    │    └─BatchNorm2d (4): 4-13             [3, 96, 1206, 333]        192
│    │    │    └─ReLU (5): 4-14                    [3, 96, 1206, 333]        --
├─OutConv (outc): 1-6                              [3, 1, 1206, 333]         --
│    └─Conv2d (conv): 2-8                          [3, 1, 1206, 333]         97
====================================================================================================
Total params: 4,191,073
Trainable params: 4,191,073
Non-trainable params: 0
Total mult-adds (T): 1.18
====================================================================================================
Input size (MB): 4.82
Forward/backward pass size (MB): 13405.87
Params size (MB): 16.76
Estimated Total Size (MB): 13427.45
====================================================================================================
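
The fixed components of the estimate can be sanity-checked by hand, assuming 4 bytes per float32 value:

# Params size: 4,191,073 parameters * 4 bytes
print(4_191_073 * 4 / 1e6)            # ~16.76 MB, matching "Params size" above

# Input size: one (3, 1, 1206, 333) float32 tensor
print(3 * 1 * 1206 * 333 * 4 / 1e6)   # ~4.82 MB, matching "Input size" above

The remaining ~13.4 GB is the forward/backward (activation) term, which is presumably where most of the disagreement with the nvidia-smi reading comes from.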

Actual GPU profile:


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   75C    P0    34W /  70W |  12442MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
rodrigovimieiro commented 1 year ago

Hi @mert-kurttutan,

Thanks for the information. Could you try with the image resolution I posted, please?

mert-kurttutan commented 1 year ago

Yeah, I do get the same summary results as you. But to include the effect of the forward/backward pass in the nvidia-smi results, we would need to run it with grad enabled and run backward. I don't have a big enough GPU to do this; it gives an OOM error immediately.

rodrigovimieiro commented 1 year ago

Yes, but I am setting summary to eval mode, so gradients are not calculated, and I would expect results similar to the real ones, right?

summary(model, input_size=(1, 1, 4096, 3328), mode='eval', device=device)

mert-kurttutan commented 1 year ago

Actually, gradients are not calculated in any of the modes, since torch.no_grad is used for both train and eval mode (see the forward_pass function in torchinfo.py). I also checked: GPU memory usage remains the same when changing the mode.
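
A quick way to confirm that, reusing the model and device from above with the smaller (3, 1, 1206, 333) input so it fits in memory, and changing only the train/eval flag between runs (a minimal sketch, not part of torchinfo itself):

import torch

def peak_mb(train_mode):
    model.train(train_mode)
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():   # torchinfo's forward_pass also wraps the call in no_grad
        _ = model(torch.rand(3, 1, 1206, 333, device=device))
    return torch.cuda.max_memory_allocated(device) / 1e6

print("train mode peak:", peak_mb(True), "MB")
print("eval mode peak: ", peak_mb(False), "MB")

Both numbers should come out essentially the same, since no_grad prevents autograd from holding on to intermediate activations in either mode.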