rodrigovimieiro opened this issue 2 years ago
@rodrigovimieiro I'm not a maintainer or a contributor, but have you tried passing model.cuda() as input?
@devrimcavusoglu I don't have enough GPU memory for the model. That's why I was trying to estimate its memory usage.
@rodrigovimieiro Hi,
Memory allocation inside summary differs from the estimated total size for several reasons.
First, summary runs the forward pass under torch.no_grad() (see the forward_pass function in torchinfo.py), so it neither computes nor stores gradients (I think intermediate outputs are still stored, but I am not sure).
I think if you run one training step, the actual usage will come reasonably close to 50 GB. (Or you can try the same thing with a smaller input size.)
Another issue is that analytical predictions of GPU memory utilization can be off by as much as 30%.
For more on this, take a look at this paper. It also gives a formal description of memory utilization during forward/backward propagation.
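To illustrate the torch.no_grad() point, here is a minimal sketch using a small hypothetical conv block (not the UNet2 from this issue). It shows what no_grad changes: outputs produced inside it have no grad_fn, so autograd records no graph and keeps no reference to the intermediate activations it would otherwise save for backward.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for one DoubleConv-like block; not the model from the issue.
block = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(inplace=True),
    nn.Conv2d(8, 8, kernel_size=3, padding=1),
)
x = torch.rand(2, 1, 32, 32)

# Gradient tracking on (the default): autograd records a graph and saves
# the tensors each layer needs for backward, so those activations stay alive.
y_grad = block(x)

# Gradient tracking off: no graph is recorded, and intermediate outputs can
# be freed as soon as the next layer has consumed them.
with torch.no_grad():
    y_nograd = block(x)

print(y_grad.requires_grad, y_grad.grad_fn is not None)      # True True
print(y_nograd.requires_grad, y_nograd.grad_fn is not None)  # False False
```

The forward results have the same shape either way; only the bookkeeping for the backward pass differs, and that bookkeeping is where most of the training-time memory goes.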
I also tried your case with the following input:
```python
import torch

batch_size = 3
data_shape = (1, 1206, 333)
random_data = torch.rand((batch_size, *data_shape)).to('cuda')  # shape (3, 1, 1206, 333)
```
and obtained the following results, which seem close (at least for this choice of parameters). I saw larger discrepancies in other cases, but none as large as in your example.
```
====================================================================================================
Layer (type (var_name):depth-idx)       Output Shape             Param #
====================================================================================================
UNet2 (UNet2)                           [3, 1, 1206, 333]        --
├─DoubleConv (inc): 1-1                 [3, 96, 1206, 333]       --
│ └─Sequential (double_conv): 2-1       [3, 96, 1206, 333]       --
│ │ └─Conv2d (0): 3-1                   [3, 96, 1206, 333]       960
│ │ └─BatchNorm2d (1): 3-2              [3, 96, 1206, 333]       192
│ │ └─ReLU (5): 3-3                     [3, 96, 1206, 333]       --
│ │ └─Conv2d (3): 3-4                   [3, 96, 1206, 333]       83,040
│ │ └─BatchNorm2d (4): 3-5              [3, 96, 1206, 333]       192
│ │ └─ReLU (5): 3-6                     [3, 96, 1206, 333]       --
├─Down (down1): 1-2                     [3, 192, 603, 166]       --
│ └─Sequential (maxpool_conv): 2-2      [3, 192, 603, 166]       --
│ │ └─MaxPool2d (0): 3-7                [3, 96, 603, 166]        --
│ │ └─DoubleConv (1): 3-8               [3, 192, 603, 166]       --
│ │ │ └─Sequential (double_conv): 4-1   [3, 192, 603, 166]       --
│ │ │ │ └─Conv2d (0): 5-1               [3, 192, 603, 166]       166,080
│ │ │ │ └─BatchNorm2d (1): 5-2          [3, 192, 603, 166]       384
│ │ │ │ └─ReLU (5): 5-3                 [3, 192, 603, 166]       --
│ │ │ │ └─Conv2d (3): 5-4               [3, 192, 603, 166]       331,968
│ │ │ │ └─BatchNorm2d (4): 5-5          [3, 192, 603, 166]       384
│ │ │ │ └─ReLU (5): 5-6                 [3, 192, 603, 166]       --
├─Down (down2): 1-3                     [3, 384, 301, 83]        --
│ └─Sequential (maxpool_conv): 2-3      [3, 384, 301, 83]        --
│ │ └─MaxPool2d (0): 3-9                [3, 192, 301, 83]        --
│ │ └─DoubleConv (1): 3-10              [3, 384, 301, 83]        --
│ │ │ └─Sequential (double_conv): 4-2   [3, 384, 301, 83]        --
│ │ │ │ └─Conv2d (0): 5-7               [3, 384, 301, 83]        663,936
│ │ │ │ └─BatchNorm2d (1): 5-8          [3, 384, 301, 83]        768
│ │ │ │ └─ReLU (5): 5-9                 [3, 384, 301, 83]        --
│ │ │ │ └─Conv2d (3): 5-10              [3, 384, 301, 83]        1,327,488
│ │ │ │ └─BatchNorm2d (4): 5-11         [3, 384, 301, 83]        768
│ │ │ │ └─ReLU (5): 5-12                [3, 384, 301, 83]        --
├─Up (up1): 1-4                         [3, 192, 603, 166]       --
│ └─ConvTranspose2d (up): 2-4           [3, 192, 602, 166]       295,104
│ └─DoubleConv (conv): 2-5              [3, 192, 603, 166]       --
│ │ └─Sequential (double_conv): 3-11    [3, 192, 603, 166]       --
│ │ │ └─Conv2d (0): 4-3                 [3, 192, 603, 166]       663,744
│ │ │ └─BatchNorm2d (1): 4-4            [3, 192, 603, 166]       384
│ │ │ └─ReLU (5): 4-5                   [3, 192, 603, 166]       --
│ │ │ └─Conv2d (3): 4-6                 [3, 192, 603, 166]       331,968
│ │ │ └─BatchNorm2d (4): 4-7            [3, 192, 603, 166]       384
│ │ │ └─ReLU (5): 4-8                   [3, 192, 603, 166]       --
├─Up (up2): 1-5                         [3, 96, 1206, 333]       --
│ └─ConvTranspose2d (up): 2-6           [3, 96, 1206, 332]       73,824
│ └─DoubleConv (conv): 2-7              [3, 96, 1206, 333]       --
│ │ └─Sequential (double_conv): 3-12    [3, 96, 1206, 333]       --
│ │ │ └─Conv2d (0): 4-9                 [3, 96, 1206, 333]       165,984
│ │ │ └─BatchNorm2d (1): 4-10           [3, 96, 1206, 333]       192
│ │ │ └─ReLU (5): 4-11                  [3, 96, 1206, 333]       --
│ │ │ └─Conv2d (3): 4-12                [3, 96, 1206, 333]       83,040
│ │ │ └─BatchNorm2d (4): 4-13           [3, 96, 1206, 333]       192
│ │ │ └─ReLU (5): 4-14                  [3, 96, 1206, 333]       --
├─OutConv (outc): 1-6                   [3, 1, 1206, 333]        --
│ └─Conv2d (conv): 2-8                  [3, 1, 1206, 333]        97
====================================================================================================
Total params: 4,191,073
Trainable params: 4,191,073
Non-trainable params: 0
Total mult-adds (T): 1.18
====================================================================================================
Input size (MB): 4.82
Forward/backward pass size (MB): 13405.87
Params size (MB): 16.76
Estimated Total Size (MB): 13427.45
====================================================================================================
```
Actual GPU profile:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   75C    P0    34W /  70W |  12442MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```
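The "Forward/backward pass size (MB)" figure in the summary above can be reproduced from the output shapes alone. The sketch below assumes a counting rule inferred from these numbers, not taken from torchinfo's documentation: sum the float32 output sizes (4 bytes per element) of the leaf layers that have parameters (Conv2d, BatchNorm2d and ConvTranspose2d here; ReLU and MaxPool2d are skipped), double the total to account for gradients, and express it in MB of 10^6 bytes. The names PARAM_LAYER_OUTPUTS and forward_backward_mb are hypothetical. It also converts the 12442 MiB from nvidia-smi into the same unit, since nvidia-smi reports MiB (2^20 bytes).

```python
from math import prod

BYTES_PER_FLOAT32 = 4

# (output shape, number of parameterized leaf layers with that output shape),
# read off the summary table above. Parameter-free layers (ReLU, MaxPool2d)
# are left out, matching the assumed counting rule.
PARAM_LAYER_OUTPUTS = [
    ((3, 96, 1206, 333), 8),   # inc + up2.conv: 2x Conv2d + 2x BatchNorm2d each
    ((3, 96, 1206, 332), 1),   # up2.up ConvTranspose2d
    ((3, 1, 1206, 333), 1),    # outc Conv2d
    ((3, 192, 603, 166), 8),   # down1 + up1.conv
    ((3, 192, 602, 166), 1),   # up1.up ConvTranspose2d
    ((3, 384, 301, 83), 4),    # down2
]


def forward_backward_mb(layer_outputs, grad_factor=2):
    """Sum output bytes over the listed layers, times grad_factor, in MB (1e6 bytes)."""
    total_bytes = sum(
        prod(shape) * count * BYTES_PER_FLOAT32 for shape, count in layer_outputs
    )
    return total_bytes * grad_factor / 1e6


input_mb = prod((3, 1, 1206, 333)) * BYTES_PER_FLOAT32 / 1e6
params_mb = 4_191_073 * BYTES_PER_FLOAT32 / 1e6
fwd_bwd_mb = forward_backward_mb(PARAM_LAYER_OUTPUTS)
nvidia_smi_mb = 12442 * 2**20 / 1e6  # 12442 MiB from the nvidia-smi output above

print(f"Input size (MB): {input_mb:.2f}")                    # 4.82
print(f"Forward/backward pass size (MB): {fwd_bwd_mb:.2f}")  # 13405.87
print(f"Params size (MB): {params_mb:.2f}")                  # 16.76
print(f"Estimated Total Size (MB): {input_mb + fwd_bwd_mb + params_mb:.2f}")  # 13427.45
print(f"nvidia-smi usage (MB): {nvidia_smi_mb:.2f}")         # 13046.38
```

Because this estimate depends only on output shapes, it grows roughly linearly with batch size and with the number of input pixels.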
Hi @mert-kurttutan,
Thanks for the information. Could you try with the image resolution I posted, please?
Yes, I get the same summary results as you. But to include the effect of the forward/backward pass in the nvidia-smi results, we need to run with gradients enabled and call backward(). I don't have a big enough GPU for that, though: it gives an OOM error immediately.
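One way to get the grad-enabled number without reading nvidia-smi is PyTorch's own allocator statistics. The sketch below uses a hypothetical helper, peak_memory_mb, and a placeholder out.mean() loss: it resets the peak counter with torch.cuda.reset_peak_memory_stats, runs either a no_grad forward or a forward plus backward, and reads torch.cuda.max_memory_allocated. It still needs a GPU that fits the pass, so the full 4096 x 3328 input would hit the same OOM; running it at a few smaller resolutions shows how the gap between the two cases scales. Expect nvidia-smi to report more than this figure, since it also counts the CUDA context and memory the caching allocator has reserved but not handed out.

```python
import torch
import torch.nn as nn


def peak_memory_mb(model: nn.Module, x: torch.Tensor, with_backward: bool) -> float:
    """Peak memory (MB, 1e6 bytes) allocated by PyTorch's CUDA allocator during one pass.

    Hypothetical helper: model and x must already be on the same CUDA device.
    """
    device = x.device
    torch.cuda.synchronize(device)
    torch.cuda.reset_peak_memory_stats(device)
    if with_backward:
        model.train()
        out = model(x)
        out.mean().backward()  # placeholder loss, only used to trigger backward
        model.zero_grad(set_to_none=True)  # free parameter gradients afterwards
    else:
        model.eval()
        with torch.no_grad():
            model(x)
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / 1e6


if torch.cuda.is_available():
    # Small stand-in network and input; swap in the real model and a reduced input size.
    net = nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
        nn.Conv2d(16, 16, 3, padding=1),
    ).cuda()
    x = torch.rand(2, 1, 256, 256, device="cuda")
    print("no_grad forward :", peak_memory_mb(net, x, with_backward=False))
    print("forward+backward:", peak_memory_mb(net, x, with_backward=True))
```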
Yes, but I am running summary in eval mode, so no gradients are calculated, and I would expect results similar to the real measurement, right?
```python
from torchinfo import summary
summary(model, input_size=(1, 1, 4096, 3328), mode='eval', device=device)
```
Actually, gradients are not calculated in either mode, since torch.no_grad is used for both train and eval mode (see the forward_pass function in torchinfo.py).
I also checked: GPU memory usage stays the same when changing the mode.
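Relatedly, train/eval mode and gradient tracking are separate switches in PyTorch, which matters for the expectation that eval mode means no gradients. model.eval() only sets module.training to False (changing how layers such as BatchNorm2d and Dropout behave); autograd still records a graph unless the forward pass runs under torch.no_grad() (or torch.inference_mode()). A minimal sketch with a small hypothetical model:

```python
import torch
import torch.nn as nn

# Hypothetical small model; any module with trainable parameters behaves the same way.
model = nn.Sequential(nn.Conv2d(1, 4, 3, padding=1), nn.BatchNorm2d(4), nn.ReLU())
x = torch.rand(1, 1, 16, 16)

model.eval()                    # only sets .training = False on every submodule
y_eval = model(x)               # autograd still builds a graph here

with torch.no_grad():           # this is what actually disables graph recording
    y_eval_nograd = model(x)

model.train()
with torch.no_grad():
    y_train_nograd = model(x)   # train mode, but still no graph

print(model.training, y_eval.requires_grad)                      # True True
print(y_eval_nograd.requires_grad, y_train_nograd.requires_grad)  # False False
```

So the mode argument changes layer behavior, while the no_grad context is what decides whether activations are kept for a backward pass.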
Describe the bug
Memory estimation is inconsistent with actual GPU memory utilization.
To Reproduce
Expected behavior
When forwarding an image of size (1, 1, 4096, 3328) in testing mode (i.e., with model.eval()), the reported GPU memory is approximately 15 GB. However, torchinfo.summary reports 50 GB, even though eval is passed as an argument:
```python
summary(model, input_size=(1, 1, 4096, 3328), mode='eval', device=device)
```