frgfm / torch-scan

Seamless analysis of your PyTorch models (RAM usage, FLOPs, MACs, receptive field, etc.)
https://frgfm.github.io/torch-scan/latest
Apache License 2.0

Negative RAM usage #63

Closed · joonas-yoon closed this issue 2 years ago

joonas-yoon commented 2 years ago

Bug description

I have been following the DCGAN tutorial with PyTorch and ran it in my Jupyter environment.

I tried to print the summary and got a result with negative RAM usage, e.g. Framework & CUDA overhead: -390.33 Mb.

I have 4 GPUs but use only 1 GPU for this script, via this configuration:

os.environ['CUDA_LAUNCH_BLOCKING'] = "1"  # launch CUDA kernels synchronously
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU to this process

Here is the definition of the model.

nz = 200 # Size of z latent vector
ngf = 64 # Size of feature maps in generator
nc = 3 # Number of channels in the training images.

class Generator(nn.Module):
    def __init__(self):
        super(Generator, self).__init__()
        self.main = nn.Sequential(
            nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
            nn.Tanh()
        )

    def forward(self, input):
        return self.main(input)

Code snippet to reproduce the bug
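
The original report does not include the model instantiation or the summary call. A hypothetical completion of the snippet above (assuming the model is moved to the single visible GPU, as in the DCGAN tutorial; none of this appears in the original report) could look like:

import torch
import torch.nn as nn   # required by the Generator definition above
from torchscan import summary

device = torch.device("cuda:0")  # assumption: the only GPU visible to the process
netG = Generator().to(device)    # hypothetical instantiation, not shown in the report

summary(netG, (nz, 1, 1))        # input shape without the batch dimension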

The result of summary(netG, (nz, 1, 1)):

__________________________________________________________________________________________
Layer                        Type                  Output Shape              Param #        
==========================================================================================
generator                    Generator             (-1, 3, 64, 64)           0              
├─main                       Sequential            (-1, 3, 64, 64)           0              
|    └─0                     ConvTranspose2d       (-1, 512, 4, 4)           1,638,400      
|    └─1                     BatchNorm2d           (-1, 512, 4, 4)           2,049          
|    └─2                     ReLU                  (-1, 512, 4, 4)           0              
|    └─3                     ConvTranspose2d       (-1, 256, 8, 8)           2,097,152      
|    └─4                     BatchNorm2d           (-1, 256, 8, 8)           1,025          
|    └─5                     ReLU                  (-1, 256, 8, 8)           0              
|    └─6                     ConvTranspose2d       (-1, 128, 16, 16)         524,288        
|    └─7                     BatchNorm2d           (-1, 128, 16, 16)         513            
|    └─8                     ReLU                  (-1, 128, 16, 16)         0              
|    └─9                     ConvTranspose2d       (-1, 64, 32, 32)          131,072        
|    └─10                    BatchNorm2d           (-1, 64, 32, 32)          257            
|    └─11                    ReLU                  (-1, 64, 32, 32)          0              
|    └─12                    ConvTranspose2d       (-1, 3, 64, 64)           3,072          
|    └─13                    Tanh                  (-1, 3, 64, 64)           0              
==========================================================================================
Trainable params: 4,395,904
Non-trainable params: 0
Total params: 4,395,904
------------------------------------------------------------------------------------------
Model size (params + buffers): 16.78 Mb
Framework & CUDA overhead: -390.33 Mb
Total RAM usage: -373.56 Mb
------------------------------------------------------------------------------------------
Floating Point Operations on forward: 883.95 MFLOPs
Multiply-Accumulations on forward: 442.07 MMACs
Direct memory accesses on forward: 446.39 MDMAs
__________________________________________________________________________________________

And this is the result from another module (torchsummary) for comparison.

The result of torchsummary.summary(netG, (nz, 1, 1)):

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
   ConvTranspose2d-1            [-1, 512, 4, 4]       1,638,400
       BatchNorm2d-2            [-1, 512, 4, 4]           1,024
              ReLU-3            [-1, 512, 4, 4]               0
   ConvTranspose2d-4            [-1, 256, 8, 8]       2,097,152
       BatchNorm2d-5            [-1, 256, 8, 8]             512
              ReLU-6            [-1, 256, 8, 8]               0
   ConvTranspose2d-7          [-1, 128, 16, 16]         524,288
       BatchNorm2d-8          [-1, 128, 16, 16]             256
              ReLU-9          [-1, 128, 16, 16]               0
  ConvTranspose2d-10           [-1, 64, 32, 32]         131,072
      BatchNorm2d-11           [-1, 64, 32, 32]             128
             ReLU-12           [-1, 64, 32, 32]               0
  ConvTranspose2d-13            [-1, 3, 64, 64]           3,072
             Tanh-14            [-1, 3, 64, 64]               0
================================================================
Total params: 4,395,904
Trainable params: 4,395,904
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 3.00
Params size (MB): 16.77
Estimated Total Size (MB): 19.77
----------------------------------------------------------------

Error traceback

No error message.

Environment

I would prefer not to share environment details; sorry, I am restricted by a security agreement.

frgfm commented 2 years ago

Hi @joonas-yoon :wave:

Thanks for reporting this! That's an interesting situation; my guess is that the GPU process memory computation has some issues in multi-GPU environments. So I see two things to do:

I'll try to solve this shortly; if you don't mind, I may ask you to try your snippet on the upcoming fix branch to check whether this was indeed the source of the problem :)

In the meantime, your snippet is missing some imports and the object instantiation; could you make it fully executable, please? (I'm mostly interested in the model instantiation and whether you moved it to one of your GPUs.)
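
For context, a negative overhead figure is consistent with the overhead being estimated as the per-process GPU memory minus the model's own parameter and buffer footprint: if the memory probe misses the device actually hosting the model (or reports nothing), the subtraction goes negative. A rough illustration of that failure mode, not torchscan's actual implementation:

import torch

def rough_overhead_mb(model: torch.nn.Module, device: int = 0) -> float:
    # Bytes directly attributable to the model itself
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_bytes = sum(b.numel() * b.element_size() for b in model.buffers())
    # Memory currently held by the CUDA caching allocator on `device`.
    # If this probes a GPU other than the one hosting the model, it can be
    # smaller than the model's own footprint, making the result negative.
    process_bytes = torch.cuda.memory_reserved(device)
    return (process_bytes - (param_bytes + buffer_bytes)) / 1024 ** 2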

joonas-yoon commented 2 years ago

Hi @frgfm

I got exactly the same problem on a Kaggle notebook with 1 GPU.

Here is the link; you can see the output there: https://www.kaggle.com/joonasyoon/wgan-cp-with-celeba-and-lsun-dataset

When there are two models and I summarize both of them, the first one reports zero RAM usage and the second one reports a negative value.
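
A minimal sketch of that scenario, using two small stand-in models rather than the notebook's actual generator and critic:

import torch.nn as nn
from torchscan import summary

# Two stand-in models; the Kaggle notebook uses a WGAN generator and critic.
model_a = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).cuda()
model_b = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).cuda()

summary(model_a, (3, 64, 64))  # first summary: overhead reported as ~0 Mb
summary(model_b, (3, 64, 64))  # second summary: where the negative value showed up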

frgfm commented 2 years ago

Alright, I think I found a solution in #64 :+1: @joonas-yoon, would you mind installing the "negative-ram" branch and checking whether that solves your problem?

Side note: what you experienced on Kaggle (the second model getting a different RAM value) won't be fixed by this, since GPU RAM usage is reported blended across all objects in the process.

joonas-yoon commented 2 years ago

Oh, that is good news! I will install it right away to check :)

And thanks for the informative side note.