MadryLab / robustness

A library for experimenting with, training and evaluating neural networks, with a focus on adversarial robustness.

Robust VGG19 on 3GB VRAM #82

Closed · GlebSBrykin closed this issue 3 years ago

GlebSBrykin commented 3 years ago

Once again, greetings to all present! All this time I have been studying the capabilities of the ready-made robust resnet50 released by the authors of this work, comparing it with the standard resnet50. The features learned by the robust model are fantastic compared to those of the regular model! I have also been thinking about how to train a robust VGG19 within the constraints of my PC: 3 GB of VRAM and 8 GB of RAM. In addition to what the authors suggested in my previous issue, my ideas so far are the following:

  1. Move the fully connected part to the CPU, while the convolutional part stays on the GPU. The purpose of this is to unload video memory. The convolutional layers contain a small number of parameters (about 80 MB), but the computational cost of convolutions is very high, so it is important to run them on the GPU. In contrast, the fully connected layers have very low computational cost, but their parameters take a huge amount of memory: almost 500 MB. They can therefore live on the CPU in cheaper RAM.
  2. Divide a large batch into small parts and make several forward and backward passes before each optimizer step. For example, if training requires a batch size of 512, we can perform 64 forward-backward passes on batches of size 8 in a row and only then call the optimizer step (a minimal sketch follows this list). Given that there are no normalization layers in VGG, this division will not affect the learning process or its stability, but it will radically reduce the amount of memory required without compromising quality. I would like to hear the opinion of the authors and other participants on this.
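
For reference, here is a minimal PyTorch sketch of option 2 (gradient accumulation). The VGG19 model, dummy data, and hyperparameters are stand-ins, not the library's actual training loop:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import vgg19  # placeholder model; swap in your own

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = vgg19(num_classes=1000).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Dummy data standing in for the real dataset.
dataset = TensorDataset(torch.randn(512, 3, 224, 224),
                        torch.randint(0, 1000, (512,)))
loader = DataLoader(dataset, batch_size=8)  # micro-batches of 8

accum_steps = 64  # 64 micro-batches of 8 = effective batch of 512

model.train()
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(loader):
    loss = criterion(model(inputs.to(device)), targets.to(device))
    # Scale the loss so the accumulated gradient equals the mean
    # gradient over the full 512-sample batch.
    (loss / accum_steps).backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Since gradients are summed across backward() calls, only one micro-batch of activations is resident in memory at a time.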
dtsip commented 3 years ago

I would go for the second solution since it is easier to implement (PyTorch makes accumulating gradients quite easy: you just need to call backward() multiple times before each update).

GlebSBrykin commented 3 years ago

What about option 1? It would save more than 1 GB of video memory, i.e. in my case about 1/3 of the total VRAM.
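
Rough arithmetic behind that estimate (assuming fp32 and SGD with momentum): VGG19's three fully connected layers hold 25088×4096 + 4096×4096 + 4096×1000 ≈ 123.6M parameters, i.e. ≈ 494 MB of weights; their gradients take another ≈ 494 MB, so keeping them off the GPU plausibly frees about 1 GB, before even counting momentum buffers.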

dtsip commented 3 years ago

I think that this is significantly trickier to implement. It requires doing a separate forward pass for the linear layers and another for the rest of the network, and then combining the backward passes from each part to perform a model update. It seems much more error-prone.
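
That said, a rough sketch of what option 1 might look like is below; PyTorch's autograd follows tensors across devices, so a device transfer inside forward() keeps a single graph and backward() works end to end. The SplitVGG wrapper is hypothetical, not part of the robustness library, and assumes a CUDA device is available:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class SplitVGG(nn.Module):
    """Convolutional features on the GPU, classifier on the CPU."""

    def __init__(self):
        super().__init__()
        base = vgg19(num_classes=1000)
        self.features = base.features.cuda()     # ~80 MB of conv weights
        self.avgpool = base.avgpool.cuda()
        self.classifier = base.classifier.cpu()  # ~500 MB of FC weights stay in RAM

    def forward(self, x):
        x = self.features(x.cuda())
        x = torch.flatten(self.avgpool(x), 1)
        # The .cpu() transfer is recorded by autograd, so backward()
        # flows from the CPU classifier back into the GPU features.
        return self.classifier(x.cpu())

model = SplitVGG()
out = model(torch.randn(2, 3, 224, 224))
out.sum().backward()  # each parameter's gradient lands on its own device
```

The main costs are the per-batch host/device transfer of activations and the slower FC computation on the CPU.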

GlebSBrykin commented 3 years ago

I decided to start by training something simple, for example SqueezeNet. When I try to start the process on places365, I run into a problem: the program complains about the structure of the dataset. It seems that the downloaded and unpacked directories need to be rearranged somehow, but how?

dtsip commented 3 years ago

The directory structure should mirror that of pytorch-style ImageNet. If I remember correctly, there are download options on the Places website that have a friendlier directory structure.
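
For reference, this is the layout that torchvision's ImageFolder (which pytorch-style ImageNet loaders build on) expects, with one subdirectory per class; the class and file names here are placeholders:

```
places365/
├── train/
│   ├── airfield/
│   │   ├── 00000001.jpg
│   │   └── ...
│   └── airplane_cabin/
│       └── ...
└── val/
    ├── airfield/
    │   └── ...
    └── airplane_cabin/
        └── ...
```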

GlebSBrykin commented 3 years ago

Well, I will try to do so.

andrewilyas commented 3 years ago

Closing this for now, feel free to open another issue with any further questions!