ShichenLiu / CondenseNet

CondenseNet: Light weighted CNN for mobile devices
MIT License

Out of memory issue when training a new dataset #5

Closed vponcelo closed 6 years ago

vponcelo commented 6 years ago

Hi,

I am attempting to use your CondenseNet code to train on a dataset of 7 classes with approximately 100K-150K training images, split (non-equally) across those classes. My images are bounding-box crops of different sizes. I started from a setting similar to the one you use to train on ImageNet, pointing it to my dataset and preparing the class folders so the paths are found properly. I resized all images to 256x256, as you did in your paper. This is the command line I use to train on the new dataset:

python main.py --model condensenet -b 256 -j 28 lima_train --stages 4-6-8-10-8 --growth 8-16-32-64-128 --gpu 0 --resume

where lima_train is a symlink pointing to the folder containing all the training data, split into class subfolders as required.
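For clarity, this is the layout and loading I am assuming; a minimal sketch only (the class names are placeholders and the exact transforms in main.py may differ):

```python
# Expected layout (standard torchvision ImageFolder convention):
#
#   lima_train/
#       class_a/xxx.jpg ...
#       class_b/yyy.jpg ...
#       ...                 (7 class subfolders in total)
import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms

train_dir = 'lima_train'   # the symlink mentioned above
train_set = datasets.ImageFolder(
    train_dir,
    transforms.Compose([
        transforms.RandomCrop(224),         # images were pre-resized to 256x256
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]))
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=256, shuffle=True, num_workers=28, pin_memory=True)
```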

I'm using a datacenter whose GPU nodes have NVIDIA Tesla P100 cards with 16 GB each, and CUDA 8 with cuDNN, so I presumed training would not be a problem. I understand that a 16 GB, or even an 8 GB, GPU should be enough to train this network, shouldn't it? However, I'm getting the out-of-memory error shown below. I reduced the batch size to 64 and adjusted the number of workers to the machine, but I am probably missing some step, or I should modify the command line to match the settings of my data.

I would appreciate any feedback.

Thanks in advance and congratulations for this work.

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "main.py", line 479, in <module>
    main()
  File "main.py", line 239, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "main.py", line 303, in train
    output = model(input_var, progress)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 58, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/CondenseNet/models/condensenet.py", line 127, in forward
    features = self.features(x)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/CondenseNet/models/condensenet.py", line 33, in forward
    x = self.conv_1(x)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/CondenseNet/layers.py", line 42, in forward
    x = self.norm(x)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 37, in forward
    self.training, self.momentum, self.eps)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/functional.py", line 639, in batch_norm
    return f(input, weight, bias)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.cu:66
srun: error: gpu09: task 0: Exited with exit code 1

ShichenLiu commented 6 years ago

Hi, for the ImageNet models we used four 12GB Titan X (Pascal) GPUs to train them. So perhaps you need to change your command to python main.py --model condensenet -b 256 -j 28 lima_train --stages 4-6-8-10-8 --growth 8-16-32-64-128 --gpu 0,1,2,3 --resume (maybe you only need three 16GB GPUs in your case). Hope this helps.
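For context, a toy sketch of the memory argument (not the exact code in main.py, which wraps the constructed CondenseNet the same way, as your traceback's data_parallel.py frame shows):

```python
import torch
import torch.nn as nn

# With --gpu 0,1,2,3 the model is wrapped in DataParallel, which splits each
# batch along dim 0 across the listed devices, so a batch of 256 puts only
# 64 images' worth of activations on each GPU.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())  # toy model
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3]).cuda()

x = torch.randn(256, 3, 224, 224).cuda()   # full batch on GPU 0
y = model(x)                               # scattered: 64 images per GPU
```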

vponcelo commented 6 years ago

Hi @ShichenLiu , thanks a lot for your fast answer. So the out-of-memory error may appear because I am using only 1 GPU for training, right?

ShichenLiu commented 6 years ago

I believe so.

vponcelo commented 6 years ago

Hi @ShichenLiu ,

Now I'm using a shared-node setup with 4 GPUs and --gpu 0,1,2,3.

The problem now is that memory still grows during the initial steps of training with a batch size of 256 images, exceeding the 128 GB of RAM available on the machines I am working with. When reducing the batch size to 128 and then 64, it also failed, just at later iterations within the same initial epoch.

The number of samples in my dataset is, however, much lower than in CIFAR or ImageNet, as I explained above, and I presume 128 GB of RAM should be enough to train this type of network. I would therefore appreciate your thoughts on whether something in my parameter settings could be causing this growing memory usage on a not-so-large dataset.

Cheers,

ShichenLiu commented 6 years ago

Hi @vponcelo ,

In fact the size of the dataset should not affect memory consumption; only the batch size should cause an OOM problem. Could you please provide the network architecture printed right after running python main.py --model condensenet -b 256 -j 20 /PATH/TO/DATASET --stages 4-6-8-10-8 --growth 8-16-32-64-128 --gpu 0,1,2,3 --resume ? I believe that would help us solve the problem more easily.

Thanks

lvdmaaten commented 6 years ago

@vponcelo: You presumably had to make some changes in the data-loading to facilitate training on a different data set. Can you please double-check that you're not keeping all the data in RAM? (That is, that you're not doing, say, torch.load or pickle.load on a 200GB data file.)
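The safe pattern is to keep only file paths in memory and open each image on demand in __getitem__, which is what ImageFolder already does. A rough sketch of the distinction (class and variable names here are hypothetical):

```python
import torch.utils.data as data
from PIL import Image

class LazyBoxDataset(data.Dataset):
    """Keeps only (path, label) pairs in RAM; each image is read lazily."""
    def __init__(self, samples, transform=None):
        self.samples = samples          # list of (path, label) tuples
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        img = Image.open(path).convert('RGB')   # loaded per item, on demand
        if self.transform is not None:
            img = self.transform(img)
        return img, label

# The problematic pattern would be something like
#   images = torch.load('all_training_images.pt')   # pulls everything into RAM
# done once at start-up, which can easily exceed host memory on ~150K crops.
```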

Presuming the above is not the case, this may be an issue where the Python garbage collector does not get invoked for some reason. You could try adding an explicit call to the garbage collector here (add gc.collect()) to see if that helps.
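A generic sketch of what that would look like inside a training loop (the actual place to add it is the line linked above; this is not CondenseNet's exact train() code):

```python
import gc
import torch

def train(train_loader, model, criterion, optimizer, epoch):
    # Usual forward/backward steps, plus an explicit garbage-collection pass
    # at the end of each iteration.
    for i, (input, target) in enumerate(train_loader):
        input_var = torch.autograd.Variable(input.cuda())
        target_var = torch.autograd.Variable(target.cuda())
        output = model(input_var)
        loss = criterion(output, target_var)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        gc.collect()   # free Python-side garbage that may keep large tensors alive
```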

vponcelo commented 6 years ago

Thanks a lot for your very fast and useful answers @ShichenLiu and @lvdmaaten.

I went back to the original OOM problem, which was happening here (before any invocation of the Python garbage collector could take place). In my case the values of input_var were [torch.FloatTensor of size <batchsize>x3x224x224], and I tried several values of <batchsize> (16, 32, 64, 128, 256), all of which raised the OOM problem. I wondered whether I should crop the images to 224x224 rather than using the original bounding boxes resized to 256x256, as you did in your paper, but that did not seem to be the problem at all. Then I tried to find a data-loading problem; however, I realized that loading of very big files only takes place when loading a model (it could also happen when loading very large images, but that is not my case).

After trying several settings, I found that another solution was as simple as capping the number of workers at 26 with -j 26, even though the machine I am using has 28 cores. It seems that two of them are not available, perhaps reserved for system management purposes, or that Python had trouble when asked for the full number of workers. Indeed, I am not sure why this was causing a problem related to exceeding memory limits.
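One plausible factor (just a guess on my side): each DataLoader worker is a separate process that prefetches batches into host RAM, so fewer workers means a smaller resident footprint. A hypothetical way to cap the worker count at what the job actually has available (Linux only):

```python
import os
import torch
import torch.utils.data as data

# On a scheduler like Slurm, the CPUs usable by the process can be fewer
# than the node's physical cores; cap the DataLoader workers accordingly.
available_cpus = len(os.sched_getaffinity(0))
num_workers = min(available_cpus, 26)

# Dummy dataset just to make the sketch runnable; the real one is ImageFolder.
dataset = data.TensorDataset(torch.randn(1024, 3, 224, 224),
                             torch.zeros(1024).long())
loader = data.DataLoader(dataset, batch_size=64, shuffle=True,
                         num_workers=num_workers, pin_memory=True)
```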

In any case, the OOM problem is now solved. I am attaching the output of my trained network so that you can check whether it looks fine before closing this issue. It seems to be working pretty well on my dataset from the first epochs, doesn't it?

Cheers and thank you both again,

output.log

ShichenLiu commented 6 years ago

Great to know that your problem has been solved and the log looks good to me. :-)