MenghaoGuo / AutoDeeplab

PyTorch implementation of the paper "Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation"
https://arxiv.org/abs/1901.02985

memory consumption too large #14

Open lqycrystal opened 5 years ago

lqycrystal commented 5 years ago

When I train the model, memory usage keeps growing with each epoch until it runs out of memory, even though I only use 200 images. Does anyone have a solution?

dagongji10 commented 5 years ago

> When I train the model, memory usage keeps growing with each epoch until it runs out of memory, even though I only use 200 images. Does anyone have a solution?

Hello, I've got the same problem. Have you solved it yet? I'd appreciate your help.

NoamRosenberg commented 5 years ago

Same problem here; it seems to have something to do with the ASPP layer. @dagongji10, check whether the global average pooling and upsample layers are defined in the forward function. That might be it.

NoamRosenberg commented 5 years ago

Okay, looks like I found it. I've been training for several minutes and the memory looks steady.

Add the following to the forward function in auto_deeplab.py:

```python
self.level_2 = []
self.level_4 = []
self.level_8 = []
self.level_16 = []
self.level_32 = []
```

Otherwise the level lists keep accumulating intermediate tensors (and their autograd graphs) across iterations.
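For anyone unfamiliar with why this leaks, here is a minimal, self-contained sketch of the pattern. The class and layer names below are hypothetical, not the repo's actual code; the point is that a list created once in `__init__` and appended to in `forward` retains every stored tensor's computation graph.

```python
import torch
import torch.nn as nn

class LeakyNet(nn.Module):
    """Hypothetical module reproducing the leak pattern (not the repo's class)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.level_4 = []  # bug: created once, grows on every forward call

    def forward(self, x):
        # Fix: re-initialize here so the previous iteration's tensors
        # (and their autograd graphs) can be garbage-collected:
        # self.level_4 = []
        self.level_4.append(self.conv(x))
        return self.level_4[-1]

net = LeakyNet()
for _ in range(3):
    net(torch.randn(1, 3, 16, 16))
print(len(net.level_4))  # 3 with the bug, 1 with the fix applied
```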

dagongji10 commented 5 years ago

@NoamRosenberg Have you changed anything else? I did exactly what you said, but it seems to make no difference. How much memory does the model use during training?

NoamRosenberg commented 5 years ago

@dagongji10 I also took out the ASPP layer, but on its own that didn't change anything. It's possible the problem involves both. I'm checking now.

NoamRosenberg commented 5 years ago

@dagongji10 It runs fine with the ASPP layer, though I'm not sure it's a good idea to define the upsample and global average pooling operations inside the forward function.
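As a general PyTorch guideline (this is a sketch with hypothetical names, not the repo's actual ASPP code): modules that hold parameters should be built once in `__init__`, while parameter-free operations can safely be applied functionally in `forward`.

```python
import torch.nn as nn
import torch.nn.functional as F

class ASPPHead(nn.Module):
    """Hypothetical ASPP tail illustrating where to define each piece."""

    def __init__(self, channels):
        super().__init__()
        # Parameterized layers belong in __init__ so their weights are
        # registered once and trained; creating them inside forward()
        # would allocate fresh, untrained modules on every call.
        self.project = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        size = x.shape[-2:]
        # Parameter-free ops are fine to call functionally here.
        pooled = F.adaptive_avg_pool2d(x, 1)
        pooled = self.project(pooled)
        return F.interpolate(pooled, size=size, mode='bilinear',
                             align_corners=False)
```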

dagongji10 commented 5 years ago

@NoamRosenberg Could you take a look at the memory cost? Maybe I need to get it running first.

lqycrystal commented 5 years ago

> @NoamRosenberg Could you take a look at the memory cost? Maybe I need to get it running first.

Hi, I just moved the following lines in auto_deeplab.py into the forward function:

```python
self.level_2 = []
self.level_4 = []
self.level_8 = []
self.level_16 = []
self.level_32 = []
```

and the memory seems steady. When I train on 200 images, it only costs 7.7 GB.
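If you want to measure the memory cost yourself, PyTorch's CUDA allocator statistics are an easy way to do it (standard `torch.cuda` API, not repo-specific):

```python
import torch

# Call after a few training iterations to see steady-state usage.
device = 0
print(f"currently allocated: {torch.cuda.memory_allocated(device) / 2**30:.2f} GiB")
print(f"peak allocated:      {torch.cuda.max_memory_allocated(device) / 2**30:.2f} GiB")
```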

Charlie4zc commented 5 years ago

```
Traceback (most recent call last):
  File "train_autodeeplab.py", line 318, in <module>
    main()
  File "train_autodeeplab.py", line 311, in main
    trainer.training(epoch)
  File "train_autodeeplab.py", line 110, in training
    self.architect.step(image_search, target_search)
  File "/home/XX/workspace/ADL/architect.py", line 16, in step
    self._backward_step(input_valid, target_valid)
  File "/home/XX/workspace/ADL/architect.py", line 20, in _backward_step
    loss = self.model._loss(input_valid, target_valid)
  File "/home/XX/workspace/ADL/auto_deeplab.py", line 568, in _loss
    logits = self(input)
  File "/home/XX/.conda/envs/ADL/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/XX/workspace/ADL/auto_deeplab.py", line 324, in forward
    concate_feature_map = torch.cat([aspp_result_4, aspp_result_8, aspp_result_16, aspp_result_32], 1)
RuntimeError: CUDA error: out of memory
```

I changed the code as described above, but it still fails, and multi-GPU doesn't work either. I also reduced batch_size and workers to 2 and 1, respectively. BTW, this is on the Pascal dataset with the following settings (may not be accurate due to messy code):

```
Namespace(arch_lr=None, arch_weight_decay=0.001, backbone='resnet', base_size=224, batch_size=4, checkname='deeplab-resnet', crop_size=224, cuda=True, dataset='pascal', epochs=50, eval_interval=1, freeze_bn=False, ft=False, gpu_ids=[0], loss_type='ce', lr=0.007, lr_scheduler='poly', momentum=0.9, nesterov=False, no_cuda=False, no_val=False, out_stride=16, resume=None, seed=1, start_epoch=0, sync_bn=False, test_batch_size=4, use_balanced_weights=False, use_sbd=False, weight_decay=0.0003, workers=4)
```

wb-finalking commented 5 years ago

> @NoamRosenberg Could you take a look at the memory cost? Maybe I need to get it running first.

> Hi, I just moved the `self.level_2 = [] ... self.level_32 = []` initializations in auto_deeplab.py into the forward function, and the memory seems steady. When I train on 200 images, it only costs 7.7 GB.

Hi, it costs about 8 GB of GPU memory with batch_size 1 and input_size 64 on PyTorch 1.0. Is something wrong, or is some configuration different from yours? I didn't change the given parameters.

sorrowyn commented 4 years ago

```
CUDA_VISIBLE_DEVICES=0,1,2 python train_autodeeplab.py --backbone resnet --lr 0.007 --workers 2 --epochs 40 --batch_size 1 --gpu_ids 0,1,2 --eval_interval 1 --base_size 64 --crop_size 64 --dataset cityscapes

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 9.89 GiB already allocated; 10.62 MiB free; 9.98 GiB reserved in total by PyTorch)
```

tuanhui-li commented 4 years ago

> CUDA_VISIBLE_DEVICES=0,1,2 python train_autodeeplab.py --backbone resnet --lr 0.007 --workers 2 --epochs 40 --batch_size 1 --gpu_ids 0,1,2 --eval_interval 1 --base_size 64 --crop_size 64 --dataset cityscapes
>
> RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 9.89 GiB already allocated; 10.62 MiB free; 9.98 GiB reserved in total by PyTorch)

Hi! Have you ever run into the following problem?

```
Traceback (most recent call last):
  File "train_autodeeplab.py", line 301, in <module>
    main()
  File "train_autodeeplab.py", line 294, in main
    trainer.training(epoch)
  File "train_autodeeplab.py", line 110, in training
    output = self.model(image)
  File "/home/lth/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lth/myProject/segmentation/AutoDeeplab-master/auto_deeplab.py", line 167, in forward
    level4_new = self.cells[count](None, self.level_4[-1], weight_cells)
  File "/home/lth/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lth/myProject/segmentation/AutoDeeplab-master/model_search.py", line 63, in forward
    s = sum(self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states) if h is not None)
  File "/home/lth/myProject/segmentation/AutoDeeplab-master/model_search.py", line 63, in <genexpr>
    s = sum(self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states) if h is not None)
TypeError: 'NoneType' object is not callable
```

I tried to run this code, but just hit this bug!
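The error says `self._ops[offset+j]` itself is None, even though the generator expression already skips None states. A hedged guess at a fix (I haven't verified this against the repo's model_search.py; the op list may legitimately contain None entries when a cell input is absent) would be to skip those ops as well:

```python
# Hypothetical guard in model_search.py's cell forward: skip both
# missing states and missing ops rather than only missing states.
s = sum(
    self._ops[offset + j](h, weights[offset + j])
    for j, h in enumerate(states)
    if h is not None and self._ops[offset + j] is not None
)
```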

davidhuangal commented 4 years ago

Has anyone solved this? It looks like the lines

```python
self.level_2 = []
self.level_4 = []
self.level_8 = []
self.level_16 = []
self.level_32 = []
```

are already there, and it still doesn't work.

NdaAzr commented 4 years ago

Has anyone solved this? These lines already exist in the forward function, but it still doesn't work:

```python
self.level_2 = []
self.level_4 = []
self.level_8 = []
self.level_16 = []
self.level_32 = []
```
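If those lists are already re-initialized in `forward` and memory still climbs, another common culprit in PyTorch training loops (a general pattern, not something I've confirmed in this repo's train_autodeeplab.py; `loader`, `model`, and `criterion` below are hypothetical) is accumulating loss tensors with their graphs still attached:

```python
# Accumulating the raw loss tensor keeps every iteration's autograd
# graph alive; convert it to a Python float instead.
running_loss = 0.0
for images, targets in loader:        # hypothetical training loop
    loss = criterion(model(images), targets)
    loss.backward()
    # running_loss += loss            # leaks: retains the whole graph
    running_loss += loss.item()       # safe: detaches to a float
```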