@dkumazaw
Hi, just run git pull to update your repo. I found I had pushed a slightly wrong version; I have rolled it back, and you should have no OOM bugs now.
I tested it on a 16GB card:
→ python train_search.py --batchsz 82 --gpu 1
Experiment dir : exp99
Total GPU mem: 16278 used: 2
allocated mem: 14811.0
reuse mem now ...
01/24 03:00:50 PM GPU device = 1
01/24 03:00:50 PM args = Namespace(arch_lr=0.0003, arch_wd=0.001, batchsz=82, cutout=False, cutout_len=16, data='../data', drop_path_prob=0.3, epochs=50, exp_path='exp99', gpu=1, grad_clip=5, init_ch=16, layers=8, lr=0.025, lr_min=0.001, model_path='saved_models', momentum=0.9, report_freq=50, seed=2, train_portion=0.5, unrolled=True, wd=0.0003)
01/24 03:00:50 PM Total param size = 1.930842 MB
Files already downloaded and verified
01/24 03:00:51 PM
Epoch: 0 lr: 2.500000e-02
01/24 03:00:51 PM Genotype: Genotype(normal=[('avg_pool_3x3', 0), ('dil_conv_5x5', 1), ('dil_conv_3x3', 1), ('dil_conv_5x5', 2), ('max_pool_3x3', 1), ('avg_pool_3x3', 0), ('dil_conv_5x5', 1), ('avg_pool_3x3', 0)], normal_concat=range(2, 6), reduce=[('avg_pool_3x3', 1), ('avg_pool_3x3', 0), ('sep_conv_3x3', 1), ('dil_conv_5x5', 2), ('sep_conv_3x3', 2), ('avg_pool_3x3', 3), ('max_pool_3x3', 4), ('dil_conv_5x5', 0)], reduce_concat=range(2, 6))
01/24 03:01:00 PM Step:000 loss:2.356416 acc1:7.317073 acc5:53.658539
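(For reference, here is a minimal sketch of what the "Total GPU mem / allocated mem / reuse mem now" lines above likely correspond to: a pre-allocation step that grabs most of the free memory once so PyTorch's caching allocator can reuse it afterwards. The function names, the nvidia-smi query, and the reserve_mb headroom are illustrative assumptions, not the repo's actual code.)

import subprocess
import torch

def query_gpu_mem_mb(gpu_id):
    # Ask nvidia-smi for total/used memory (in MiB) on one GPU.
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.total,memory.used',
         '--format=csv,noheader,nounits', '--id=%d' % gpu_id])
    total, used = (int(v) for v in out.decode().strip().split(','))
    return total, used

def preallocate(gpu_id, reserve_mb=1500):
    total, used = query_gpu_mem_mb(gpu_id)
    print('Total GPU mem:', total, 'used:', used)
    alloc_mb = total - used - reserve_mb  # leave headroom for the CUDA context
    print('allocated mem:', float(alloc_mb))
    # Grab one large float32 block, then release it; the caching allocator
    # keeps the pages, so later tensors reuse them.
    block = torch.empty(int(alloc_mb) * 1024 * 1024 // 4,
                        dtype=torch.float32, device='cuda:%d' % gpu_id)
    del block
    print('reuse mem now ...')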
For safety, I also tested it on an 8GB card:
o@m:~/arc/automl/DARTS-PyTorch/cnn$ python train_search.py --batchsz 16
Experiment dir : exp99
Total GPU mem: 4040 used: 772
01/24 03:16:57 PM GPU device = 0
01/24 03:16:57 PM args = Namespace(arch_lr=0.0003, arch_wd=0.001, batchsz=16, cutout=False, cutout_len=16, data='../data', drop_path_prob=0.3, epochs=50, exp_path='exp99', gpu=0, grad_clip=5, init_ch=16, layers=8, lr=0.025, lr_min=0.001, model_path='saved_models', momentum=0.9, report_freq=50, seed=2, train_portion=0.5, unrolled=True, wd=0.0003)
01/24 03:16:59 PM Total param size = 1.930842 MB
Files already downloaded and verified
01/24 03:17:00 PM
Epoch: 0 lr: 2.500000e-02
01/24 03:17:00 PM Genotype: Genotype(normal=[('avg_pool_3x3', 0), ('dil_conv_5x5', 1), ('dil_conv_3x3', 1), ('dil_conv_5x5', 2), ('max_pool_3x3', 1), ('avg_pool_3x3', 0), ('dil_conv_5x5', 1), ('avg_pool_3x3', 0)], normal_concat=range(2, 6), reduce=[('avg_pool_3x3', 1), ('avg_pool_3x3', 0), ('sep_conv_3x3', 1), ('dil_conv_5x5', 2), ('sep_conv_3x3', 2), ('avg_pool_3x3', 3), ('max_pool_3x3', 4), ('dil_conv_5x5', 0)], reduce_concat=range(2, 6))
01/24 03:17:08 PM Step:000 loss:2.307790 acc1:18.750000 acc5:62.500000
Let me know if it works for you.
Hi, thanks for the reply! Yeah, I was able to run your code, and here's what the memory consumption looks like:
python train_search.py --batchsz=64
Compared to running the official implementation with the same batch size, the consumption is about 1GB larger. This is actually what I am observing in my own implementation as well, so I'm just wondering what's causing this difference...
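(If it helps with pinning down the ~1GB gap, here is a small sketch, not taken from either code base, that could be dropped in right after the first search step in both implementations to compare what PyTorch's allocator itself reports rather than relying on nvidia-smi alone:)

import torch

def log_cuda_mem(tag, device=0):
    # Allocator-level numbers in MiB: live tensors, cached blocks, and peak so far.
    mb = 1024 ** 2
    print('%s: allocated=%.0f MiB  reserved=%.0f MiB  peak=%.0f MiB' % (
        tag,
        torch.cuda.memory_allocated(device) / mb,
        torch.cuda.memory_reserved(device) / mb,
        torch.cuda.max_memory_allocated(device) / mb))

# e.g. calling log_cuda_mem('after step 0') at the same point in both repos
# makes it easy to see whether the extra ~1GB is live tensors or just cache.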
@dkumazaw It's good that you can run it now.
So did you figure out why my version works? (1GB of extra memory won't be a critical issue for me currently~~)
Yes yes, it works fine, so I will close the issue. I'm just wondering what is causing the difference (maybe some subtle differences between versions... I will definitely let you know when I figure it out). Thanks again!
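(Since "subtle differences between versions" is a plausible culprit, here is a quick sketch one could run on both machines to record the versions that most often shift GPU memory usage; this is a generic check, not something from the repo:)

import torch

print('torch :', torch.__version__)
print('cuda  :', torch.version.cuda)
print('cudnn :', torch.backends.cudnn.version())
print('cudnn benchmark:', torch.backends.cudnn.benchmark)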
Thanks again for the nice migration! I've tried running your code in my environment, but it seems like I get an OOM even when I run train_search with a smaller batch size. It looks like the memory consumption spikes at the beginning and then settles into smaller usage later on...
[GPU memory-usage screenshots omitted]
I'm not sure what could be causing this issue.
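(One way to confirm whether the spike is only an initial transient is to print and reset the allocator's peak after every search step; the helper below is a sketch with an invented name, not part of the repo:)

import torch

def report_peak(step, device=0):
    # Peak allocation since the last reset, so each iteration's own spike is visible.
    peak_mb = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    print('step %d: peak allocated %.0f MiB' % (step, peak_mb))
    torch.cuda.reset_max_memory_allocated(device)

# Calling report_peak(step) at the end of each iteration in train_search.py
# would show whether only the first few steps stand out (e.g. one-off cuDNN
# workspace allocation) or whether the high usage persists.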