@dkumazaw
Hi, just run git pull to update your repo. I found I had pushed a slightly wrong version; I have rolled it back, and you should have no OOM bugs now.
I tested it on a 16GB card:
→ python train_search.py --batchsz 82 --gpu 1
Experiment dir : exp99
Total GPU mem: 16278 used: 2
allocated mem: 14811.0
reuse mem now ...
01/24 03:00:50 PM GPU device = 1
01/24 03:00:50 PM args = Namespace(arch_lr=0.0003, arch_wd=0.001, batchsz=82, cutout=False, cutout_len=16, data='../data', drop_path_prob=0.3, epochs=50, exp_path='exp99', gpu=1, grad_clip=5, init_ch=16, layers=8, lr=0.025, lr_min=0.001, model_path='saved_models', momentum=0.9, report_freq=50, seed=2, train_portion=0.5, unrolled=True, wd=0.0003)
01/24 03:00:50 PM Total param size = 1.930842 MB
Files already downloaded and verified
01/24 03:00:51 PM
Epoch: 0 lr: 2.500000e-02
01/24 03:00:51 PM Genotype: Genotype(normal=[('avg_pool_3x3', 0), ('dil_conv_5x5', 1), ('dil_conv_3x3', 1), ('dil_conv_5x5', 2), ('max_pool_3x3', 1), ('avg_pool_3x3', 0), ('dil_conv_5x5', 1), ('avg_pool_3x3', 0)], normal_concat=range(2, 6), reduce=[('avg_pool_3x3', 1), ('avg_pool_3x3', 0), ('sep_conv_3x3', 1), ('dil_conv_5x5', 2), ('sep_conv_3x3', 2), ('avg_pool_3x3', 3), ('max_pool_3x3', 4), ('dil_conv_5x5', 0)], reduce_concat=range(2, 6))
01/24 03:01:00 PM Step:000 loss:2.356416 acc1:7.317073 acc5:53.658539
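(For reference, here is a minimal sketch of what the "Total GPU mem / allocated mem / reuse mem now" lines above likely correspond to: a pre-allocation step that grabs most of the free memory once so PyTorch's caching allocator can reuse it afterwards. The function names, the nvidia-smi query, and the reserve_mb headroom are illustrative assumptions, not the repo's actual code.)

import subprocess
import torch

def query_gpu_mem_mb(gpu_id):
    # Ask nvidia-smi for total/used memory (in MiB) on one GPU.
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.total,memory.used',
         '--format=csv,noheader,nounits', '--id=%d' % gpu_id])
    total, used = (int(v) for v in out.decode().strip().split(','))
    return total, used

def preallocate(gpu_id, reserve_mb=1500):
    total, used = query_gpu_mem_mb(gpu_id)
    print('Total GPU mem:', total, 'used:', used)
    alloc_mb = total - used - reserve_mb  # leave headroom for the CUDA context
    print('allocated mem:', float(alloc_mb))
    # Grab one large float32 block, then release it; the caching allocator
    # keeps the pages, so later tensors reuse them.
    block = torch.empty(int(alloc_mb) * 1024 * 1024 // 4,
                        dtype=torch.float32, device='cuda:%d' % gpu_id)
    del block
    print('reuse mem now ...')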
For safety, I also tested it on an 8GB card:
o@m:~/arc/automl/DARTS-PyTorch/cnn$ python train_search.py --batchsz 16
Experiment dir : exp99
Total GPU mem: 4040 used: 772
01/24 03:16:57 PM GPU device = 0
01/24 03:16:57 PM args = Namespace(arch_lr=0.0003, arch_wd=0.001, batchsz=16, cutout=False, cutout_len=16, data='../data', drop_path_prob=0.3, epochs=50, exp_path='exp99', gpu=0, grad_clip=5, init_ch=16, layers=8, lr=0.025, lr_min=0.001, model_path='saved_models', momentum=0.9, report_freq=50, seed=2, train_portion=0.5, unrolled=True, wd=0.0003)
01/24 03:16:59 PM Total param size = 1.930842 MB
Files already downloaded and verified
01/24 03:17:00 PM
Epoch: 0 lr: 2.500000e-02
01/24 03:17:00 PM Genotype: Genotype(normal=[('avg_pool_3x3', 0), ('dil_conv_5x5', 1), ('dil_conv_3x3', 1), ('dil_conv_5x5', 2), ('max_pool_3x3', 1), ('avg_pool_3x3', 0), ('dil_conv_5x5', 1), ('avg_pool_3x3', 0)], normal_concat=range(2, 6), reduce=[('avg_pool_3x3', 1), ('avg_pool_3x3', 0), ('sep_conv_3x3', 1), ('dil_conv_5x5', 2), ('sep_conv_3x3', 2), ('avg_pool_3x3', 3), ('max_pool_3x3', 4), ('dil_conv_5x5', 0)], reduce_concat=range(2, 6))
01/24 03:17:08 PM Step:000 loss:2.307790 acc1:18.750000 acc5:62.500000
Let me know if it works for you.
Hi, thanks for the reply! Yeah, I was able to run your code, and here's what the memory consumption looks like:
python train_search.py --batchsz=64
Compared to running the official implementation with the same batch size, the consumption is about 1GB larger. This is actually what I am observing in my own implementation as well, so I'm just wondering what's causing this difference...
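(If it helps with pinning down the ~1GB gap, here is a small sketch, not taken from either code base, that could be dropped in right after the first search step in both implementations to compare what PyTorch's allocator itself reports rather than relying on nvidia-smi alone:)

import torch

def log_cuda_mem(tag, device=0):
    # Allocator-level numbers in MiB: live tensors, cached blocks, and peak so far.
    mb = 1024 ** 2
    print('%s: allocated=%.0f MiB  reserved=%.0f MiB  peak=%.0f MiB' % (
        tag,
        torch.cuda.memory_allocated(device) / mb,
        torch.cuda.memory_reserved(device) / mb,
        torch.cuda.max_memory_allocated(device) / mb))

# e.g. calling log_cuda_mem('after step 0') at the same point in both repos
# makes it easy to see whether the extra ~1GB is live tensors or just cache.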
@dkumazaw It's good that you can run it now.
So did you figure out why my version works? (1GB of extra memory won't be a critical issue for me currently~~)
Yes yes, it works fine, so I will close the issue. I'm just wondering what is causing the difference (maybe some subtle differences between versions... I will definitely let you know when I figure it out). Thanks again!
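(Since "subtle differences between versions" is a plausible culprit, here is a quick sketch one could run on both machines to record the versions that most often shift GPU memory usage; this is a generic check, not something from the repo:)

import torch

print('torch :', torch.__version__)
print('cuda  :', torch.version.cuda)
print('cudnn :', torch.backends.cudnn.version())
print('cudnn benchmark:', torch.backends.cudnn.benchmark)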
Thanks again for the nice migration! I've tried running your code in my environment, but it seems like I get an OOM even when I run train_search with a smaller batch size. It looks like the memory consumption spikes at the beginning and then settles into smaller usage later on...
[GPU memory-usage screenshots omitted]
I'm not sure what could be causing this issue.
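(One way to confirm whether the spike is only an initial transient is to print and reset the allocator's peak after every search step; the helper below is a sketch with an invented name, not part of the repo:)

import torch

def report_peak(step, device=0):
    # Peak allocation since the last reset, so each iteration's own spike is visible.
    peak_mb = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    print('step %d: peak allocated %.0f MiB' % (step, peak_mb))
    torch.cuda.reset_max_memory_allocated(device)

# Calling report_peak(step) at the end of each iteration in train_search.py
# would show whether only the first few steps stand out (e.g. one-off cuDNN
# workspace allocation) or whether the high usage persists.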