diux-dev / cluster

train on AWS

ImageNet: changing batch-size affects point when lr switches #41

Closed. yaroslavvb closed this issue 6 years ago.

yaroslavvb commented 6 years ago

This is either an unwanted interaction or a measurement side-effect (right now the x-axis doesn't properly count incomplete batches; see this line: https://github.com/diux-dev/cluster/blob/13fb813d4a3a96a6c27bda1d5bee979d0d3d125e/pytorch/training/train_imagenet_nv.py#L466). Ideally it should be possible to change the batch size without affecting other parameters like the schedule.
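For context, a minimal sketch (not the training script's actual bookkeeping) of the measurement side-effect being described: if the x-axis is derived from a count of complete batches, it drifts slightly from the true examples-seen position whenever the dataset size is not a multiple of the batch size. The dataset size and batch sizes below are assumptions for illustration.

```python
# Hypothetical illustration only, not the repo's code.
DATASET_SIZE = 1_281_167  # ImageNet-1k train set size

def progress_from_full_batches(step, batch_size):
    """Epoch position if only complete batches are counted (remainder dropped)."""
    return step / (DATASET_SIZE // batch_size)

def progress_from_examples(step, batch_size):
    """Epoch position based on examples actually processed."""
    return step * batch_size / DATASET_SIZE

for bs in (256, 512):
    steps = 30 * (DATASET_SIZE // bs)  # roughly 30 epochs of optimizer steps
    print(f"bs={bs}: batch-counted={progress_from_full_batches(steps, bs):.3f} "
          f"example-counted={progress_from_examples(steps, bs):.3f}")
```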

cc @bearpelican

TensorBoard link

[screenshot: 2018-08-01 19:30:05]

bearpelican commented 6 years ago

Oh, this drop in learning rate is actually intentional. For some reason I found that when we set the batch norm weights to 0, we get better results by scaling up to 2x the learning rate before dropping back down: https://github.com/diux-dev/cluster/blob/master/pytorch/training/train_imagenet_nv.py#L187

Will need to do more tests to understand why this is so. But to do the traditional warmup, with no drop, disable the `--init-bn0` flag.
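For reference, a minimal sketch of what a flag like `--init-bn0` typically does, assuming torchvision's ResNet block layout (the repo's implementation at the linked line may differ): the weight (gamma) of the last BatchNorm in each residual block is zeroed, so each block initially behaves like an identity mapping.

```python
# Sketch only: zero-init the last BN weight (gamma) in each residual block.
# Attribute names (bn2, bn3) follow torchvision's ResNet block definitions.
import torch.nn as nn
import torchvision.models as models

def init_bn0(model: nn.Module) -> None:
    for m in model.modules():
        if isinstance(m, models.resnet.Bottleneck):
            nn.init.zeros_(m.bn3.weight)   # last BN in a bottleneck block
        elif isinstance(m, models.resnet.BasicBlock):
            nn.init.zeros_(m.bn2.weight)   # last BN in a basic block

model = models.resnet50()
init_bn0(model)
```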