bin/train maestro -p batch_size 12 -p batch_norm True -p learning_rate 0.05 -p max_epochs 12 -p sample_overlap_receptive_field True
Due to the averaging of gradients across DDP workers, we have to be careful not to effectively halve the learning rate. This should not happen, though, since cross entropy (xent) uses mean reduction by default. Nevertheless, also training with a doubled learning rate:
bin/train maestro -p batch_size 12 -p batch_norm True -p learning_rate 0.1 -p max_epochs 12 -p sample_overlap_receptive_field True
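For reference, a minimal sketch (with made-up tensor shapes, not taken from the trainer) of why mean reduction matters here: each worker's cross-entropy loss is already averaged over its local batch, and DDP then averages the resulting gradients across workers, so the gradient keeps the same scale as a single-process run rather than being implicitly halved.

```python
import torch
import torch.nn as nn

# nn.CrossEntropyLoss defaults to reduction='mean': each DDP worker's loss is
# already an average over its local batch. DDP then averages gradients across
# workers, so the combined gradient has the same scale as a single-process
# run and the effective learning rate is not halved.
logits = torch.randn(12, 10)            # hypothetical: batch of 12, 10 classes
targets = torch.randint(0, 10, (12,))

xent_mean = nn.CrossEntropyLoss()                   # default reduction='mean'
xent_sum = nn.CrossEntropyLoss(reduction='sum')

assert torch.allclose(xent_mean(logits, targets),
                      xent_sum(logits, targets) / logits.shape[0])
```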
What
Use nn.DistributedDataParallel instead of nn.DataParallel.
Why
The immediate motivation is to use SyncBatchNorm, which is only available under DDP. DDP is also the recommended data-parallel approach in PyTorch anyway. As it turns out, the machinery that has to be added is pretty minimal when forking the worker processes.
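To give a sense of scale, here is a minimal sketch of that machinery (not the project's actual trainer): one process per worker started via the fork start method, BatchNorm layers converted to SyncBatchNorm, and the model wrapped in DDP. The gloo backend, world size of 2, and toy model are placeholders; on GPUs one would typically use the nccl backend with one device per process (and CUDA must not be initialized before forking).

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Sequential(
        torch.nn.Conv1d(1, 8, 3), torch.nn.BatchNorm1d(8), torch.nn.ReLU()
    )
    # SyncBatchNorm only works under DDP; convert before wrapping.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = DDP(model)  # pass device_ids=[rank] when using one GPU per process

    # ... training loop with a DistributedSampler goes here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.start_processes(worker, args=(world_size,), nprocs=world_size,
                       start_method="fork")
```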
Acceptance Criteria