I have changed DP to DDP (`torch.nn.parallel.DistributedDataParallel`), which gives roughly an 18% training speedup on a single node with multiple GPUs.
Note that the training script should now be launched through `torch.distributed.launch`, e.g.:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 train.py \
    --batch-size 32 \
    --cfg cfg/yolov3_1088x608.cfg
```
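For context, the DP-to-DDP change amounts to wrapping the model in `DistributedDataParallel` after initializing a process group. Below is a minimal sketch of that pattern, not the repo's actual code: the `wrap_in_ddp` helper, the toy `Linear` model, and the CPU `gloo` backend are stand-ins (on GPU you would use `nccl` and pass `device_ids=[local_rank]`).

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_in_ddp(model):
    # torch.distributed.launch sets RANK/WORLD_SIZE for each spawned
    # process; the defaults below let this sketch also run standalone
    # as a single process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
    # Unlike nn.DataParallel (one process scattering batches across GPUs),
    # DDP runs one process per GPU and all-reduces gradients, which is
    # where the training speedup comes from.
    return DDP(model)

# Toy stand-in for the YOLOv3 model.
ddp_model = wrap_in_ddp(torch.nn.Linear(4, 2))
out = ddp_model(torch.randn(3, 4))
dist.destroy_process_group()
```

With DDP, `--batch-size` is consumed per process, so the effective global batch is the per-process value times the number of launched processes.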
Besides, during validation, the parameters of `test` and `test_embed` do not match between the function definitions and their call sites (see issue #147).