I have changed DP to DDP (`torch.nn.parallel.DistributedDataParallel`), which gives roughly an 18% training speedup on a single node with multiple GPUs.
Note that the training script should now be launched through `torch.distributed.launch`, e.g.:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 train.py \
    --batch-size 32 \
    --cfg cfg/yolov3_1088x608.cfg
```
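For context, the DP-to-DDP change amounts to wrapping the model in `DistributedDataParallel` after initializing a process group. Below is a minimal sketch of that pattern, not the repo's actual code: the `wrap_in_ddp` helper, the toy `Linear` model, and the CPU `gloo` backend are stand-ins (on GPU you would use `nccl` and pass `device_ids=[local_rank]`).

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_in_ddp(model):
    # torch.distributed.launch sets RANK/WORLD_SIZE for each spawned
    # process; the defaults below let this sketch also run standalone
    # as a single process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
    # Unlike nn.DataParallel (one process scattering batches across GPUs),
    # DDP runs one process per GPU and all-reduces gradients, which is
    # where the training speedup comes from.
    return DDP(model)

# Toy stand-in for the YOLOv3 model.
ddp_model = wrap_in_ddp(torch.nn.Linear(4, 2))
out = ddp_model(torch.randn(3, 4))
dist.destroy_process_group()
```

With DDP, `--batch-size` is consumed per process, so the effective global batch is the per-process value times the number of launched processes.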
Besides, during validation, the parameters of `test` and `test_embed` do not match between the function definitions and their call sites (see issue #147).