slinghe0321 opened this issue 3 years ago
I also noticed this, and I consider it a bug.
I guess the problem is that multi-GPU training changes the relative weighting between batches: batches on different GPUs are simply averaged, while batches on the same GPU are weighted depending on num_gt, and some batches are skipped.
I have not debugged this properly, because I am not that familiar with the multi-GPU training APIs.
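To make the issue concrete, here is a minimal numeric sketch (a toy example of mine, not code from the repo) showing that averaging per-GPU weighted means is not the same as one global weighted mean when the per-GPU weight sums (driven by num_gt) differ:

import torch

# Hypothetical per-sample regression losses and weights, split across 2 GPUs.
losses_gpu0, weights_gpu0 = torch.tensor([1.0, 2.0]), torch.tensor([1.0, 1.0])
losses_gpu1, weights_gpu1 = torch.tensor([4.0]), torch.tensor([3.0])

# Single-GPU behaviour: one weighted mean over the whole batch.
all_losses = torch.cat([losses_gpu0, losses_gpu1])
all_weights = torch.cat([weights_gpu0, weights_gpu1])
global_mean = (all_weights * all_losses).sum() / (all_weights.sum() + 1e-6)

# Multi-GPU (DDP) behaviour: each GPU normalizes by its own weight sum,
# then the per-GPU losses/gradients are simply averaged.
mean_gpu0 = (weights_gpu0 * losses_gpu0).sum() / (weights_gpu0.sum() + 1e-6)
mean_gpu1 = (weights_gpu1 * losses_gpu1).sum() / (weights_gpu1.sum() + 1e-6)
ddp_mean = (mean_gpu0 + mean_gpu1) / 2

print(global_mean.item())  # ~3.00: the heavily weighted sample dominates
print(ddp_mean.item())     # ~2.75: per-GPU normalization shrinks its influence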
I changed
weighted_regression_losses = torch.sum(weights * reg_loss / (torch.sum(weights) + 1e-6), dim=0)
into
weight_sum = torch.sum(weights)
if torch.distributed.is_initialized():
    N = torch.distributed.get_world_size()
    torch.distributed.all_reduce(weight_sum)
    reg_loss = reg_loss * N
weighted_regression_losses = torch.sum(weights * reg_loss / (weight_sum + 1e-6), dim=0)
and halved the batch size. Empirically, the gap gets smaller, but it still exists.
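For reference, here is the same change packaged as a standalone helper (the function name and wrapping are mine; the original applies it inline). The idea is that DDP averages gradients over N ranks, so multiplying the local loss by N and dividing by the globally reduced weight sum makes the result match sum(w * l) / sum(w) computed over the full multi-GPU batch:

import torch
import torch.distributed as dist

def globally_weighted_loss(reg_loss, weights, eps=1e-6):
    # Weighted regression loss normalized by the weight sum over ALL ranks.
    weight_sum = torch.sum(weights)
    if dist.is_available() and dist.is_initialized():
        world_size = dist.get_world_size()
        dist.all_reduce(weight_sum)       # sum of weights across every GPU
        reg_loss = reg_loss * world_size  # cancel DDP's 1/N gradient averaging
    return torch.sum(weights * reg_loss / (weight_sum + eps), dim=0)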
Does multi-GPU training affect the training of mono_depth?
In my tests, depth prediction is fine with multi-GPU training.
For now, in the new update, which uses the distributed sampler from detectron2, we are able to train with multiple GPUs and obtain reasonable performance.
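For anyone reproducing this, here is a minimal sketch of what a detectron2-style distributed sampler setup can look like (the dummy dataset, batch size, and the choice of TrainingSampler are my assumptions; check the repo's actual dataloader for the exact wiring):

import torch
from torch.utils.data import DataLoader, Dataset
from detectron2.data.samplers import TrainingSampler

class DummyDataset(Dataset):
    # Stand-in for the real KITTI training dataset.
    def __len__(self):
        return 1000
    def __getitem__(self, idx):
        return torch.zeros(3, 288, 1280), idx  # fake image tensor + index

dataset = DummyDataset()

# TrainingSampler yields an infinite shuffled stream of indices and shards it
# across ranks, so each GPU sees a disjoint slice of the data.
sampler = TrainingSampler(len(dataset))
loader = DataLoader(dataset, batch_size=8, sampler=sampler, num_workers=2)

# Because the sampler is infinite, training is driven by iteration count,
# not epochs.
for it, (images, indices) in zip(range(100), loader):
    pass  # forward / backward / optimizer step goes here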
Without tuning the learning rate and batch size, the result goes like this:
Car AP(Average Precision)@0.70, 0.70, 0.70:
bbox AP:97.24, 86.90, 67.03
bev AP:29.68, 20.48, 15.73
3d AP:21.56, 15.00, 11.16
aos AP:96.23, 84.25, 64.92
Car AP(Average Precision)@0.70, 0.50, 0.50:
bbox AP:97.24, 86.90, 67.03
bev AP:65.20, 46.35, 35.98
3d AP:58.84, 41.06, 32.49
aos AP:96.23, 84.25, 64.92
Hi, thanks for your great work! I have trained the GroundAwareYolo3D model and got the results below:
Car AP(Average Precision)@0.70, 0.70, 0.70:
bbox AP: 97.29, 84.55, 64.65
bev AP: 29.53, 20.15, 15.53
3d AP: 22.90, 15.26, 11.33
aos AP: 96.52, 82.52, 63.05
This seems comparable with the result reported in the paper (23.63, 16.16, 12.06) for Car 3d AP@0.70 on the validation set.
However, if training with multiple GPUs, e.g. 4 GPUs, we get a poorer result, as below:
Car AP(Average Precision)@0.70, 0.70, 0.70:
bbox AP: 97.08, 86.41, 66.67
bev AP: 20.56, 15.16, 11.22
3d AP: 15.17, 10.81, 8.22
aos AP: 95.50, 83.36, 64.24
training commands:
bash ./launchers/train.sh config/$CONFIG_FILE.py 0,1,2,3 multi-gpu-train
bash ./launchers/train.sh config/$CONFIG_FILE.py 0 single-gpu-train
I trained twice with multi-GPU, and both results were similar to each other and lower than the single-GPU result. Do you have any suggestions for this case? What is your multi-GPU training performance?