slinghe0321 opened this issue 3 years ago
I also noticed this, and I consider it a bug.
I guess the problem is that multi-GPU training changes the relative weighting between batches: batches on different GPUs are simply averaged, while batches on the same GPU are weighted depending on num_gt, and some batches are skipped.
I have not debugged this properly, because I am not that familiar with the multi-GPU training APIs.
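To make the issue concrete, here is a minimal numeric sketch (a toy example of mine, not code from the repo) showing that averaging per-GPU weighted means is not the same as one global weighted mean when the per-GPU weight sums (driven by num_gt) differ:

import torch

# Hypothetical per-sample regression losses and weights, split across 2 GPUs.
losses_gpu0, weights_gpu0 = torch.tensor([1.0, 2.0]), torch.tensor([1.0, 1.0])
losses_gpu1, weights_gpu1 = torch.tensor([4.0]), torch.tensor([3.0])

# Single-GPU behaviour: one weighted mean over the whole batch.
all_losses = torch.cat([losses_gpu0, losses_gpu1])
all_weights = torch.cat([weights_gpu0, weights_gpu1])
global_mean = (all_weights * all_losses).sum() / (all_weights.sum() + 1e-6)

# Multi-GPU (DDP) behaviour: each GPU normalizes by its own weight sum,
# then the per-GPU losses/gradients are simply averaged.
mean_gpu0 = (weights_gpu0 * losses_gpu0).sum() / (weights_gpu0.sum() + 1e-6)
mean_gpu1 = (weights_gpu1 * losses_gpu1).sum() / (weights_gpu1.sum() + 1e-6)
ddp_mean = (mean_gpu0 + mean_gpu1) / 2

print(global_mean.item())  # ~3.00: the heavily weighted sample dominates
print(ddp_mean.item())     # ~2.75: per-GPU normalization shrinks its influence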
I changed
weighted_regression_losses = torch.sum(weights * reg_loss / (torch.sum(weights) + 1e-6), dim=0)
into
weight_sum = torch.sum(weights)
if torch.distributed.is_initialized():
    N = torch.distributed.get_world_size()
    torch.distributed.all_reduce(weight_sum)
    reg_loss = reg_loss * N
weighted_regression_losses = torch.sum(weights * reg_loss / (weight_sum + 1e-6), dim=0)
and halved the batch size. Empirically, the gap gets smaller, but it still exists.
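For reference, here is the same change packaged as a standalone helper (the function name and wrapping are mine; the original applies it inline). The idea is that DDP averages gradients over N ranks, so multiplying the local loss by N and dividing by the globally reduced weight sum makes the result match sum(w * l) / sum(w) computed over the full multi-GPU batch:

import torch
import torch.distributed as dist

def globally_weighted_loss(reg_loss, weights, eps=1e-6):
    # Weighted regression loss normalized by the weight sum over ALL ranks.
    weight_sum = torch.sum(weights)
    if dist.is_available() and dist.is_initialized():
        world_size = dist.get_world_size()
        dist.all_reduce(weight_sum)       # sum of weights across every GPU
        reg_loss = reg_loss * world_size  # cancel DDP's 1/N gradient averaging
    return torch.sum(weights * reg_loss / (weight_sum + eps), dim=0)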
Does multi-GPU training affect the training of mono_depth?
In my tests, depth prediction is fine with multi-GPU training.
For now, in the new update, which uses the distributed sampler from detectron2, we are able to train with multiple GPUs and obtain reasonable performance.
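For anyone reproducing this, here is a minimal sketch of what a detectron2-style distributed sampler setup can look like (the dummy dataset, batch size, and the choice of TrainingSampler are my assumptions; check the repo's actual dataloader for the exact wiring):

import torch
from torch.utils.data import DataLoader, Dataset
from detectron2.data.samplers import TrainingSampler

class DummyDataset(Dataset):
    # Stand-in for the real KITTI training dataset.
    def __len__(self):
        return 1000
    def __getitem__(self, idx):
        return torch.zeros(3, 288, 1280), idx  # fake image tensor + index

dataset = DummyDataset()

# TrainingSampler yields an infinite shuffled stream of indices and shards it
# across ranks, so each GPU sees a disjoint slice of the data.
sampler = TrainingSampler(len(dataset))
loader = DataLoader(dataset, batch_size=8, sampler=sampler, num_workers=2)

# Because the sampler is infinite, training is driven by iteration count,
# not epochs.
for it, (images, indices) in zip(range(100), loader):
    pass  # forward / backward / optimizer step goes here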
Without tuning the learning rate and batch size, the result goes like this:
Car AP(Average Precision)@0.70, 0.70, 0.70:
bbox AP:97.24, 86.90, 67.03
bev AP:29.68, 20.48, 15.73
3d AP:21.56, 15.00, 11.16
aos AP:96.23, 84.25, 64.92
Car AP(Average Precision)@0.70, 0.50, 0.50:
bbox AP:97.24, 86.90, 67.03
bev AP:65.20, 46.35, 35.98
3d AP:58.84, 41.06, 32.49
aos AP:96.23, 84.25, 64.92
Hi, thanks for your great work! I have trained the GroundAwareYolo3D model and got the results below:
Car AP(Average Precision)@0.70, 0.70, 0.70:
bbox AP: 97.29, 84.55, 64.65
bev AP: 29.53, 20.15, 15.53
3d AP: 22.90, 15.26, 11.33
aos AP: 96.52, 82.52, 63.05
This seems comparable with the result reported in the paper (23.63, 16.16, 12.06) for Car 3d AP@0.70 on the validation set.
However, if training with multiple GPUs, e.g. 4 GPUs, we get a poorer result, as below:
Car AP(Average Precision)@0.70, 0.70, 0.70:
bbox AP: 97.08, 86.41, 66.67
bev AP: 20.56, 15.16, 11.22
3d AP: 15.17, 10.81, 8.22
aos AP: 95.50, 83.36, 64.24
training commands:
bash ./launchers/train.sh config/$CONFIG_FILE.py 0,1,2,3 multi-gpu-train
bash ./launchers/train.sh config/$CONFIG_FILE.py 0 single-gpu-train
I trained twice with multi-GPU, and both results were similar to each other and lower than the single-GPU result. Do you have any suggestions for this case? What is your multi-GPU training performance?