Megvii-BaseDetection / BEVDepth

Official code for BEVDepth.

How to do distributed training on multiple machines? #108

Open foolhard opened 1 year ago

foolhard commented 1 year ago

Hello,

Thanks for your great work and for open-sourcing the code.

I tested your code on a 4-GPU server and the result is 0.3 lower than yours. I want to train the model on 2 servers with 4 GPUs each to see whether I can match your results.

Does this repo support distributed training on multiple nodes? If so, how?

Best regards.

pummi823 commented 1 year ago

Hi, I want to ask about the "0.3 lower" result: did you get it by directly using the checkpoint the author provided, or by retraining the model yourself? Best wishes~~

yinchimaoliang commented 1 year ago

> Hello,
>
> Thanks for your great work and for open-sourcing the code.
>
> I tested your code on a 4-GPU server and the result is 0.3 lower than yours. I want to train the model on 2 servers with 4 GPUs each to see whether I can match your results.
>
> Does this repo support distributed training on multiple nodes? If so, how?
>
> Best regards.

Hi there, right now our codebase doesn't support multi-machine training, and we are working on it. If you have any ideas, you're welcome to contribute to this codebase.
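For readers looking for a starting point: below is a generic, hedged sketch of what a multi-node DDP launch looks like with PyTorch Lightning (which the `--amp_backend`/`--gpus` flags in this thread suggest the training scripts are built on). It is not this repo's CLI and not something BEVDepth currently supports; the `Trainer` arguments are standard Lightning options, and the launch mechanics are assumptions.

```python
# Generic multi-node DDP sketch with PyTorch Lightning -- illustrative only,
# NOT this repo's launcher. Assumes the same script is started on every node
# with MASTER_ADDR, MASTER_PORT and NODE_RANK exported in the environment.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=4,          # GPUs per node
    num_nodes=2,     # number of machines; world size = gpus * num_nodes = 8
    strategy="ddp",  # DistributedDataParallel across all processes
    max_epochs=24,
)
# trainer.fit(model, datamodule=datamodule)  # model/datamodule defined elsewhere
```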

yinchimaoliang commented 1 year ago

> Hi, I want to ask about the "0.3 lower" result: did you get it by directly using the checkpoint the author provided, or by retraining the model yourself? Best wishes~~

Hi, the result may jitter a little each run. The ckpt I provided corresponds to the result I reported. However, after the refactoring of this codebase, that ckpt can no longer be used; I will fix this soon. BTW, if you switch to an older version, you can reproduce the result.

pummi823 commented 1 year ago

Hi, but in #20 you said: "The batch size for each card is set to 2, and we use 4 machines. Learning rate is 2e-4." Doesn't this mean multi-machine training?

yinchimaoliang commented 1 year ago

> Hi, but in #20 you said: "The batch size for each card is set to 2, and we use 4 machines. Learning rate is 2e-4." Doesn't this mean multi-machine training?

Yes, all the results we reported in the paper were produced with our internal codebase, which we didn't release. Our internal codebase supports multi-machine training. The structure of BEVDepth is the same between the internal codebase and the open-source codebase. We have tried our best to reduce the gap between the two codebases, but there are still some differences between them.

foolhard commented 1 year ago

> Hi, I want to ask about the "0.3 lower" result: did you get it by directly using the checkpoint the author provided, or by retraining the model yourself? Best wishes~~

Thanks for your feedback.

I retrained the model with 4 GPUs and found the result is lower than yours:

  1. Basic BEVDepth (w/o tricks): NDS = 0.4237, mAP = 0.3188
  2. BEVDepth with CBGS, no EMA: NDS = 0.4724, mAP = 0.3434

Is this jitter too big?

yinchimaoliang commented 1 year ago

Yes, a little too big. Have you set the right batch size and learning rate?

foolhard commented 1 year ago

Actually, I just changed the GPU number in the training script, but didn't change the learning rate explicitly.

I assume you calculate the learning rate in your code based on the GPU number and batch_size_per_device: `lr = self.basic_lr_per_img * self.batch_size_per_device * self.gpus`
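To spell out that assumption, here is a minimal sketch of the linear scaling rule the formula above implies. The names `basic_lr_per_img`, `batch_size_per_device` and `gpus` are taken from the comment and may not match the actual BEVDepth code; the numbers in the usage example are purely hypothetical.

```python
# Linear scaling rule: lr grows with the total (global) batch size.
# Names follow the comment above; they are illustrative, not the repo's API.
def scaled_lr(basic_lr_per_img: float, batch_size_per_device: int, gpus: int) -> float:
    total_batch_size = batch_size_per_device * gpus  # samples per optimizer step
    return basic_lr_per_img * total_batch_size

# Hypothetical numbers: halving the GPU count halves the effective lr unless
# basic_lr_per_img (or the per-device batch size) is adjusted to compensate.
print(scaled_lr(1e-4, 2, 8))  # 0.0016 with 8 GPUs
print(scaled_lr(1e-4, 2, 4))  # 0.0008 with 4 GPUs
```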

The training command I used: `python [EXP_PATH] --amp_backend native -b 8 --gpus 4`

Anything wrong?