foolhard opened this issue 1 year ago
Hello,
Thanks for your great job and for open-sourcing the code.
I tested your code on a 4-GPU server and the result is 0.3 lower than yours. I want to train the model on 2 servers with 4 GPUs each to see whether I can align with your results.
Does this repo support distributed training on multiple nodes? If yes, how?
Best regards.
Hi there, right now our codebase doesn't support multi-machine training, and we are working on it. If you have any clue, you're welcome to contribute to this codebase.
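For anyone who wants to take a stab at contributing this, here is a rough sketch of what a multi-node launch could look like with the PyTorch Lightning Trainer the training script appears to use. The `--num_nodes` argument, the argparse wiring, and the environment-variable launch below are assumptions, not existing features of this repo:

```python
# Rough sketch only: a generic PyTorch Lightning multi-node launch, NOT a
# documented feature of this repo. The --num_nodes flag and the wiring below
# are assumptions about how a contribution could look.
import argparse
import pytorch_lightning as pl

def build_trainer(args: argparse.Namespace) -> pl.Trainer:
    # world size = gpus_per_node * num_nodes; DDP is required for multi-node
    return pl.Trainer(
        gpus=args.gpus,
        num_nodes=args.num_nodes,   # hypothetical extra CLI argument
        strategy="ddp",             # older Lightning versions: accelerator="ddp"
        amp_backend=args.amp_backend,
    )

# The same command would then be run on every node, e.g. for 2 nodes:
#   MASTER_ADDR=<node0-ip> MASTER_PORT=29500 NODE_RANK=0 \
#       python [EXP_PATH] --amp_backend native -b 8 --gpus 4 --num_nodes 2
#   MASTER_ADDR=<node0-ip> MASTER_PORT=29500 NODE_RANK=1 \
#       python [EXP_PATH] --amp_backend native -b 8 --gpus 4 --num_nodes 2
```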
Hi, I want to know about the "0.3 lower" result: is it from directly using the checkpoint the author provided, or did you retrain the model yourself? Best wishes.
Hi, the result may jitter each time. The ckpt I provided corresponds to the result I reported. However, after the restructuring of this codebase, this ckpt can no longer be used; I will fix this soon. BTW, if you switch to an older version, you can reproduce the result.
Hi, but in #20 you said: "The batch size for each card is set to 2, and we use 4 machines. Learning rate is 2e-4." Doesn't this mean multi-machine training?
Yes, all the results we claimed in the paper were obtained on our internal codebase, which we didn't release. Our internal codebase supports multi-machine training. The structure of BEVDepth is the same between the internal codebase and the open-source codebase. Although we have tried our best to reduce the gap between these two codebases, there are still some differences between them.
Thanks for your feedback.
I retrained the model with 4 GPUs and found the result is lower than yours:
Is this jitter too big?
Yes, a little too big. Have you set the right batch size and learning rate?
Actually I just changed the GPU number in the training script, but didn't change the learning rate explicitly.
I assume you calculate the learning rate in your code based on the GPU number and batch_size_per_device:
lr = self.basic_lr_per_img * self.batch_size_per_device * self.gpus
The training script I used:
python [EXP_PATH] --amp_backend native -b 8 --gpus 4
Anything wrong?
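As a side note, here is a minimal sketch of what that linear scaling rule implies for a 4-GPU run. The only numbers taken from this thread are "batch size 2 per card" and "lr 2e-4" from #20; everything else (GPUs per machine, how -b maps to the per-device batch) is an assumption:

```python
# Minimal sketch of the lr scaling formula quoted above. Every concrete
# number except "batch 2 per card" and "lr 2e-4" (from #20) is an assumption.

def effective_lr(basic_lr_per_img: float, batch_size_per_device: int, gpus: int) -> float:
    # Same rule as the codebase snippet: lr grows linearly with images per step.
    return basic_lr_per_img * batch_size_per_device * gpus

# Assume the reference run was 4 machines x 8 GPUs = 32 GPUs at batch 2 per
# card, which would back out basic_lr_per_img from the reported 2e-4:
basic_lr_per_img = 2e-4 / (2 * 32)

# A single 4-GPU machine at the same per-device batch would then train with:
print(effective_lr(basic_lr_per_img, 2, 4))  # 2.5e-05, i.e. 8x smaller

# So the lr does track the smaller world size automatically, but the total
# batch is still 8x smaller than the reference run, which by itself can shift
# the final numbers a bit.
```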