IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.
https://detrex.readthedocs.io/en/latest/
Apache License 2.0
1.95k stars 204 forks

Increase in Batch Size slows down runtime significantly #227

Closed by FabianSchuetze 1 year ago

FabianSchuetze commented 1 year ago

Hi,

thanks for the wonderful repo! It's a pleasure to read the code and to work with it.

The training runtime seems to increase dramatically when I increase the batch size, even though GPU utilization remains good and data_time is only a small fraction of the iteration time.

For example, with a batch size of 4 (with MaskDINO), I see:

[03/07 18:46:09 d2.utils.events]:  eta: 5 days, 15:55:47  iter: 59, ...  time: 1.3493  data_time: 0.0196 

Whilst with a batch size of 2 I see:

[03/07 18:43:33 d2.utils.events]:  eta: 3 days, 2:58:34  iter: 59, ... time: 0.7281  data_time: 0.0101 

My understanding is that if the time per iteration doubles when the batch size doubles, the total runtime should stay the same. I am a bit surprised that the time doubles even though data_time is still so low (the dataloader does not seem to be the bottleneck), GPU memory is still available, and utilization is not constantly maxed out. This is running on a single GPU with AMP enabled.

Why could that be? Can I do something about it?
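
The two log lines above can also be compared per image rather than per iteration. The following is a minimal sketch that only uses the `time` values and batch sizes quoted in this comment; the helper name `images_per_second` is made up for illustration and is not part of detectron2 or detrex.

```python
def images_per_second(batch_size: int, seconds_per_iter: float) -> float:
    """Throughput in images per second for one training iteration."""
    return batch_size / seconds_per_iter


# Values copied from the two log lines above (iter 59 in both runs).
throughput_bs2 = images_per_second(batch_size=2, seconds_per_iter=0.7281)
throughput_bs4 = images_per_second(batch_size=4, seconds_per_iter=1.3493)

print(f"batch size 2: {throughput_bs2:.2f} img/s")
print(f"batch size 4: {throughput_bs4:.2f} img/s")
print(f"relative change: {(throughput_bs4 / throughput_bs2 - 1) * 100:+.1f}%")
```

Measured this way, the batch size 4 run processes slightly more images per second than the batch size 2 run, so the longer ETA would come from the unchanged number of iterations rather than from slower per-image processing.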

rentainhe commented 1 year ago

Hello! Thanks for raising this issue. I was wondering, did you adjust max_iter for the larger batch size?

To the best of my knowledge, detectron2 only supports iteration-based training for now, so if you use a larger training batch size, you should set a smaller max_iter in your config. For example:

total_batch_size = 16
max_iter = 90000

total_batch_size = 32
max_iter = 45000
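
The same adjustment can be sketched as a small calculation, assuming the goal is to keep the total number of images seen during training constant. The helper `scale_max_iter` is hypothetical and not part of detectron2 or detrex; it only reproduces the arithmetic of the example above.

```python
def scale_max_iter(base_max_iter: int, base_batch_size: int, new_batch_size: int) -> int:
    """Return the max_iter that keeps the total number of training images constant."""
    total_images = base_max_iter * base_batch_size
    return round(total_images / new_batch_size)


# Reproduces the example above: doubling the batch size halves max_iter.
new_max_iter = scale_max_iter(base_max_iter=90000, base_batch_size=16, new_batch_size=32)
print(new_max_iter)  # 45000
```

Other iteration-based settings in a config, such as scheduler milestones or checkpoint periods, would presumably need the same scaling, but that depends on the config in use.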

FabianSchuetze commented 1 year ago

Thanks for the kind & informative reply!

Your answer makes a lot of sense - thanks.

rentainhe commented 1 year ago

You're welcome~ I also think an epoch-based trainer would be less confusing~ We will try to add an epoch-based trainer to detrex in a future version.