CSAILVision / semantic-segmentation-pytorch

Pytorch implementation for Semantic Segmentation/Scene Parsing on MIT ADE20K dataset
http://sceneparsing.csail.mit.edu/
BSD 3-Clause "New" or "Revised" License

Scaleup of training with 2 GPUs is far from 2x #294

Closed ivanlado closed 6 days ago

ivanlado commented 1 month ago

I compared the time it took to train the models using 2 GPUs vs. 1 GPU, and the result is that the scaleup with 2 GPUs is far from 2x. In fact, the scaleup with 2 GPUs is 1.17x with a batch size of 2 and 1.345x with a batch size of 8. What is happening? What is wrong?

I have looked at the messages displayed after every iteration, and although the "data" time does not vary with respect to the single-GPU case, the "time" measurement is at least twice as long in the 2-GPU case.

The comparisons have been made using the same hardware configurations.

ivanlado commented 6 days ago

I have solved this efficiency problem.

  1. The main problem: because the image size is chosen randomly, the GPUs do not necessarily get the same image size in each iteration. Every iteration, the GPUs must wait for the slowest one (the one with the biggest image), so some GPUs sit idle while waiting, which hinders scaleup. One way to keep the sizes in sync is sketched after this list.
  2. The code uses DataParallel to implement data parallelism. However, the official PyTorch documentation no longer recommends this module; DistributedDataParallel is said to be more efficient, and I have verified this with 2 GPUs and synchronized batch normalization (see the second sketch below). It stands to reason that this code still uses the former, since it is fairly old.
  3. The time measurements used in the code (the "data" and "time" values) may be meaningless: most CUDA operations are asynchronous, so measuring wall-clock time without synchronization is not the right way to do time profiling (see the last sketch below).
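For point 1, here is a minimal sketch of one possible fix, under the assumption that the candidate short-side sizes live in a tuple like the hypothetical `IMG_SIZES` below: derive the per-iteration size from the iteration index, so every process draws the same value and no GPU waits on a larger batch.

```python
import random

# Hypothetical candidate short-side sizes; the real list comes from the config.
IMG_SIZES = (300, 375, 450, 525, 600)

def pick_img_size(iter_idx):
    # Seed a local RNG with the iteration index: every process that calls
    # pick_img_size(i) with the same i gets the same size, so all GPUs
    # resize to the same resolution in a given iteration.
    rng = random.Random(iter_idx)
    return rng.choice(IMG_SIZES)
```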
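For point 2, a minimal sketch of the DistributedDataParallel setup I moved to, assuming a `torchrun` launch (so `LOCAL_RANK` is set) and using a tiny stand-in network instead of the real segmentation model:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; torchrun provides LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in for the real segmentation network (contains a BatchNorm layer).
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 8, 3, padding=1),
        torch.nn.BatchNorm2d(8),
    ).cuda(local_rank)

    # Convert ordinary BatchNorm layers to BatchNorm synchronized across GPUs.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = DDP(model, device_ids=[local_rank])

    # ... build a DistributedSampler-backed DataLoader and train as usual ...

if __name__ == "__main__":
    main()
```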
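For point 3, a minimal sketch of timing a step correctly with CUDA events plus an explicit synchronize, again using a stand-in layer instead of the real network:

```python
import torch

# CUDA kernels launch asynchronously, so wall-clock timers around the Python
# call only measure launch overhead. CUDA events time the work on the device.
model = torch.nn.Linear(1024, 1024).cuda()      # stand-in for the real network
x = torch.randn(64, 1024, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = model(x)
y.sum().backward()
end.record()

torch.cuda.synchronize()                        # wait for the recorded events
print(f"step time: {start.elapsed_time(end):.2f} ms")
```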