I compared the time it took to train the models using 2 GPUs vs. using 1 GPU, and the result was that the scaleup from training with 2 GPUs is far from 2x. In fact, the speedup with 2 GPUs is 1.17x with a batch size of 2, and 1.345x with a batch size of 8. What is happening? What is wrong?
I have looked at the messages displayed after every iteration, and although the "data" time does not vary with respect to the single-GPU case, the "time" time is at least twice as long in the 2-GPU case.
"data" time: The time it takes to load the data.
"time" time: The time it take to do a whole iteration, including loading the data, forward and backward props.
Disclaimer: These confusing terms are the ones uses in the code.
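For context, timers like these are usually computed with plain `time.time()` deltas around the data-loading step and the full optimizer step. The loop below is a minimal sketch of that pattern (toy model and data, just to make it runnable), not the actual code from the repository:

```python
import time

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy model and data; placeholders for the real training setup.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
loader = DataLoader(
    TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))),
    batch_size=8,
)

end = time.time()
for x, y in loader:
    data_time = time.time() - end        # "data": batch loading only

    loss = loss_fn(model(x), y)          # forward pass
    optimizer.zero_grad()
    loss.backward()                      # backward pass
    optimizer.step()

    iter_time = time.time() - end        # "time": the whole iteration
    print(f"data: {data_time:.4f}s  time: {iter_time:.4f}s")
    end = time.time()
```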
All comparisons were made on the same hardware configuration.
The main problem: since the image size is chosen randomly, the GPUs do not necessarily receive the same image size in a given iteration. Each iteration therefore lasts as long as the slowest GPU (the one with the biggest image), leaving the other GPUs idle while they wait and thereby hindering scaleup. A sketch of one possible workaround follows.
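One way to remove this imbalance is to derive the random training size deterministically from the iteration index, so that every GPU resizes its images to the same size within a given iteration, with no communication needed. This is just a hedged sketch of the idea; the names `SIZES` and `get_size_for_iter` are illustrative, not from the original code:

```python
import random

SIZES = [480, 512, 544, 576, 608]  # illustrative multi-scale choices

def get_size_for_iter(iteration, seed=0):
    # Seeding the RNG with the iteration index makes the draw a pure
    # function of the iteration, so every GPU/worker computes the same
    # size without sharing any state.
    rng = random.Random(seed + iteration)
    return rng.choice(SIZES)

# Every replica picks the same side length in, say, iteration 42:
assert get_size_for_iter(42) == get_size_for_iter(42)
```

With the size fixed per iteration, the forward/backward cost is balanced across GPUs, so none of them sits idle waiting for the one that drew the biggest image.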
The code makes use of DataParallel to implement data parallelism. However, the official PyTorch documentation no longer recommends this module; DistributedDataParallel is said to be more efficient, and indeed I have verified this with 2 GPUs and synchronized batch normalization. It stands to reason that the code uses the former simply because it predates this recommendation.
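For reference, here is a minimal sketch of a DistributedDataParallel setup with synchronized batch normalization, along the lines of what I used for the check. The model is a placeholder, not the repo's network, and the script assumes a `torchrun --nproc_per_node=2 train.py` launch:

```python
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for each spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder network standing in for the real model.
    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU()).cuda()

    # Replace every BatchNorm with its synchronized counterpart, so the
    # statistics are computed over the global batch rather than per GPU.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = DDP(model, device_ids=[local_rank])

    # ... build a DistributedSampler-backed DataLoader and train as usual ...

if __name__ == "__main__":
    main()
```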
The time measures used in the code ("data" and "time") might be misleading: most CUDA operations are launched asynchronously, so plain wall-clock deltas are not a reliable way to profile GPU code.
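To time GPU work correctly, you have to synchronize. Below is a minimal sketch using CUDA events; the matrix multiply is just an arbitrary workload for illustration, and a GPU is obviously required:

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

x = torch.randn(1024, 1024, device="cuda")

start.record()
y = x @ x                      # the work being timed
end.record()

# Without this, we would only measure the (tiny) kernel launch cost,
# not the kernel itself.
torch.cuda.synchronize()
print(f"elapsed: {start.elapsed_time(end):.3f} ms")
```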