Closed peterzpy closed 3 years ago
Hi @peterzpy,
Are you referring to the CUDA-native or the PyTorch in-place versions?
I mean the results reported in Table 1 of your paper; I cannot tell which version was used there. Also, why does the ResNet50 baseline reach only 72.9% accuracy? The standard ResNet50 achieves about 76%+ on ImageNet. Looking forward to your reply, thanks!
The speeds for sum and average pooling are based on the current TensorFlow/PyTorch implementations available. While both libraries have native implementations for average pooling, neither has one for sum pooling, so we created our own. In general, we considered two different approaches:

1. `conv_xd` with all kernel values set to 1.
2. `avg_pool` followed by in-place multiplication by the number of kernel elements (similar to what we did for SoftPool in the non-CUDA-native version).

On average, the timing tests favoured Option 2, agreeing with a previously related StackOverflow question [link].
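The two options above can be sketched in PyTorch as follows (a minimal illustration, not the repository's actual code; tensor shapes and kernel size are made up for the example):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)  # N, C, H, W
k = 2                        # pooling kernel size

# Option 1: sum pooling as a depthwise convolution with an all-ones kernel
ones = torch.ones(x.size(1), 1, k, k)
sum_conv = F.conv2d(x, ones, stride=k, groups=x.size(1))

# Option 2: average pooling, then multiplying by the kernel area
sum_avg = F.avg_pool2d(x, k) * (k * k)

# Both compute the same sum pooling, up to floating-point error
print(torch.allclose(sum_conv, sum_avg, atol=1e-6))
```

Option 2 avoids materialising a weight tensor and dispatches to the heavily optimised pooling kernels, which is consistent with it timing faster in practice.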
In terms of the small accuracy difference between our base models and the ones reported in the literature (for example, 76.15% for ResNet50 on PyTorch's website), this is due to our experimental environment and conditions (mainly the batch size, as we used a single machine with 4x 2080Tis). The training script is also exactly the same as the one used by PyTorch, with only some very minor modifications [link].
Best, Alex
Thanks for your reply. As you said, you halved the standard batch size due to GPU memory limits. But did you reduce the learning rate correspondingly? That may mitigate the performance drop caused by the batch-size change.
We also tested a learning rate of 1e-2 as the starting point, with overall worse accuracy. I believe such performance changes with batch size are attributable to broader problems (e.g. the no-free-lunch theorem): forming mini-batches of reduced size that remain balanced/representative of the target classes is especially challenging in settings like this (1K classes).
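For reference, the linear-scaling heuristic the question alludes to can be sketched as below (a hypothetical helper, assuming the common ImageNet reference point of lr 0.1 at batch size 256; the numbers are illustrative, not from the paper):

```python
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling rule: learning rate scales proportionally to batch size."""
    return base_lr * batch / base_batch

# Halving the batch from 256 to 128 would halve the reference lr of 0.1
print(scaled_lr(0.1, 256, 128))  # 0.05
```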
Thanks again, I have no more questions now; you can close this issue. : )
I found an interesting result in your paper: the sum operation is slower than the average operation by a large margin. Could you please explain the reason for this? Thank you!