Closed peterzpy closed 3 years ago
Hi @peterzpy,
Are you referring to the CUDA-native or the PyTorch in-place versions?
I mean the results reported in Table 1 of your paper; I cannot tell which version was used there. Also, why does the ResNet50 baseline reach only 72.9% accuracy? The standard ResNet50 achieves about 76%+ on ImageNet. Looking forward to your reply, thanks!
The speeds for sum and average pooling are based on the current TensorFlow/PyTorch implementations available. While both libraries have native implementations for average pooling, neither has one for sum pooling, so we created our own. In general, we considered two different approaches:

1. `conv_xd` with all kernel values set to 1.
2. `avg_pool` followed by in-place multiplication by the number of kernel elements (similar to what we did for SoftPool in the non-CUDA-native version).

On average, the timing tests favoured Option 2, agreeing with a previously related StackOverflow question [link].
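The two options above can be sketched in PyTorch as follows (a minimal illustration, not the repository's actual code; tensor shapes and kernel size are made up for the example):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)  # N, C, H, W
k = 2                        # pooling kernel size

# Option 1: sum pooling as a depthwise convolution with an all-ones kernel
ones = torch.ones(x.size(1), 1, k, k)
sum_conv = F.conv2d(x, ones, stride=k, groups=x.size(1))

# Option 2: average pooling, then multiplying by the kernel area
sum_avg = F.avg_pool2d(x, k) * (k * k)

# Both compute the same sum pooling, up to floating-point error
print(torch.allclose(sum_conv, sum_avg, atol=1e-6))
```

Option 2 avoids materialising a weight tensor and dispatches to the heavily optimised pooling kernels, which is consistent with it timing faster in practice.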
In terms of the small accuracy difference between our base models and the ones reported in the literature (for example, 76.15% for ResNet50 on PyTorch's website), this is due to our experimental environment and conditions (mainly the batch size, as we used a single machine with 4x 2080Tis). The training script is also exactly the same as the one used by PyTorch, with only some very minor modifications [link].
Best, Alex
Thanks for your reply. As you said, you halved the standard batch size due to GPU memory limits. But did you reduce the learning rate correspondingly? That may mitigate the performance drop caused by the batch-size change.
We also tested a learning rate of 1e-2 as the starting point, with overall worse accuracy. I believe such performance changes with batch size are attributable to broader problems (e.g. the no-free-lunch theorem): forming mini-batches of reduced size that remain balanced/representative of the target classes is especially challenging in settings like this (1K classes).
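For reference, the linear-scaling heuristic the question alludes to can be sketched as below (a hypothetical helper, assuming the common ImageNet reference point of lr 0.1 at batch size 256; the numbers are illustrative, not from the paper):

```python
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling rule: learning rate scales proportionally to batch size."""
    return base_lr * batch / base_batch

# Halving the batch from 256 to 128 would halve the reference lr of 0.1
print(scaled_lr(0.1, 256, 128))  # 0.05
```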
Thanks again, I have no more questions now; you can close this issue. : )
I found an interesting result in your paper: the sum operation is slower than the average operation by a large margin. Could you please explain the reason for this? Thank you!