Closed: valillon closed this issue 3 years ago
The SAGAN source code is the source of truth; where it differs from the paper, the source code is correct.
So be it.
Any reason for those max_pool2d() calls inside the attention layer, apart from saving computation?
Design-wise, yep, it's just for compute. You can downsample the keys and values with little loss in quality while cutting the compute/memory overhead of the attention map.
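Roughly, the idea looks like this. Just a sketch of the downsampling trick, not the exact layer from the paper or the official repo; the channel reductions (C/8 for queries/keys, C/2 for values) and the 2x pooling factor are assumptions based on common implementations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PooledSelfAttention2d(nn.Module):
    """SAGAN-style self-attention with max-pooled keys/values (sketch)."""

    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels // 8, 1, bias=False)  # queries
        self.phi = nn.Conv2d(channels, channels // 8, 1, bias=False)    # keys
        self.g = nn.Conv2d(channels, channels // 2, 1, bias=False)      # values
        self.o = nn.Conv2d(channels // 2, channels, 1, bias=False)      # output projection
        self.gamma = nn.Parameter(torch.zeros(1))                       # learned residual weight

    def forward(self, x):
        # Assumes even spatial dimensions so the 2x pooling divides cleanly.
        b, c, h, w = x.shape
        # Queries keep full resolution: (b, c//8, h*w)
        theta = self.theta(x).view(b, -1, h * w)
        # Keys/values are max-pooled 2x, so attention runs over (h*w)/4 locations
        phi = F.max_pool2d(self.phi(x), 2).view(b, -1, (h * w) // 4)
        g = F.max_pool2d(self.g(x), 2).view(b, -1, (h * w) // 4)
        # Attention map: (b, h*w, h*w/4), i.e. 4x smaller than without pooling
        attn = torch.softmax(torch.bmm(theta.transpose(1, 2), phi), dim=-1)
        # Weighted sum of values, reshaped back to the spatial layout
        out = torch.bmm(g, attn.transpose(1, 2)).view(b, -1, h, w)
        return x + self.gamma * self.o(out)
```

The query branch stays at full resolution, so the output still has one attention vector per pixel; only the set of locations being attended to is reduced.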
Good to know. Thanks for the hint, Andy!
There is an interesting discrepancy regarding the original attention layer proposed in SAGAN. The published method and the actual implementation differ significantly, as reported in the official repo and in another well-known PyTorch implementation. There, I recently proposed an implementation that strictly follows the description in the original paper, at least as I understand it.
So, in the absence of any other explanation, which is the correct method: the published one or the implemented one?