Closed: valillon closed this issue 3 years ago
The SAGAN source code is the source of truth; where it differs from the paper, the source code is correct.
So be it.
Any reason for those max_pool2d() calls inside the attention layer, apart from saving computation?
Design-wise, yep, it's just for compute. You can downsample the keys and values with little loss in quality while cutting the compute/memory overhead of the attention map.
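Roughly, the idea looks like this. Just a sketch of the downsampling trick, not the exact layer from the paper or the official repo; the channel reductions (C/8 for queries/keys, C/2 for values) and the 2x pooling factor are assumptions based on common implementations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PooledSelfAttention2d(nn.Module):
    """SAGAN-style self-attention with max-pooled keys/values (sketch)."""

    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels // 8, 1, bias=False)  # queries
        self.phi = nn.Conv2d(channels, channels // 8, 1, bias=False)    # keys
        self.g = nn.Conv2d(channels, channels // 2, 1, bias=False)      # values
        self.o = nn.Conv2d(channels // 2, channels, 1, bias=False)      # output projection
        self.gamma = nn.Parameter(torch.zeros(1))                       # learned residual weight

    def forward(self, x):
        # Assumes even spatial dimensions so the 2x pooling divides cleanly.
        b, c, h, w = x.shape
        # Queries keep full resolution: (b, c//8, h*w)
        theta = self.theta(x).view(b, -1, h * w)
        # Keys/values are max-pooled 2x, so attention runs over (h*w)/4 locations
        phi = F.max_pool2d(self.phi(x), 2).view(b, -1, (h * w) // 4)
        g = F.max_pool2d(self.g(x), 2).view(b, -1, (h * w) // 4)
        # Attention map: (b, h*w, h*w/4), i.e. 4x smaller than without pooling
        attn = torch.softmax(torch.bmm(theta.transpose(1, 2), phi), dim=-1)
        # Weighted sum of values, reshaped back to the spatial layout
        out = torch.bmm(g, attn.transpose(1, 2)).view(b, -1, h, w)
        return x + self.gamma * self.o(out)
```

The query branch stays at full resolution, so the output still has one attention vector per pixel; only the set of locations being attended to is reduced.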
Good to know. Thanks for the hint, Andy!
There is an interesting discrepancy regarding the original attention layer proposed in SAGAN. The published method and the actual implementation differ significantly, as reported in the official repo and in another well-known PyTorch implementation. There, I recently proposed an implementation that strictly follows the description in the original paper, at least as I understand it.
So, in the absence of any other explanation, which is the correct method: the published one or the implemented one?