Thanks for releasing the code!
I have been reviewing how the Gumbel-Softmax trick [1] is used. Both the paper and the code say that the "relevance scores are interpreted as log probabilities" [2], but log probabilities are non-positive, whereas the output of a convolutional layer is unconstrained in sign. How can that output be interpreted as a log probability? (This is unlikely to break training, but it may silently yield suboptimal performance due to inaccurate approximate sampling from the discrete distribution.)
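For reference, here is a minimal sketch of the check I have in mind. It is plain Python with hypothetical scores, and the helper names are mine, not taken from the repository. As far as I can tell, softmax is invariant to adding the same constant to every logit, so raw scores and their `log_softmax`-normalized version should give the same relaxed sample when the same Gumbel noise is used. This does not settle how the repository code actually treats the scores.

```python
import math
import random


def log_softmax(scores):
    """Normalize raw scores into proper log-probabilities (all <= 0)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def sample_gumbel(n, rng):
    """Draw n samples g = -log(-log(u)), with u clamped away from 0 and 1."""
    eps = 1e-12
    out = []
    for _ in range(n):
        u = min(max(rng.random(), eps), 1.0 - eps)
        out.append(-math.log(-math.log(u)))
    return out


def gumbel_softmax(logits, gumbel_noise, tau=1.0):
    """Relaxed sample: softmax((logits + g) / tau), as in Equation 1 of [1]."""
    y = [(l + g) / tau for l, g in zip(logits, gumbel_noise)]
    m = max(y)
    e = [math.exp(v - m) for v in y]
    z = sum(e)
    return [v / z for v in e]


rng = random.Random(0)
raw_scores = [2.0, -1.0, 0.5]        # hypothetical layer outputs, mixed sign
log_probs = log_softmax(raw_scores)  # normalized, strictly non-positive
noise = sample_gumbel(len(raw_scores), rng)

y_raw = gumbel_softmax(raw_scores, noise, tau=0.5)
y_norm = gumbel_softmax(log_probs, noise, tau=0.5)

# Largest elementwise gap between the two relaxed samples.
max_diff = max(abs(a - b) for a, b in zip(y_raw, y_norm))
print(max_diff)
```

If the two samples agree up to floating-point error, the raw scores would be acting as unnormalized log probabilities, assuming the softmax is taken over the same set of categories in both cases.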
Please let me know if there is a subtle intuition or training dynamic at play here that I am missing. Thanks!
[1] https://arxiv.org/pdf/1611.01144.pdf (Equation 1)
[2] https://arxiv.org/pdf/1711.11503.pdf (Section 3.3, page 5)