I implemented it. It does not noticeably help BPC (bits per character) or sentiment transfer with an RNN hidden size of 1024, but we should share the code anyway. MoS might be more helpful with a larger output softmax, such as word parts (the character-level softmax has only 64 outputs), so we might as well have it in there.
It does increase memory usage, especially for large hidden states.
http://smerity.com/articles/2017/mixture_of_softmaxes.html
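For anyone who wants to see where the extra memory goes, here is a minimal pure-Python sketch of a mixture-of-softmaxes output layer. It is not the code I implemented, just the usual formulation as I understand it: a prior over K components, one tanh projection of the hidden state per component, and a shared output matrix. The names (`mixture_of_softmaxes`, `W_prior`, `W_latent`, `W_out`) and the toy sizes are assumptions chosen for illustration.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    # Small random weights, for illustration only.
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def mixture_of_softmaxes(h, W_prior, W_latent, W_out):
    """Return a distribution over the vocabulary for one hidden state.

    h:        hidden state, length d
    W_prior:  K x d, produces the mixture weights
    W_latent: list of K matrices, each d x d, one projection per component
    W_out:    V x d, output matrix shared by all components
    """
    prior = softmax(matvec(W_prior, h))           # K mixture weights
    probs = [0.0] * len(W_out)
    for pi_k, W_k in zip(prior, W_latent):
        h_k = [math.tanh(x) for x in matvec(W_k, h)]  # component hidden state
        component = softmax(matvec(W_out, h_k))       # full softmax per component
        for i, p in enumerate(component):
            probs[i] += pi_k * p                      # mix in probability space
    return probs


rng = random.Random(0)
d, K, V = 8, 3, 64  # toy hidden size, hypothetical K, character-sized softmax
h = [rng.uniform(-1, 1) for _ in range(d)]
W_prior = rand_matrix(K, d, rng)
W_latent = [rand_matrix(d, d, rng) for _ in range(K)]
W_out = rand_matrix(V, d, rng)

probs = mixture_of_softmaxes(h, W_prior, W_latent, W_out)

# Weights added on top of a plain softmax layer (biases ignored).
extra_params = K * d * d + K * d
```

In this sketch the extra weights grow as K * d * d, and each time step also computes K projected hidden states and K full softmaxes. With d = 1024 and a hypothetical K = 10, that is roughly 10.5M extra weights, which would be consistent with the memory cost growing quickly with hidden size.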