Add MoS (Mixture of Softmax) option for next-char prediction

http://smerity.com/articles/2017/mixture_of_softmaxes.html

I implemented it. It does not noticeably help BPC (bits per character) or sentiment transfer with an RNN hidden size of 1024, but we should share the code anyway. MoS might be more helpful with a larger output softmax, such as word parts (the character-level softmax here has only 64 outputs), so we might as well have the option in there.
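For readers unfamiliar with MoS, here is a minimal sketch of the output layer as it is usually formulated: a prior over K experts, one tanh-projected hidden state per expert, and a weighted sum of the K softmaxes. It is written in plain Python so it runs without dependencies and is not the code in this PR; the names `mos_probs`, `W_prior`, `W_proj`, and `W_out`, and the sizes `d = 8` and `K = 3`, are assumptions chosen for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def mos_probs(h, W_prior, W_proj, W_out):
    """Mixture-of-softmaxes distribution over the vocabulary.

    h:       hidden state, length d
    W_prior: K x d matrix producing the mixture weights
    W_proj:  list of K matrices (d x d), one projection per expert
    W_out:   V x d output matrix shared by all experts
    (All names are hypothetical, for illustration only.)
    """
    prior = softmax(matvec(W_prior, h))           # pi_k, sums to 1
    mixed = [0.0] * len(W_out)
    for pi_k, W_k in zip(prior, W_proj):
        h_k = [math.tanh(x) for x in matvec(W_k, h)]
        p_k = softmax(matvec(W_out, h_k))         # expert k's softmax
        mixed = [m + pi_k * p for m, p in zip(mixed, p_k)]
    return mixed


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, K, V = 8, 3, 64  # V = 64 matches the character softmax size mentioned above
h = [rng.uniform(-1.0, 1.0) for _ in range(d)]
probs = mos_probs(
    h,
    random_matrix(K, d, rng),
    [random_matrix(d, d, rng) for _ in range(K)],
    random_matrix(V, d, rng),
)
```

Because the mixing happens in probability space rather than logit space, the result is still a valid distribution over the 64 characters, but it is no longer limited to what a single softmax over one hidden state can express.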

It does increase memory usage, especially for large hidden states.
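A rough way to see why the cost grows with hidden size is to count the extra weights. The sketch below assumes the common MoS layout (one hidden-to-hidden projection per expert plus a hidden-to-K prior layer); the function name `mos_extra_params` and the expert count of 5 are hypothetical, and the actual layout in this PR may differ.

```python
def mos_extra_params(hidden_size, n_experts, bias=True):
    """Extra parameters MoS adds on top of a plain softmax layer.

    Assumes one hidden_size x hidden_size projection per expert and a
    hidden_size x n_experts prior layer (hypothetical layout).
    """
    proj = hidden_size * n_experts * hidden_size  # quadratic in hidden size
    prior = hidden_size * n_experts
    if bias:
        proj += n_experts * hidden_size
        prior += n_experts
    return proj + prior


# Plain softmax over 64 characters with a 1024-unit hidden state.
plain_softmax_params = 1024 * 64 + 64

# Extra MoS weights for the same hidden size, assuming 5 experts.
extra_1024 = mos_extra_params(1024, 5)

# Doubling the hidden size roughly quadruples the extra weights.
extra_2048 = mos_extra_params(2048, 5)
```

Under these assumptions the per-expert projections dominate, and since they scale with the square of the hidden size, they quickly dwarf the 64-way output layer itself. Activation memory also grows, because each time step produces K projected hidden states and K sets of logits instead of one.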

NVIDIA / sentiment-discovery
