Smerity / sha-rnn

Single Headed Attention RNN - "Stop thinking with your head"

SplitCrossEntropy #10

Open gslaller opened 4 years ago

gslaller commented 4 years ago

Can you provide any further information on the loss function you are using? Perhaps a reference to a paper?

munael commented 3 years ago

@gslaller - Seems to be from here: *Efficient softmax approximation for GPUs*

See: https://twitter.com/Smerity/status/1343159498081366017

The SHA-RNN paper itself only uses it because it was already part of AWD-LSTM. It's the adaptive softmax from the linked FAIR paper: essentially a computationally efficient hierarchical softmax, where the vocabulary is split by frequency into a small head cluster and progressively larger tail clusters, so most predictions only pay for the cheap head. Almost all Facebook (FAIR) codebases use it. Hope that helps! https://arxiv.org/abs/1609.04309
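For reference, PyTorch ships its own implementation of that paper's adaptive softmax as `torch.nn.AdaptiveLogSoftmaxWithLoss`, which is the same idea as the repo's hand-rolled `SplitCrossEntropyLoss` (in `splitcross.py`). A minimal sketch of how it's used, with hypothetical sizes and cutoffs rather than the repo's actual hyperparameters:

```python
# Sketch of the adaptive-softmax idea via PyTorch's built-in
# nn.AdaptiveLogSoftmaxWithLoss (not the repo's SplitCrossEntropyLoss itself).
import torch
import torch.nn as nn

hidden_dim = 512      # assumed model output size
vocab_size = 50000    # assumed vocabulary size

# Tokens are split into a frequent "head" and progressively rarer "tail"
# clusters at these cutoff indices; this assumes the vocabulary ids are
# sorted by descending frequency. Cutoff values here are hypothetical.
adaptive_softmax = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden_dim,
    n_classes=vocab_size,
    cutoffs=[2000, 10000],
)

hidden = torch.randn(32, hidden_dim)            # fake batch of hidden states
targets = torch.randint(0, vocab_size, (32,))   # fake target token ids

# Returns a named tuple: per-example log-probs of the targets, plus the
# mean negative log-likelihood over the batch.
output, loss = adaptive_softmax(hidden, targets)
print(loss)
```

Because frequent tokens dominate real text, most training examples only touch the small head matrix, which is where the speedup over a full softmax over all 50k classes comes from.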