Open gslaller opened 4 years ago
@gslaller - It seems to come from this paper: "Efficient softmax approximation for GPUs"
See: https://twitter.com/Smerity/status/1343159498081366017
The SHA-RNN paper itself only uses it because it was already part of AWD-LSTM. It's the adaptive softmax from the linked FAIR paper. Almost all Facebook (FAIR) codebases use it. Essentially, it's a computationally efficient hierarchical softmax. Hope that helps! https://arxiv.org/abs/1609.04309
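To make the "hierarchical softmax" idea a bit more concrete, here is a rough, stdlib-only sketch of a two-cluster adaptive softmax loss. It is not the code from this repo or from the paper. The function names, the single `cutoff`, and the toy logits are all assumptions for illustration. It assumes word ids are sorted by frequency, so ids below `cutoff` are the frequent "head" words and the rest fall into one "tail" cluster.

```python
import math


def log_softmax(logits):
    # Numerically stable log-softmax over a plain list of floats.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def adaptive_log_prob(head_logits, tail_logits, cutoff, word):
    """Log-probability of `word` under a two-cluster adaptive softmax.

    head_logits: `cutoff` scores for the frequent words, plus one extra
                 score (at index `cutoff`) for "the word is in the tail".
    tail_logits: scores for the rare words (ids cutoff, cutoff + 1, ...).
    """
    head = log_softmax(head_logits)
    if word < cutoff:
        # Frequent word: only the small head softmax is needed.
        return head[word]
    # Rare word: log p(tail cluster) + log p(word | tail cluster).
    tail = log_softmax(tail_logits)
    return head[cutoff] + tail[word - cutoff]


def adaptive_nll(head_logits, tail_logits, cutoff, target):
    # Negative log-likelihood of the target word, i.e. the training loss.
    return -adaptive_log_prob(head_logits, tail_logits, cutoff, target)


# Toy example (hypothetical numbers): 3 frequent words + 3 rare words.
head_logits = [2.0, 1.0, 0.5, 0.0]   # last entry scores the tail cluster
tail_logits = [0.3, -0.2, 0.1]
loss_frequent = adaptive_nll(head_logits, tail_logits, cutoff=3, target=0)
loss_rare = adaptive_nll(head_logits, tail_logits, cutoff=3, target=4)
```

The saving comes from the fact that most targets are frequent words, so most of the time only the small head softmax has to be evaluated. As far as I understand the paper, the real method also uses several tail clusters and a reduced projection size for the rare ones, which this sketch leaves out. If you are on PyTorch, `torch.nn.AdaptiveLogSoftmaxWithLoss` is a ready-made implementation of the same idea.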
Can you provide any further information on the loss function you are using? Perhaps a reference to a paper?