Out of curiosity: for the one-hot embedding experiment described in the paper, did you feed in a plain one-hot representation, or did you pass the one-hot through an embedding layer first and train that layer together with the rest of the model?
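
To make the two options I have in mind concrete, here is a minimal NumPy sketch. The sizes, the variable names, and the random `embedding_matrix` are all made up for illustration and are not taken from the paper; in a real model the matrix would be a trainable parameter.

```python
import numpy as np

vocab_size = 5   # hypothetical number of categories
embed_dim = 3    # hypothetical embedding width
token_ids = np.array([0, 3, 3, 1])  # hypothetical input indices

# Option A: plain one-hot vectors are fed directly to the model.
one_hot = np.eye(vocab_size)[token_ids]          # shape (4, 5)

# Option B: the one-hot is first projected by an embedding matrix.
# Here the matrix is random; in training it would be learned jointly
# with the rest of the model.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))
embedded = one_hot @ embedding_matrix            # shape (4, 3)

# Multiplying a one-hot row by the matrix is equivalent to a row lookup,
# which is what an embedding layer does.
lookup = embedding_matrix[token_ids]
assert np.allclose(embedded, lookup)
```

In option A the model sees a sparse `vocab_size`-dimensional input, while in option B it sees a dense `embed_dim`-dimensional vector, so the two setups could behave differently in the experiment.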