The tutorials show how to train these models, but not how to get a probability afterwards.
When training with cross entropy, the model does not output a probability directly: the probability is computed inside the loss function via the softmax. That softmax, however, requires negative samples, so the probability for a pair depends on the match scores of the other pairs in its group. When we then want to use the trained model to decide whether two documents match, how do we compute a probability? Are we supposed to create negative samples at test time as well? That seems impractical, and the result would depend on which negatives happened to be sampled. Basically, I'm asking what function to apply to the logit the model outputs at test time.
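To make the issue concrete, here is a minimal sketch (my own simplification, not the library's exact code) of how I understand the training-time probability to be formed: the softmax runs over the positive pair's score together with the scores of its sampled negatives, so the "probability" of a pair only exists relative to a group.

```python
import torch
import torch.nn.functional as F

# Hypothetical match scores from the model for one query:
# index 0 is the positive document, indices 1.. are sampled negatives.
scores = torch.tensor([2.3, 0.1, -1.2, 0.5])  # made-up logits

# Training-time "probability" of the positive pair: softmax over the group.
probs = F.softmax(scores, dim=0)
p_positive = probs[0]  # changes if different negatives are sampled

# At test time there is only a single pair, so there is no group to
# normalize over -- this is exactly the problem in my question.
single_logit = torch.tensor(2.3)
print(p_positive, single_logit)
```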
Is softmax even the correct function to use with negative samples in cross entropy? Other negative-sampling strategies apply the sigmoid function to each match independently. For example, word2vec takes the sigmoid of the dot product between a word and its context (or a negative sample); for details, see equation 55 on p. 13 of this paper: https://arxiv.org/abs/1411.2738 . I may implement my own version of cross entropy using that framework, but I thought I would ask here first.
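For comparison, here is a sketch of the sigmoid-based objective from the word2vec negative-sampling formulation (equation 55 of the paper above); the function name and scores are my own for illustration. Each pair gets an independent probability, so no group normalization is needed at test time.

```python
import torch
import torch.nn.functional as F

def negative_sampling_loss(pos_score, neg_scores):
    """word2vec-style negative sampling (eq. 55 of arXiv:1411.2738):
    -log sigmoid(pos) - sum(log sigmoid(-neg)).
    Each score gets an independent sigmoid probability."""
    pos_term = F.logsigmoid(pos_score)
    neg_term = F.logsigmoid(-neg_scores).sum()
    return -(pos_term + neg_term)

# Made-up dot-product scores for one (word, context) pair and its negatives.
pos = torch.tensor(1.7)
negs = torch.tensor([0.3, -0.8, 1.1])
loss = negative_sampling_loss(pos, negs)

# At test time, each pair's probability is simply sigmoid(score),
# independent of any negatives.
p_match = torch.sigmoid(pos)
print(loss, p_match)
```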
Edit: I should clarify that this is using RankCrossEntropyLoss, which uses the softmax function.