zhhu1996 opened this issue 1 month ago
@jiaqizhai
@jiaqizhai
Hi, I have the same question. I did some debugging on the training code the author provided for the public dataset, and below is my analysis of this loss function:

In `ar_loss`, there is a one-token offset between `output_embeddings` and `supervision_embeddings`; this implements the next-token-prediction objective. `self._model.interaction` computes the similarity between the predicted token embedding and the positive-sample embedding as well as the sampled-negative embeddings. A common choice of similarity is the dot product (the author's code also uses the dot product). If you are familiar with contrastive learning, this is one of the steps in computing a contrastive loss. `self._model.interaction` produces the positive and negative logits, and then the final loss, `jagged_loss = -F.log_softmax(torch.cat([positive_logits, sampled_negatives_logits], dim=1), dim=1)[:, 0]`, is the standard contrastive-loss computation. If I understand correctly, the code is equivalent to the following equation:

$\text{loss} = -\log\left(\frac{e^{y^+}}{e^{y^+} + \sum_{i=1}^{n}e^{y^-_i}}\right)$

where $y^+$ is the positive logit and $y^-_i$ are the sampled-negative logits. This is just my personal understanding and may contain mistakes. Discussion is welcome, and it would be even better if the authors could provide an official explanation!
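To make the equivalence concrete, here is a minimal, self-contained sketch (not the repository's actual code; the shapes `B`, `N`, `D` and the variable names are illustrative) showing that the `log_softmax`-over-concatenated-logits form matches the closed-form equation above:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: B = batch size, N = sampled negatives, D = embedding dim.
B, N, D = 4, 8, 16
torch.manual_seed(0)

predicted = torch.randn(B, D)      # next-token predictions
positives = torch.randn(B, D)      # supervision embeddings (offset by one token)
negatives = torch.randn(B, N, D)   # sampled negative embeddings

# Dot-product similarity, as in the paper's reported experiments.
positive_logits = (predicted * positives).sum(dim=-1, keepdim=True)          # [B, 1]
sampled_negatives_logits = torch.einsum("bd,bnd->bn", predicted, negatives)  # [B, N]

# The positive logit sits in column 0, so [:, 0] of -log_softmax is the
# negative log of the softmax probability assigned to the positive.
jagged_loss = -F.log_softmax(
    torch.cat([positive_logits, sampled_negatives_logits], dim=1), dim=1
)[:, 0]

# Closed form from the equation: -log(e^{y+} / (e^{y+} + sum_i e^{y-_i})).
pos = positive_logits.squeeze(1)
manual = -torch.log(pos.exp() / (pos.exp() + sampled_negatives_logits.exp().sum(dim=1)))

assert torch.allclose(jagged_loss, manual, atol=1e-4)
```

Note that `F.log_softmax` is the numerically stable way to compute this; exponentiating the logits directly, as in the `manual` version, can overflow for large logit values.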
Hi, thanks for your interest in our work and for @Blank-z0's explanations!
1-4/ are correct. To elaborate a bit more on 3/ - we abstract out similarity function computations in this codebase in order to support alternative learned similarity functions like FMs, MoL, etc. besides dot products in a unified API. The experiments reported in the ICML paper were all conducted with dot products / cosine similarity to simplify discussions. Further references/discussions for learned similarities can be found in Revisiting Neural Retrieval on Accelerators, KDD'23, with follow-up work by LinkedIn folks in LiNR: Model Based Neural Retrieval on GPUs at LinkedIn, CIKM'24; we've also provided experiment results that integrate HSTU and MoL in Efficient Retrieval with Learned Similarities (though that paper focuses on theoretical justifications for using learned similarities).
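The "unified API" idea described above can be sketched as follows. This is a hypothetical illustration, not the repository's actual class hierarchy; the names `SimilarityFn` and `DotProductSimilarity` are made up for this example:

```python
import torch

class SimilarityFn(torch.nn.Module):
    """Illustrative interface: maps (query [B, D], items [B, N, D]) -> logits [B, N]."""

    def forward(self, query: torch.Tensor, items: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError

class DotProductSimilarity(SimilarityFn):
    """The similarity used in the ICML paper's experiments."""

    def forward(self, query: torch.Tensor, items: torch.Tensor) -> torch.Tensor:
        # Batched dot product between each query and its candidate items.
        return torch.einsum("bd,bnd->bn", query, items)

# A learned similarity (e.g. MoL or an FM) would subclass SimilarityFn with
# additional parameters; the loss computation is unchanged because it only
# sees logits through this interface.
sim = DotProductSimilarity()
q = torch.randn(2, 4)
items = torch.randn(2, 3, 4)
logits = sim(q, items)  # shape [2, 3]
```

The design point is that the sampled-softmax loss never needs to know which similarity is in use, which is what makes swapping dot products for learned similarities a one-line change at the model level.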
Hey, congratulations on your excellent and creative work! When reading the implementation code here, I was quite confused by SampledSoftmaxLoss. I have some questions about it:
Please give me some advice when you have time, thanks~