First of all, thanks for your amazing work.
That said, I have a few questions about the proposed method:
The paper demonstrates that CLIFF’s improvements over the cross-entropy baseline are more consistent than those of the Unlikelihood method. However, a drop in ROUGE-L still occurs when the contrastive loss is used. Why is that?
The paper also says that a key advantage of CLIFF lies in measuring representation similarities between positive and negative samples in the same batch. Is this referring to the contrastive loss formulation in (1)?
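To make the question concrete, here is a rough sketch of what I understand the in-batch similarity computation to mean. This is purely illustrative: the mean-pooled representations, the masking scheme, and the temperature `tau` are my assumptions, not necessarily the exact formulation in (1).

```python
# Illustrative sketch (not the paper's exact loss): an in-batch contrastive
# objective over pooled summary representations, where positive (faithful)
# summaries are pulled together and negatives in the same batch are pushed away.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(reps: torch.Tensor,
                              labels: torch.Tensor,
                              tau: float = 0.1) -> torch.Tensor:
    """reps: (batch, hidden) pooled summary representations.
    labels: (batch,) with 1 for positive summaries, 0 for negative ones."""
    reps = F.normalize(reps, dim=-1)              # cosine-similarity space
    sim = reps @ reps.t() / tau                   # (batch, batch) similarity matrix
    batch = reps.size(0)
    eye = torch.eye(batch, dtype=torch.bool)
    sim = sim.masked_fill(eye, float("-inf"))     # exclude self-similarity

    # pairs of distinct positive samples serve as anchor/positive pairs
    pos_mask = (labels.unsqueeze(0) * labels.unsqueeze(1)).bool() & ~eye
    losses = []
    for i in range(batch):
        if labels[i] == 0 or pos_mask[i].sum() == 0:
            continue                              # only positives act as anchors
        log_prob = sim[i] - torch.logsumexp(sim[i], dim=0)
        losses.append(-log_prob[pos_mask[i]].mean())
    return torch.stack(losses).mean()

# Example: 2 positive and 2 negative summary representations in one batch.
reps = torch.randn(4, 768)
labels = torch.tensor([1, 1, 0, 0])
print(in_batch_contrastive_loss(reps, labels))
```

Is this roughly the computation that "measuring representation similarities within the same batch" refers to?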
The ROUGE-L score drops mainly on XSum. The XSum dataset is constructed by taking the first sentence of each article as the reference summary. However, that first sentence sometimes contains information that is not covered by the rest of the document (and is not world knowledge), which encourages the model to hallucinate in order to achieve higher ROUGE-L scores. When consistency improves, the model hallucinates less and may therefore produce less word overlap with the reference.