FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs

Clarification on Prompt Usage and Special Tokens in LLARA-Passage Code #1129

Open jhy12 opened 1 week ago

jhy12 commented 1 week ago

Dear Authors,

Firstly, thank you for your insightful paper, "Llama2Vec: Unsupervised Adaptation of Large Language Models for Dense Retrieval." I found it highly informative and am excited about its potential applications.

While studying the paper and experimenting with the code provided, I noticed some discrepancies between the described methodology and the actual implementation in the BAAI/LLARA-passage model. Specifically, in Section 3.2 and the fine-tuning section of the paper, you mention:

For pretraining, the model uses the NEXT prompt to generate query embeddings and the SELF prompt for answer embeddings. The NEXT prompt is defined as "The next sentence is:" and the SELF prompt as "The input sentence is:". You note that for fine-tuning the formulation can be changed to N2N (NEXT-to-NEXT) or S2S (SELF-to-SELF), but you do not indicate that the NEXT and SELF prompts themselves would be altered.
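For concreteness, here is a minimal sketch of my reading of the paper's formulation; the helper names and the exact placement of the prompts are assumptions on my part, not code from your repository:

```python
# Sketch of the paper-style prompts as I read Section 3.2.
# Templates and helpers are illustrative assumptions, not FlagEmbedding code.
NEXT_PROMPT = "The next sentence is:"   # used for query embeddings
SELF_PROMPT = "The input sentence is:"  # used for answer/passage embeddings

def format_query(query: str) -> str:
    # Append the NEXT prompt so the embedding predicts what follows the query.
    return f"{query} {NEXT_PROMPT}"

def format_passage(passage: str) -> str:
    # Append the SELF prompt so the embedding represents the passage itself.
    return f"{passage} {SELF_PROMPT}"

# N2N (NEXT-to-NEXT) or S2S (SELF-to-SELF) would apply one prompt to both
# sides, e.g. format both the query and the passage with NEXT_PROMPT for N2N.
```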

However, in the code from your GitHub repository, the prompts and the use of special tokens differ.

Prompts in the Code:

- Query inputs: `prefix = '"'`, `suffix = '", predict the following passage within eight words: '`
- Passage inputs: `prefix = '"'`, `suffix = '", summarize the above passage within eight words: '`
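For comparison, a minimal sketch of how these prefixes and suffixes appear to be applied; the helper names are my own, not the repository's actual API:

```python
# Sketch of the code-style prompt construction; helper names are illustrative.
QUERY_PREFIX, QUERY_SUFFIX = '"', '", predict the following passage within eight words: '
PASSAGE_PREFIX, PASSAGE_SUFFIX = '"', '", summarize the above passage within eight words: '

def build_query_input(query: str) -> str:
    # Wrap the query in quotes and ask the model to predict the next passage.
    return QUERY_PREFIX + query + QUERY_SUFFIX

def build_passage_input(passage: str) -> str:
    # Wrap the passage in quotes and ask the model to summarize it.
    return PASSAGE_PREFIX + passage + PASSAGE_SUFFIX

print(build_query_input("what is dense retrieval?"))
# "what is dense retrieval?", predict the following passage within eight words:
```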

Use of Special Tokens:

The code also appends special tokens that are not mentioned in the paper.

Given these differences, I have a few questions:

  1. Prompt Discrepancy:

Why do the prompts in the code differ from those described in the paper? In the paper, the prompts are simply NEXT ("The next sentence is:") and SELF ("The input sentence is:"). The code uses more elaborate prompts like "predict the following passage within eight words" and "summarize the above passage within eight words". Could you please explain the rationale behind this change?

  2. Alignment with the Paper:

Is the code an exact implementation of the method described in the paper (i.e., Llama2Vec fine-tuned on MS MARCO), or are there modifications specific to the BAAI/LLARA-passage model? In particular, was the Llama2Vec fine-tuned on MS MARCO in your paper trained with the modified prompts and special tokens shown in the code but not detailed in the paper, or did fine-tuning on MS MARCO use the original NEXT ("The next sentence is:") and SELF ("The input sentence is:") prompts as described? To accurately reproduce the implementation detailed in the paper, how should I construct the prompts for queries and passages when fine-tuning on MS MARCO?

To conclude, should I use the modified prompts and special tokens from the code, or adhere to the original NEXT and SELF prompts mentioned in the paper? Are there any additional details about the prompts or special tokens that are important for replication but were not included in the paper? (For example, the paper says to use the NEXT and SELF prompts, but does not state that the prompts themselves should be changed during fine-tuning.)

Thank you for your time and for contributing such valuable research to the community.

545999961 commented 1 week ago
  1. In the paper, the terms "SELF" and "NEXT" are used merely as referential labels for the prompt contents they stand for.
  2. The N2N and S2S formulations are not employed during the fine-tuning process; rather, the fine-tuned models are capable of using them.
  3. BAAI/LLARA-passage is the passage version of Llama2Vec.
  4. Regarding the use of prompts and special tokens: these do not significantly influence the overall results. If fine-tuning is based on LLARA-pretrain, it is essential to use the same prompts as those used during the pretraining phase, as mentioned in the fine-tuning section. However, if you start from Llama and then carry out pretraining and fine-tuning yourself, you may use either the prompts in the code or those discussed in the paper. The key requirement is to keep the prompts consistent across pretraining, fine-tuning, and inference (see the sketch below).
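For example, one way to satisfy this consistency constraint is to define the prompt templates in a single place and reuse them at every stage; a minimal sketch, with illustrative names rather than FlagEmbedding's actual API:

```python
# Keep prompts consistent across pretraining, fine-tuning, and inference
# by defining them once. PROMPTS and apply_prompt are illustrative names.
PROMPTS = {
    "query":   ('"', '", predict the following passage within eight words: '),
    "passage": ('"', '", summarize the above passage within eight words: '),
}

def apply_prompt(kind: str, text: str) -> str:
    prefix, suffix = PROMPTS[kind]
    return prefix + text + suffix

# The same function is called at every stage, so the model always sees
# identically formatted inputs.
pretrain_passage = apply_prompt("passage", "Dense retrieval maps text to vectors.")
finetune_query   = apply_prompt("query", "what is dense retrieval?")
inference_query  = apply_prompt("query", "what is dense retrieval?")
```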