FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs

Clarification on Prompt Usage and Special Tokens in LLARA-Passage Code #1129

Open jhy12 opened 1 week ago

jhy12 commented 1 week ago

Dear Authors,

Firstly, thank you for your insightful paper, "Llama2Vec: Unsupervised Adaptation of Large Language Models for Dense Retrieval." I found it highly informative and am excited about its potential applications.

While studying the paper and experimenting with the code provided, I noticed some discrepancies between the described methodology and the actual implementation in the BAAI/LLARA-passage model. Specifically, in Section 3.2 and the fine-tuning section of the paper, you mention:

For pretraining, the model uses the NEXT prompt to generate query embeddings and the SELF prompt for answer embeddings. The NEXT prompt is defined as "The next sentence is:" and the SELF prompt as "The input sentence is:". You note that for fine-tuning the formulation can be changed to N2N (NEXT-to-NEXT) or S2S (SELF-to-SELF), but you do not indicate that the NEXT and SELF prompts themselves would be altered.
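For concreteness, here is a minimal sketch of my reading of the paper's formulation; the helper names and the exact placement of the prompts are assumptions on my part, not code from your repository:

```python
# Sketch of the paper-style prompts as I read Section 3.2.
# Templates and helpers are illustrative assumptions, not FlagEmbedding code.
NEXT_PROMPT = "The next sentence is:"   # used for query embeddings
SELF_PROMPT = "The input sentence is:"  # used for answer/passage embeddings

def format_query(query: str) -> str:
    # Append the NEXT prompt so the embedding predicts what follows the query.
    return f"{query} {NEXT_PROMPT}"

def format_passage(passage: str) -> str:
    # Append the SELF prompt so the embedding represents the passage itself.
    return f"{passage} {SELF_PROMPT}"

# N2N (NEXT-to-NEXT) or S2S (SELF-to-SELF) would apply one prompt to both
# sides, e.g. format both the query and the passage with NEXT_PROMPT for N2N.
```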

However, in the code from your GitHub repository, the prompts and the use of special tokens differ.

Prompts in the Code:

- Query inputs: `prefix = '"'`, `suffix = '", predict the following passage within eight words: '`
- Passage inputs: `prefix = '"'`, `suffix = '", summarize the above passage within eight words: '`
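For comparison, a minimal sketch of how these prefixes and suffixes appear to be applied; the helper names are my own, not the repository's actual API:

```python
# Sketch of the code-style prompt construction; helper names are illustrative.
QUERY_PREFIX, QUERY_SUFFIX = '"', '", predict the following passage within eight words: '
PASSAGE_PREFIX, PASSAGE_SUFFIX = '"', '", summarize the above passage within eight words: '

def build_query_input(query: str) -> str:
    # Wrap the query in quotes and ask the model to predict the next passage.
    return QUERY_PREFIX + query + QUERY_SUFFIX

def build_passage_input(passage: str) -> str:
    # Wrap the passage in quotes and ask the model to summarize it.
    return PASSAGE_PREFIX + passage + PASSAGE_SUFFIX

print(build_query_input("what is dense retrieval?"))
# "what is dense retrieval?", predict the following passage within eight words:
```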

Use of Special Tokens:

The code also appends special tokens that are not mentioned in the paper.

Given these differences, I have a few questions:

  1. Prompt Discrepancy:

Why do the prompts in the code differ from those described in the paper? In the paper, the prompts are simply NEXT ("The next sentence is:") and SELF ("The input sentence is:"). The code uses more elaborate prompts like "predict the following passage within eight words" and "summarize the above passage within eight words". Could you please explain the rationale behind this change?

  2. Alignment with the Paper:

Is the code an exact implementation of the method described in the paper (i.e., Llama2Vec fine-tuned on MS MARCO), or are there modifications specific to the BAAI/LLARA-passage model? In particular, was the Llama2Vec fine-tuned on MS MARCO in your paper trained with the modified prompts and special tokens shown in the code but not detailed in the paper, or did fine-tuning on MS MARCO use the original NEXT ("The next sentence is:") and SELF ("The input sentence is:") prompts as described? To accurately reproduce the implementation detailed in the paper, how should I construct the prompts for queries and passages when fine-tuning on MS MARCO?

To conclude, should I use the modified prompts and special tokens from the code, or adhere to the original NEXT and SELF prompts mentioned in the paper? Are there any additional details about the prompts or special tokens that are important for replication but were not included in the paper? (For example, the paper says to use the NEXT and SELF prompts, but does not state that the prompts themselves should be changed during fine-tuning.)

Thank you for your time and for contributing such valuable research to the community.

545999961 commented 1 week ago
  1. In the paper, the terms "SELF" and "NEXT" are used merely as referential labels for the prompt contents they stand for.
  2. The N2N and S2S formulations are not employed during the fine-tuning process; rather, the fine-tuned models are capable of using them.
  3. BAAI/LLARA-passage is the passage version of Llama2Vec.
  4. Regarding the use of prompts and special tokens: these do not significantly influence the overall results. If fine-tuning is based on LLARA-pretrain, it is essential to use the same prompts as those used during the pretraining phase, as mentioned in the fine-tuning section. However, if you start from Llama and then carry out pretraining and fine-tuning yourself, you may use either the prompts in the code or those discussed in the paper. The key requirement is to keep the prompts consistent across pretraining, fine-tuning, and inference (see the sketch below).
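For example, one way to satisfy this consistency constraint is to define the prompt templates in a single place and reuse them at every stage; a minimal sketch, with illustrative names rather than FlagEmbedding's actual API:

```python
# Keep prompts consistent across pretraining, fine-tuning, and inference
# by defining them once. PROMPTS and apply_prompt are illustrative names.
PROMPTS = {
    "query":   ('"', '", predict the following passage within eight words: '),
    "passage": ('"', '", summarize the above passage within eight words: '),
}

def apply_prompt(kind: str, text: str) -> str:
    prefix, suffix = PROMPTS[kind]
    return prefix + text + suffix

# The same function is called at every stage, so the model always sees
# identically formatted inputs.
pretrain_passage = apply_prompt("passage", "Dense retrieval maps text to vectors.")
finetune_query   = apply_prompt("query", "what is dense retrieval?")
inference_query  = apply_prompt("query", "what is dense retrieval?")
```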