FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License
finetune llm-embedder #692

Open QuangTQV opened 3 months ago

QuangTQV commented 3 months ago

Dear author, I find the documentation on fine-tuning limited. Could you explain the following points? [image]

Corpus Data Format: Could you please elaborate on the format of the corpus data? I am having difficulty grasping this concept. Could you provide an example to illustrate how the corpus data should be structured?

Key and Key Index in Evaluation File: Within the evaluation file, what specifically do the terms "key" and "key index" refer to?

Understanding "Answer" in Evaluation File: In the evaluation file, what does the term "answer" represent? Where is this data sourced from?

Evaluation label: According to the picture, the evaluation data has no neg and pos fields, only optional indices for neg and pos. How does the model perform evaluation without labels?

image

Key_Template for Retrieve Tool in Fine-Tuning: Regarding the finetune process for the retrieve tool, what exactly is meant by "key_template"? I encountered a reference in the documentation mentioning "How to concatenate columns in the corpus to form one key," but I'm struggling to comprehend this aspect.

I would be immensely grateful if you could shed light on these matters or direct me to resources that could offer further clarity. Your expertise in this field would undoubtedly be invaluable to my understanding.

Thank you very much for considering my questions. I eagerly await your response and look forward to enhancing my comprehension of these crucial details.

Warm regards,

QuangTQV commented 3 months ago

Please help me, thanks

namespace-Pt commented 3 months ago

Hi,

Corpus Data

The corpus is a jsonl file. Each row is a json object, representing a document, which usually consists of:

Key Template

The key_template is used to combine the textual fields of each document into a single piece of text. For example, the default key_template ({title} {text}) concatenates the title and the text of each document, separated by a space.

In the tool learning corpus, there is no title, so the key_template is {text}, meaning that only the "text" field is used.
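
As a rough illustration of the idea (not the actual FlagEmbedding code), the sketch below applies a key_template to jsonl rows with Python's str.format. The two corpus rows are made up, and the helper name build_key is hypothetical; only the "title" and "text" fields and the two templates come from the explanation above.

```python
import json

# Two made-up corpus rows in jsonl form (one JSON object per line).
corpus_jsonl = """\
{"title": "Paris", "text": "Paris is the capital of France."}
{"title": "Berlin", "text": "Berlin is the capital of Germany."}
"""


def build_key(document: dict, key_template: str) -> str:
    """Fill the template placeholders with the document's fields.

    Hypothetical helper: fields not named in the template are ignored.
    """
    return key_template.format(**document)


# Parse each non-empty line as one document.
documents = [json.loads(line) for line in corpus_jsonl.splitlines() if line.strip()]

# Default template: title and text joined by a space.
default_keys = [build_key(doc, "{title} {text}") for doc in documents]

# Template for a corpus without titles: only the text field is used.
text_only_keys = [build_key(doc, "{text}") for doc in documents]

print(default_keys[0])
print(text_only_keys[0])
```

With "{text}" the title is simply dropped, so the same corpus file can yield different keys depending on the template.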

Key and Key Index in Evaluation

The key field is a list of texts retrieved for the query from the corpus. The key_index field holds the indices of the retrieved texts in the corpus. Both fields are also generated automatically if you specify --metrics collate_key. See https://github.com/FlagOpen/FlagEmbedding/blob/fec9058948215924b924104d48cbf01e2ab90865/FlagEmbedding/llm_embedder/src/retrieval/metrics.py#L246
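
To make the relationship between the two fields concrete, here is a minimal sketch that only illustrates it and is not the collate_key implementation. The corpus texts, the "query" field name, and the helper attach_retrieved_keys are assumptions for the example.

```python
import json

# Made-up corpus keys; the position in the list is the corpus row index.
corpus_keys = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Rome is the capital of Italy.",
]


def attach_retrieved_keys(record: dict, retrieved_indices: list[int]) -> dict:
    """Return a copy of the record with key_index and key filled in.

    Hypothetical helper: key_index holds corpus row indices, and key holds
    the texts found at those indices, in the same order.
    """
    collated = dict(record)
    collated["key_index"] = list(retrieved_indices)
    collated["key"] = [corpus_keys[i] for i in retrieved_indices]
    return collated


# "query" is an assumed field name for the evaluation record.
record = {"query": "What is the capital of Italy?"}
collated = attach_retrieved_keys(record, [2, 0])

print(json.dumps(collated, indent=2))
```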

Evaluation Label

The pos_index field is usually the label. It contains the indices of the positive documents w.r.t. the corpus (i.e. the row indices of all positive documents). On NQ, there is no pre-defined pos_index. In that case, the evaluation is based on whether the retrieved documents contain the corresponding answer. See https://github.com/FlagOpen/FlagEmbedding/blob/fec9058948215924b924104d48cbf01e2ab90865/FlagEmbedding/llm_embedder/src/retrieval/metrics.py#L234
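
The two evaluation modes can be sketched roughly as follows. This is an illustration under assumptions, not the code in metrics.py: the function names are hypothetical, the data is made up, and the case-insensitive substring match is just one simple way to check whether a retrieved text contains an answer.

```python
def recall_at_k(key_index, pos_index, k):
    """Label-based: fraction of positive documents found in the top-k indices."""
    top_k = set(key_index[:k])
    hits = sum(1 for idx in pos_index if idx in top_k)
    return hits / len(pos_index)


def answer_hit_at_k(keys, answers, k):
    """Answer-based: True if any top-k retrieved text contains any answer.

    Assumption: case-insensitive substring matching.
    """
    top_k_texts = [text.lower() for text in keys[:k]]
    return any(answer.lower() in text for answer in answers for text in top_k_texts)


# Mode 1: pos_index is available, so retrieved indices are compared to it.
key_index = [2, 0, 3]   # retrieved corpus row indices, best first
pos_index = [0, 1]      # row indices of the positive documents
print(recall_at_k(key_index, pos_index, k=3))

# Mode 2: no pos_index (as on NQ), so retrieved texts are checked for the answer.
keys = [
    "Berlin is the capital of Germany.",
    "Paris is the capital of France.",
]
answers = ["paris"]
print(answer_hit_at_k(keys, answers, k=1))
print(answer_hit_at_k(keys, answers, k=2))
```

In the second mode the "answer" field plays the role of the label, which is why no pos or neg documents are needed in the evaluation file.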