FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs

Evaluation of these embedding models #207

Open Rishav-hub opened 1 year ago

Rishav-hub commented 1 year ago

I have fine-tuned a couple of BAAI/bge-base-en models on my own dataset for the retrieval task.

I have a train and a validation split. Now I need a metric to determine which model performed better.

Just as regression has MSE and classification has the precision-recall curve, which metric can be used to evaluate the retrieved document chunks?

Can you provide me with a script for it?

olivierr42 commented 1 year ago

Check out BEIR for ideas. Typically, you evaluate your metrics over the top-k retrieved results, precision and recall being two of the notable ones. Normalized discounted cumulative gain (NDCG) is also popular.
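
For a self-contained starting point, here is a minimal sketch that computes precision@k, recall@k and NDCG@k by hand; the `qrels` and `results` dictionaries are placeholders you would fill from your own validation split:

```python
import math

def evaluate_retrieval(qrels, results, k=10):
    """Compute precision@k, recall@k and NDCG@k averaged over queries.

    qrels:   {query_id: set of relevant doc_ids}       (binary relevance)
    results: {query_id: list of doc_ids, ranked best-first}
    """
    precisions, recalls, ndcgs = [], [], []
    for qid, relevant in qrels.items():
        ranked = results.get(qid, [])[:k]
        hits = [1 if doc_id in relevant else 0 for doc_id in ranked]

        precisions.append(sum(hits) / k)
        recalls.append(sum(hits) / max(len(relevant), 1))

        # DCG with binary gains, normalized by the ideal DCG for this query
        dcg = sum(h / math.log2(rank + 2) for rank, h in enumerate(hits))
        idcg = sum(1 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
        ndcgs.append(dcg / idcg if idcg > 0 else 0.0)

    n = len(qrels)
    return {f"precision@{k}": sum(precisions) / n,
            f"recall@{k}": sum(recalls) / n,
            f"ndcg@{k}": sum(ndcgs) / n}

# Toy example: query q1 has two relevant docs, d3 and d7
qrels = {"q1": {"d3", "d7"}}
results = {"q1": ["d3", "d1", "d7", "d2"]}
print(evaluate_retrieval(qrels, results, k=4))
```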

staoxiao commented 1 year ago

You can refer to BEIR for the metrics widely used in retrieval tasks. We also provide a script to evaluate a model on your own data: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#5-evaluate-model-on-msmarco .

Rishav-hub commented 1 year ago

One more question: in pretraining we only pass the train dataset. How can we evaluate the model after pretraining? I need to understand the concept behind it.

staoxiao commented 1 year ago

There is no specific metric for the pre-trained model itself. You can fine-tune it, and compare the fine-tuned performance of the original model against that of the pre-trained model.

Rishav-hub commented 1 year ago

Okay, I have fine-tuned my model on a training set and now I need to evaluate it on my own validation set. Also, why are we not passing any validation dataset path when fine-tuning? LINK

Or is there any script available to get the precision or recall of my model on an evaluation set?

staoxiao commented 1 year ago

The downstream tasks differ between users, so we don't run an evaluation script during training. There is an example of computing recall on the MS MARCO dataset: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#5-evaluate-model-on-msmarco, and you can follow it to evaluate on your own dataset.
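
If you want a lighter-weight version of that example on your own validation split, the sketch below follows the same pattern: encode queries and passages with `FlagModel` from this repo, rank by inner product, and compute recall@k. The model path, query instruction comment, and toy data are placeholders, not the exact setup of the MS MARCO script:

```python
import numpy as np
from FlagEmbedding import FlagModel

# Placeholder validation data: replace with your own queries, corpus and labels.
queries = {"q1": "how to fine-tune bge models"}
corpus = {"d1": "Guide to fine-tuning BGE embedding models ...",
          "d2": "Unrelated passage about cooking ..."}
qrels = {"q1": {"d1"}}          # relevant doc ids per query

model = FlagModel(
    "path/to/your/finetuned-bge-base-en",   # or "BAAI/bge-base-en"
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
    use_fp16=True,
)

q_ids, q_texts = list(queries.keys()), list(queries.values())
d_ids, d_texts = list(corpus.keys()), list(corpus.values())

q_emb = model.encode_queries(q_texts)   # (num_queries, dim)
d_emb = model.encode(d_texts)           # (num_docs, dim)

scores = q_emb @ d_emb.T                # inner-product similarity
results = {qid: [d_ids[j] for j in np.argsort(-scores[i])]
           for i, qid in enumerate(q_ids)}

# recall@k over all labeled queries
k = 2
recall_at_k = np.mean([
    len(set(results[qid][:k]) & qrels[qid]) / len(qrels[qid])
    for qid in qrels
])
print(f"recall@{k}: {recall_at_k:.3f}")
```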

Rishav-hub commented 1 year ago

While going through your pretraining script, I found that MLM is used for pretraining. Can we use perplexity as a metric for evaluation?

staoxiao commented 1 year ago

Perplexity is an optional metric, but it may not be very relevant to the downstream tasks (e.g., retrieval).
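
If you still want to track it, a rough estimate of MLM perplexity is the exponential of the masked-LM loss on held-out text. Here is a minimal sketch with Hugging Face `transformers`; the checkpoint path and sample texts are placeholders, and the random masking makes the number vary slightly between runs:

```python
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling

# Placeholders: point these at your pre-trained checkpoint and held-out texts.
model_path = "path/to/your/pretrained-checkpoint"
texts = ["A held-out sentence from your domain.", "Another held-out sentence."]

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForMaskedLM.from_pretrained(model_path).eval()
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

# The collator randomly masks 15% of tokens and builds the labels tensor.
encodings = [tokenizer(t, truncation=True, max_length=512) for t in texts]
batch = collator(encodings)

with torch.no_grad():
    loss = model(**batch).loss   # mean cross-entropy over masked positions only

print(f"MLM loss: {loss.item():.3f}, perplexity: {math.exp(loss.item()):.2f}")
```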