beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use; evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Does this framework support summarizing model performance across datasets (or comparing models on a single dataset), and does it support dynamic benchmarking? #1

Closed · svjack closed this issue 3 years ago

svjack commented 3 years ago

I think this benchmark could support choosing the best model from a list by comparing their performance measurements on a single dataset. This requires the datasets to have the same interface.

It could also support model switching: choosing which model to use based on the semantic features of the input (sometimes BM25, sometimes SBERT), to make the final results more consistent.

This would make the project not only a benchmark but also a meta-ensemble framework that combines models to improve final performance on a single dataset across different features.
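To make this concrete, here is a rough sketch of a best-model selection loop over the evaluation API (the candidate checkpoint names are placeholders, and I am assuming the `EvaluateRetrieval` interface from the repository examples):

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Load one dataset through the shared corpus/queries/qrels interface.
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Candidate SBERT checkpoints (placeholder names).
candidates = ["msmarco-distilbert-base-v2", "distilroberta-base-msmarco-v2"]

best_name, best_score = None, -1.0
for name in candidates:
    model = DRES(models.SentenceBERT(name), batch_size=128)
    retriever = EvaluateRetrieval(model, score_function="cos_sim")
    results = retriever.retrieve(corpus, queries)
    ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
    if ndcg["NDCG@10"] > best_score:
        best_name, best_score = name, ndcg["NDCG@10"]

print(f"Best model: {best_name} (NDCG@10 = {best_score:.4f})")
```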

svjack commented 3 years ago

This is the same suggestion I made for EasyNMT: when you give the user many choices, you should also give some advice or point to "the best" one.

svjack commented 3 years ago

Just like pmlb does (https://github.com/EpistasisLab/pmlb/blob/master/examples/fetch_nearest_datasets.ipynb): it provides dataset suggestions based on a given dataset. I am looking for the same function for NLP tasks (dataset comparison or model comparison) in this project.

svjack commented 3 years ago

I think ANLI (https://github.com/facebookresearch/anli, https://arxiv.org/pdf/1910.14599.pdf) is a framework that supports this function in Natural Language Understanding: they "combine SNLI+MNLI+FEVER-NLI and up-sample different rounds of ANLI to train the models." You could try this in the IR domain to get a similar function.
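A rough sketch of what such a training mixture could look like in IR, assuming the BEIR data loader; the dataset names and up-sampling factors here are made up, and not every dataset ships a train split:

```python
import random

from beir import util
from beir.datasets.data_loader import GenericDataLoader

# Hypothetical mixing recipe: dataset name -> up-sampling factor.
mixture = {"msmarco": 1, "nfcorpus": 4, "scifact": 4}

combined = []
for name, factor in mixture.items():
    url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{name}.zip"
    data_path = util.download_and_unzip(url, "datasets")
    corpus, queries, qrels = GenericDataLoader(data_path).load(split="train")
    # Build (query, positive passage) pairs from the relevance judgments.
    pairs = [(queries[qid], corpus[doc_id]["text"])
             for qid, rels in qrels.items() for doc_id in rels]
    combined.extend(pairs * factor)  # up-sample the smaller datasets

random.shuffle(combined)
```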

svjack commented 3 years ago

Also see the idea in Mastering the Dungeon (https://github.com/facebookresearch/ParlAI/tree/mastering_the_dungeon/projects/mastering_the_dungeon, https://arxiv.org/pdf/1711.07950.pdf).

svjack commented 3 years ago

These references are all about dynamic benchmarks.

thakur-nandan commented 3 years ago

> I think this benchmark could support choosing the best model from a list by comparing their performance measurements on a single dataset. This requires the datasets to have the same interface.
>
> It could also support model switching: choosing which model to use based on the semantic features of the input (sometimes BM25, sometimes SBERT), to make the final results more consistent.
>
> This would make the project not only a benchmark but also a meta-ensemble framework that combines models to improve final performance on a single dataset across different features.

Yes, our experiments with all the datasets are coming out soon in our paper, and I will add the performance scores once the pre-print is out. And yes, all the datasets share the same interface and can be downloaded from here: https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/.
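For example, since every dataset unzips to the same corpus/queries/qrels layout, a loop like this should work (dataset names picked arbitrarily; a sketch against the current loader API):

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader

base = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets"
for name in ["scifact", "nfcorpus", "fiqa"]:
    data_path = util.download_and_unzip(f"{base}/{name}.zip", "datasets")
    corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")
    print(name, len(corpus), "docs", len(queries), "queries")
```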

thakur-nandan commented 3 years ago

> This is the same suggestion I made for EasyNMT: when you give the user many choices, you should also give some advice or point to "the best" one.

Yes, I also plan to add suggestions on which model performs best on a task, along with more details. For now, I would suggest BM25 (lexical) and the distilroberta-base-msmarco-v2 SBERT (dense) model as strong models you could use.
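As a rough sketch, the BM25 baseline can be run like this (assuming a local Elasticsearch instance and corpus/queries/qrels loaded as above):

```python
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.lexical import BM25Search as BM25

# Index the corpus into a local Elasticsearch instance and retrieve.
model = BM25(index_name="scifact", hostname="localhost", initialize=True)
retriever = EvaluateRetrieval(model)
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
```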

svjack commented 3 years ago

> > This is the same suggestion I made for EasyNMT: when you give the user many choices, you should also give some advice or point to "the best" one.
>
> Yes, I also plan to add suggestions on which model performs best on a task, along with more details. For now, I would suggest BM25 (lexical) and the distilroberta-base-msmarco-v2 SBERT (dense) model as strong models you could use.

I have been reviewing the code in your examples directory. I noticed it also provides some training examples, such as using an SBERT cross-encoder to filter and save data first and then training on the filtered data. Since SBERT is also a UKP Lab project, I would like to know about future support for training with other models, such as Google's Universal Sentence Encoder (I know Google has not released its training code). Do you plan to support training with user-defined models behind the same interface?

svjack commented 3 years ago

This would make the project not only a benchmark but also a toolkit for improving models.
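To make the idea concrete, here is a rough sketch of what a shared interface could look like, based on my reading of the dense search class in the examples (the class name and the random embeddings are placeholders for a real encoder such as USE-QA):

```python
from typing import Dict, List

import numpy as np

from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

class CustomEncoder:
    """Any encoder should plug in if it exposes these two methods."""

    def encode_queries(self, queries: List[str], batch_size: int = 16, **kwargs) -> np.ndarray:
        # Placeholder: replace with a real query encoder.
        return np.random.rand(len(queries), 768).astype(np.float32)

    def encode_corpus(self, corpus: List[Dict[str, str]], batch_size: int = 8, **kwargs) -> np.ndarray:
        # Each document is a dict with "title" and "text" keys.
        return np.random.rand(len(corpus), 768).astype(np.float32)

retriever = EvaluateRetrieval(DRES(CustomEncoder(), batch_size=16))
```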

thakur-nandan commented 3 years ago

> I have been reviewing the code in your examples directory. I noticed it also provides some training examples, such as using an SBERT cross-encoder to filter and save data first and then training on the filtered data. Since SBERT is also a UKP Lab project, I would like to know about future support for training with other models, such as Google's Universal Sentence Encoder (I know Google has not released its training code). Do you plan to support training with user-defined models behind the same interface?

Yes, we provide training code and examples for the SBERT bi-encoder for retriever training, and in the future we wish to add training code for the SBERT cross-encoder for query generation and filtering as well. In our experiments, we find SBERT models outperform DPR and USE-QA, and they are convenient since SBERT is well documented. I won't be able to add training methods for models other than SBERT for now.
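Roughly, bi-encoder training looks like this with the TrainRetriever wrapper (a minimal sketch: the checkpoint name and hyperparameters are placeholders, and it assumes a train split loaded as above):

```python
from sentence_transformers import SentenceTransformer, losses

from beir.retrieval.train import TrainRetriever

# corpus, queries, qrels loaded from a dataset's train split.
model = SentenceTransformer("distilroberta-base")  # placeholder checkpoint
retriever = TrainRetriever(model=model, batch_size=16)

train_samples = retriever.load_train(corpus, queries, qrels)
train_dataloader = retriever.prepare_train(train_samples, shuffle=True)
train_loss = losses.MultipleNegativesRankingLoss(model=retriever.model)

# No dev set in this sketch, so use the dummy evaluator from the examples.
ir_evaluator = retriever.load_dummy_evaluator()

retriever.fit(train_objectives=[(train_dataloader, train_loss)],
              evaluator=ir_evaluator,
              epochs=1,
              output_path="output/trained-bi-encoder",
              use_amp=True)
```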

thakur-nandan commented 3 years ago

Closing the Issue due to no recent activity!