beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Question about details/scripts for generating the datasets #75

Closed: narayanacharya6 closed this issue 2 years ago

narayanacharya6 commented 2 years ago

Given that the framework tries to get all datasets into a single consistent structure, I am trying to find details/scripts about how some of the available datasets were generated from their raw form. I could not find it in this repo.

Also, on a related note about the datasets, could you please clarify this paragraph from the paper:

Finally, we notice that there can be a strong lexical bias present in datasets included within the
benchmark, likely as lexical models are pre-dominantly used during the annotation or creation of
datasets. This can give an unfair disadvantage to non-lexical approaches. We analyze this for the
TREC-COVID [63] dataset: We manually annotate the missing relevance judgements for the tested
systems and see a significant performance improvement for non-lexical approaches.

Is the TREC-COVID dataset available via BEIR with these manual annotations? If yes, is a raw version of the TREC-COVID dataset also available via BEIR? If no, are these manual annotations available somewhere I can access?
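
To make the quoted paragraph concrete, here is a minimal sketch of how missing relevance judgements can penalize a system. It assumes, as is common in IR evaluation, that unjudged documents are scored as non-relevant. The toy data, the document IDs, and the helper names `precision_at_k` and `judged_at_k` are all hypothetical; this is not the paper's evaluation code.

```python
def precision_at_k(ranking, qrels, k):
    """Precision@k where documents missing from qrels count as non-relevant."""
    top = ranking[:k]
    return sum(1 for doc_id in top if qrels.get(doc_id, 0) > 0) / k


def judged_at_k(ranking, qrels, k):
    """Fraction of the top-k documents that have any judgement at all."""
    top = ranking[:k]
    return sum(1 for doc_id in top if doc_id in qrels) / k


# Hypothetical judgements for one query: doc_id -> relevance label.
qrels = {"d1": 1, "d2": 0, "d3": 1}

# Hypothetical runs: the lexical run only retrieves judged documents,
# while the dense run surfaces d7 and d8, which were never judged.
lexical_run = ["d1", "d3", "d2"]
dense_run = ["d1", "d7", "d8"]

p_lexical = precision_at_k(lexical_run, qrels, 3)
p_dense_before = precision_at_k(dense_run, qrels, 3)
judged_dense = judged_at_k(dense_run, qrels, 3)

# Suppose an annotator later judges d7 as relevant and d8 as not relevant.
extended_qrels = {**qrels, "d7": 1, "d8": 0}
p_dense_after = precision_at_k(dense_run, extended_qrels, 3)
```

In this toy case the dense run's score rises once the missing judgements are filled in, while the lexical run is unaffected because all of its results were already judged.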

thakur-nandan commented 2 years ago

Hi @narayanacharya6,

The raw form of the datasets is not made available, as that would defeat the purpose. BEIR provides a single unified format for all datasets. As all of these datasets are publicly available, you can go to the original repository for each dataset and get the raw form there.

Yes, the annotated TREC-COVID dataset is also available here: https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid-beir.zip, as mentioned in Table 7 (Appendix) in the arXiv version of the paper: https://arxiv.org/abs/2104.08663.

Kind Regards, Nandan Thakur

narayanacharya6 commented 2 years ago

I was trying to find out how you converted the raw datasets into the single unified format. Essentially, are there scripts that convert these datasets from their raw format into the processed datasets available via BEIR?

Also, thanks for pointing me to the trec-covid-beir version of the dataset!

thakur-nandan commented 2 years ago

Hi @narayanacharya6,

Sadly, I no longer have the conversion scripts. Essentially, the conversion is easy: I took each original dataset and wrote simple preprocessing scripts to convert it into a corpus dict with doc_id as the key and title and text as fields, and did the same for the queries, with query_id as the key.

Kind Regards, Nandan Thakur
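
Since the original scripts are not available, the following is only a rough sketch of what a conversion like the one described above might look like. The raw field names (`id`, `headline`, `body`, `qid`, `question`) and the function name `to_beir_dicts` are hypothetical stand-ins for whatever a given raw dataset uses. Mapping each query ID directly to its query text is also an assumption, since the comment only says the queries are handled similarly.

```python
def to_beir_dicts(raw_docs, raw_queries):
    """Convert raw records into corpus and queries dicts.

    corpus:  doc_id -> {"title": ..., "text": ...}
    queries: query_id -> query text (assumed layout)
    """
    corpus = {}
    for doc in raw_docs:
        # IDs are cast to str so that integer and string IDs are treated alike.
        corpus[str(doc["id"])] = {
            "title": doc.get("headline", ""),  # some raw datasets have no title
            "text": doc["body"],
        }

    queries = {str(q["qid"]): q["question"] for q in raw_queries}
    return corpus, queries


# Hypothetical raw records, standing in for a dataset's original format.
raw_docs = [
    {"id": 1, "headline": "Coronavirus overview", "body": "Coronaviruses are a family of viruses."},
    {"id": 2, "body": "A document without a title."},
]
raw_queries = [{"qid": 10, "question": "what is a coronavirus?"}]

corpus, queries = to_beir_dicts(raw_docs, raw_queries)
```

Each raw dataset would need its own small adapter of this kind, because the field names and file formats differ from source to source.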

narayanacharya6 commented 2 years ago

No worries. Closing this issue.