Closed narayanacharya6 closed 2 years ago
Hi @narayanacharya6,
The raw form of the datasets is not made available, as that would defeat the purpose: BEIR provides a single unified format for all datasets. Since all of these datasets are publicly available, you can get the raw form of each one from its original repository.
Yes, the TREC-COVID annotated dataset is also available here: https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid-beir.zip, as mentioned in Table 7 (Appendix) of the arXiv version of the paper: https://arxiv.org/abs/2104.08663.
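For readers who download and unzip that archive, here is a minimal sketch of reading it with only the Python standard library. It assumes the folder follows the layout described in the BEIR README (a `corpus.jsonl` and `queries.jsonl` with an `_id` field, plus `qrels/<split>.tsv` with a `query-id`, `corpus-id`, `score` header); the function name `load_beir_folder` is hypothetical, and the `beir` package ships its own data loader, which is not reproduced here.

```python
import csv
import json
from pathlib import Path


def load_beir_folder(folder, split="test"):
    """Read a BEIR-style folder into (corpus, queries, qrels) dicts.

    Assumed layout (check it against the unzipped archive):
      corpus.jsonl       one JSON object per line: _id, title, text
      queries.jsonl      one JSON object per line: _id, text
      qrels/<split>.tsv  tab-separated with header: query-id, corpus-id, score
    """
    folder = Path(folder)
    corpus, queries, qrels = {}, {}, {}

    with open(folder / "corpus.jsonl", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                doc = json.loads(line)
                corpus[doc["_id"]] = {
                    "title": doc.get("title", ""),
                    "text": doc.get("text", ""),
                }

    with open(folder / "queries.jsonl", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                query = json.loads(line)
                queries[query["_id"]] = query["text"]

    with open(folder / "qrels" / f"{split}.tsv", encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # qrels maps query_id -> {doc_id: relevance score}
            qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])

    return corpus, queries, qrels
```

Usage would look like `corpus, queries, qrels = load_beir_folder("trec-covid")`, where the folder name after unzipping is an assumption.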
Kind Regards, Nandan Thakur
I was trying to find out how you converted the raw datasets into the single unified format. Essentially, are there scripts that convert these datasets from their raw format into the processed datasets available via BEIR?
Also, thanks for pointing me to the trec-covid-beir version of the dataset!
Hi @narayanacharya6,
Sadly, I no longer have the conversion scripts. Essentially, the conversion is easy: I took each original dataset and wrote simple preprocessing scripts to convert it into a corpus dict with doc_id as the key and title and text as fields, and similarly a queries dict with query_id as the key.
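Since the original scripts are no longer available, here is a minimal sketch of what such a preprocessing step could look like. It is not the original conversion code. The raw tab-separated input, its column names, and the helper names `build_corpus` and `build_queries` are all hypothetical; mapping each query_id directly to its query text is also an assumption about what "similar for the query" means.

```python
import csv
import io


def build_corpus(rows):
    """Map doc_id -> {"title": ..., "text": ...} from raw records."""
    corpus = {}
    for row in rows:
        corpus[row["doc_id"]] = {
            # Some raw datasets have no titles; fall back to an empty string.
            "title": (row.get("title") or "").strip(),
            "text": (row.get("text") or "").strip(),
        }
    return corpus


def build_queries(rows):
    """Map query_id -> query text from raw records."""
    return {row["query_id"]: row["text"].strip() for row in rows}


# Hypothetical raw data; a real dataset would be read from its own files.
raw_docs = io.StringIO(
    "doc_id\ttitle\ttext\n"
    "d1\tCoronavirus overview\tCoronaviruses are a family of RNA viruses.\n"
    "d2\t\tA document without a title.\n"
)
raw_queries = io.StringIO("query_id\ttext\nq1\twhat is a coronavirus\n")

corpus = build_corpus(csv.DictReader(raw_docs, delimiter="\t"))
queries = build_queries(csv.DictReader(raw_queries, delimiter="\t"))
```

Each raw dataset has its own schema, so in practice only the reading step changes per dataset while the target dicts stay the same.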
Kind Regards, Nandan Thakur
No worries. Closing this issue.
Given that the framework tries to get all datasets into a single consistent structure, I am trying to find details/scripts about how some of the available datasets were generated from their raw form. I could not find it in this repo.
Also, somewhat related to datasets, can you please clarify this paragraph from the paper:
Is the TREC-COVID dataset available via BEIR with these manual annotations or not? If yes, is there a raw version of the TREC-COVID dataset available via BEIR? If no, are these manual annotations available somewhere I can access?