Pre-processing code for missing datasets

beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

http://beir.ai

Apache License 2.0


Pre-processing code for missing datasets #86

Closed: memray closed this issue 2 years ago

memray commented 2 years ago

Hi there,

I wonder if the code for preprocessing the datasets that are not provided could be released. For example, I downloaded bioasq and signal1m following the instructions, but it's not clear to me how to convert the raw datasets to corpus.jsonl, queries.jsonl, and qrels/{train/dev/test}.tsv the same way you did. I think this is critical for reproducing your results and for fair benchmarking.

Thank you, Rui
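For readers with the same question, here is a minimal sketch of writing already-parsed records into the folder layout named above (corpus.jsonl, queries.jsonl, qrels/<split>.tsv). It is not the maintainers' preprocessing code, which this thread does not show. The function name `write_beir_dataset` and the tuple shapes of its inputs are assumptions made for illustration; the field names (`_id`, `title`, `text`) and the qrels header follow the layout of the public BEIR datasets. Parsing the raw bioasq or signal1m dumps is dataset-specific and is not covered.

```python
import csv
import json
from pathlib import Path


def write_beir_dataset(out_dir, docs, queries, qrels, split="test"):
    """Write records in the BEIR folder layout (hypothetical helper).

    docs:    iterable of (doc_id, title, text)
    queries: iterable of (query_id, text)
    qrels:   iterable of (query_id, doc_id, score)
    """
    out = Path(out_dir)
    (out / "qrels").mkdir(parents=True, exist_ok=True)

    # corpus.jsonl: one JSON object per line with _id, title, text
    with open(out / "corpus.jsonl", "w", encoding="utf-8") as f:
        for doc_id, title, text in docs:
            record = {"_id": str(doc_id), "title": title, "text": text}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # queries.jsonl: one JSON object per line with _id, text
    with open(out / "queries.jsonl", "w", encoding="utf-8") as f:
        for query_id, text in queries:
            record = {"_id": str(query_id), "text": text}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # qrels/<split>.tsv: tab-separated, with a header row
    qrels_path = out / "qrels" / f"{split}.tsv"
    with open(qrels_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        for query_id, doc_id, score in qrels:
            writer.writerow([str(query_id), str(doc_id), int(score)])

    return out
```

Calling it once per split (for example `split="train"`, then `split="test"`) rewrites corpus.jsonl and queries.jsonl each time, so in practice the corpus and query records passed in should be the full sets on every call.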

thakur-nandan commented 2 years ago

Hi @memray,

You can send an email to nandant@gmail.com, and I can send you the datasets privately. Please note that you are responsible for accepting the licenses of all private datasets.

Kind Regards, Nandan Thakur

memray commented 2 years ago

Thanks @NThakur20 ! I'll reach out to you via email later.

gsgoncalves commented 2 years ago

Hi @NThakur20, I sent an email as well. It would be great if you could share the details for the html2text initialization for TREC News, and which Anserini tweet indexing options were used for Signal AI. Thanks!

Cyril-JZ commented 1 year ago

Hi @thakur-nandan,

I also sent an email to kindly request access to the TREC-News dataset TGZ files. I assure you that the dataset will be used solely for academic purposes. I greatly appreciate your assistance and support!

Thanks! im.jzfeng@gmail.com

thakur-nandan commented 1 year ago

Hi @gsgoncalves and @Cyril-JZ, you can find the private BEIR datasets here (all preprocessed): https://drive.google.com/drive/folders/1CgDO-KmQQMpGEGeD3R20ZgTTM008xix9?usp=sharing.

Hope it helps!

Thanks!
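After downloading one of the preprocessed folders, a quick sanity check can confirm that it has the expected layout before running an evaluation. The sketch below uses only the Python standard library and assumes the folder contains corpus.jsonl, queries.jsonl, and qrels/<split>.tsv with a header row, as in the public BEIR datasets. The function names `load_beir_folder` and `dangling_qrels` are hypothetical; the BEIR package ships its own loader (`GenericDataLoader`), which is the usual route for actual experiments.

```python
import csv
import json
from pathlib import Path


def load_beir_folder(data_dir, split="test"):
    """Read a BEIR-style folder into plain dictionaries (hypothetical helper)."""
    data = Path(data_dir)
    corpus, queries, qrels = {}, {}, {}

    with open(data / "corpus.jsonl", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                rec = json.loads(line)
                corpus[rec["_id"]] = {
                    "title": rec.get("title", ""),
                    "text": rec.get("text", ""),
                }

    with open(data / "queries.jsonl", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                rec = json.loads(line)
                queries[rec["_id"]] = rec["text"]

    with open(data / "qrels" / f"{split}.tsv", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader, None)  # skip the "query-id / corpus-id / score" header
        for row in reader:
            if len(row) < 3:
                continue  # ignore blank or malformed rows
            query_id, doc_id, score = row[0], row[1], int(row[2])
            qrels.setdefault(query_id, {})[doc_id] = score

    return corpus, queries, qrels


def dangling_qrels(corpus, queries, qrels):
    """Return (query_id, doc_id) pairs whose query or document is missing."""
    missing = []
    for query_id, judged in qrels.items():
        for doc_id in judged:
            if query_id not in queries or doc_id not in corpus:
                missing.append((query_id, doc_id))
    return missing
```

An empty list from `dangling_qrels` means every relevance judgment points at a query and a document that are present in the folder.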

Cyril-JZ commented 1 year ago

Thanks for your prompt reply! It helps me a lot!