allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

Add BioASQ dataset to the list of supported BEIR datasets #250

Open MathVast opened 9 months ago

MathVast commented 9 months ago

Hi @seanmacavaney, I would like to use the BioASQ dataset for an experiment, and I stumbled across this on the GitHub repo of the BEIR paper (beir-cellar), where the author links the preprocessed data for the 4 datasets marked as "unavailable". I am aware that you've been working to extend the list of BEIR datasets available in ir_datasets (i.e., this issue), and I was wondering whether, given these resources, BioASQ could be added to the catalog.

Dataset Information:

BioASQ is a dataset featured in the BEIR benchmark. It originated from a challenge on "biomedical semantic indexing and question answering". More information about the challenge and the dataset can be found here: http://bioasq.org/

Links to Resources:

Steps listed on beir-cellar to reproduce the files: https://github.com/beir-cellar/beir/tree/main/examples/dataset#2-bioasq

Google Drive folder (linked from the issue cited above) containing the preprocessed data: https://drive.google.com/drive/folders/1CgDO-KmQQMpGEGeD3R20ZgTTM008xix9
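For reference, here is a minimal sketch of how the preprocessed files could be read, assuming they follow the usual BEIR layout (`corpus.jsonl` and `queries.jsonl` with `_id`, `title`, `text` fields, plus a tab-separated qrels file with a `query-id`, `corpus-id`, `score` header). I have not checked this against the files in the Drive folder. The `Doc` type, the function names, and the sample records are made up for illustration and are not part of ir_datasets or BEIR.

```python
import csv
import json
from typing import Dict, Iterable, NamedTuple


class Doc(NamedTuple):
    """Hypothetical record type for one entry of a BEIR-style JSONL file."""
    doc_id: str
    title: str
    text: str


def read_jsonl_docs(lines: Iterable[str]) -> Dict[str, Doc]:
    """Parse BEIR-style JSONL lines (assumed fields: _id, title, text)."""
    docs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        docs[rec["_id"]] = Doc(rec["_id"], rec.get("title", ""), rec.get("text", ""))
    return docs


def read_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Parse a BEIR-style qrels TSV (assumed header: query-id, corpus-id, score)."""
    reader = csv.DictReader(lines, delimiter="\t")
    qrels: Dict[str, Dict[str, int]] = {}
    for row in reader:
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


# Made-up sample data standing in for the real files.
corpus_lines = ['{"_id": "d1", "title": "Example title", "text": "Example abstract."}']
qrels_lines = ["query-id\tcorpus-id\tscore", "q1\td1\t1"]

docs = read_jsonl_docs(corpus_lines)
qrels = read_qrels(qrels_lines)
```

With real data, the same functions could be passed an open file handle, e.g. `read_jsonl_docs(open("corpus.jsonl", encoding="utf-8"))`, since both only iterate over lines.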

Dataset ID(s) & supported entities:

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

seanmacavaney commented 9 months ago

Hey @MathVast! Sorry for the delay -- the start of the semester is a busy time.

Thanks for opening the issue. This seems doable and like a good addition to the package.

MathVast commented 9 months ago

No problem. In the meantime, I've made a fork and worked on integrating BioASQ into ir_datasets on my side. I've been testing the dataset through XPM-IR and it seems to be working, but you might want to check some of the choices I've made. If that's okay with you, @seanmacavaney, I can open a PR.