beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

Adding the ANTIQUE dataset to BEIR #130

Open heliah opened 1 year ago

heliah commented 1 year ago

Hi,

I am a creator of the ANTIQUE dataset, a passage retrieval dataset for non-factoid question answering. Please find the paper that explains the data here. I think it could be beneficial, and I would like to add it to the BEIR benchmark. Please let me know if you need me to take any action.

Thank you, Helia Hashemi (hhashemi@cs.umass.edu)

thakur-nandan commented 1 year ago

Hi @heliah,

Thank you for sharing the ANTIQUE dataset. Interestingly, the dataset contains non-factoid questions, where retrieval models need to understand the meaning of the passages to judge their relevance to a given query.

One question about the domain of the ANTIQUE dataset: it covers Wikipedia, right?

We are currently planning the next version of the BEIR benchmark and want more diversity in domains and tasks. We will reach out when we reach the dataset finalization stage.

Kind Regards, Nandan Thakur

heliah commented 1 year ago

Thanks for your prompt reply!

No, actually: the collection and queries both come from community question answering (CQA) websites, so it contains diverse non-factoid questions with open-ended answers in informal language. The queries are a good mixture, ranging from technical to opinion-based to everyday tasks. The dataset is unique in the sense that many queries and passages contain sarcasm, rhetorical questions, cultural context, etc. Generally speaking, most of the challenges arise in understanding (and ranking) day-to-day human interactions (and information needs).

The relevance annotations were also collected through pooling (similar to TREC), so it's among the very few (if not the only) CQA datasets with "complete" relevance annotations. The annotations use four levels (from 1 to 4), and the dataset has both training and test splits.

There are already a few CQA datasets in BEIR, but they either focus on a very specific domain (e.g., financial) or just focus on duplicate question retrieval. So I hope ANTIQUE introduces a novel angle to the great collection of datasets in BEIR.

Please let me know if you have any other questions. Helia Hashemi

nreimers commented 1 year ago

Sounds interesting, would be nice to have it added to BEIR.

@heliah Do you have the files in the BEIR format?

heliah commented 1 year ago

Hi Nils,

Sorry for the delayed response; I missed your message. I have converted the ANTIQUE dataset to the BEIR format. You can download beir-version.zip from https://ciir.cs.umass.edu/downloads/Antique/
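For anyone who wants to inspect the converted files, here is a minimal sketch of reading a folder in the BEIR layout using only the Python standard library. It assumes the archive unpacks into the usual BEIR structure (`corpus.jsonl`, `queries.jsonl`, and `qrels/<split>.tsv` with a `query-id`, `corpus-id`, `score` header); the actual contents of beir-version.zip are not described in this thread, and the function name `load_beir_folder` is hypothetical.

```python
import csv
import json
from pathlib import Path


def load_beir_folder(data_dir, split="test"):
    """Read corpus, queries, and qrels from a folder in the BEIR layout.

    Assumed layout (not confirmed for the ANTIQUE archive):
      corpus.jsonl      one JSON object per line: "_id", "title", "text"
      queries.jsonl     one JSON object per line: "_id", "text"
      qrels/<split>.tsv header row, then query-id <TAB> corpus-id <TAB> score
    """
    data_dir = Path(data_dir)

    corpus = {}
    with open(data_dir / "corpus.jsonl", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                doc = json.loads(line)
                corpus[doc["_id"]] = {
                    "title": doc.get("title", ""),
                    "text": doc["text"],
                }

    all_queries = {}
    with open(data_dir / "queries.jsonl", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                query = json.loads(line)
                all_queries[query["_id"]] = query["text"]

    qrels = {}
    qrels_path = data_dir / "qrels" / f"{split}.tsv"
    with open(qrels_path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader, None)  # skip the header row
        for row in reader:
            if len(row) != 3:
                continue  # ignore blank or malformed lines
            query_id, doc_id, score = row
            qrels.setdefault(query_id, {})[doc_id] = int(score)

    # Keep only the queries that have judgments in the requested split.
    queries = {qid: text for qid, text in all_queries.items() if qid in qrels}
    return corpus, queries, qrels
```

After unzipping, a call such as `corpus, queries, qrels = load_beir_folder("path/to/unzipped/folder", split="test")` would return three dictionaries keyed by ID. If the conversion keeps the four relevance levels, they would show up as the integer values in `qrels`.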

Please let me know if you have any questions.

Best, Helia
