
beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

http://beir.ai

Apache License 2.0


Robust04 preprocessing #13

Closed: ji-xin closed this issue 3 years ago

ji-xin commented 3 years ago

Hi, I saw from https://docs.google.com/spreadsheets/d/1L8aACyPaXrL8iEelJLGqlMqXKPX2oSP_R10pZoy77Ns/edit#gid=0 that Robust04 has been added to the BEIR leaderboard. Thanks for providing the results! I was wondering what preprocessing was applied to the dataset. For example, which topic fields are used to construct the queries (title, desc, or narr)? Which parts of the documents are included (Headline, Text, Date, etc.)? I'm asking because I evaluated ANCE (the publicly released checkpoint) on Robust04 with minor preprocessing and got an nDCG@10 of around 0.33, which is much lower than the 0.39 reported on the leaderboard. Thanks a lot!

thakur-nandan commented 3 years ago

Hi @Ji-Xin, I followed the AllenAI package ir_datasets for loading Robust04. It converts the relevant HTML content of each document into plain text. For more details, see https://ir-datasets.com/trec-robust04.html#trec-robust04

Kind Regards, Nandan
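
For anyone reproducing this, here is a minimal sketch of the loading step, not the exact script behind the leaderboard numbers. It assumes the `ir_datasets` Python package with the `trec-robust04` dataset ID from the link above, and that each document record exposes `doc_id` and `text`. The helper names `to_beir_corpus` and `load_robust04_corpus`, the empty title, and the whitespace collapsing are illustrative assumptions, not something confirmed in this thread.

```python
from collections import namedtuple

# Stand-in for the document records yielded by ir_datasets for trec-robust04
# (assumed fields: doc_id, text, marked_up_doc).
TrecDoc = namedtuple("TrecDoc", ["doc_id", "text", "marked_up_doc"])


def to_beir_corpus(docs):
    """Map documents to the BEIR corpus layout: {doc_id: {"title", "text"}}."""
    corpus = {}
    for doc in docs:
        # Assumption: no separate title field is used, and runs of
        # whitespace (including newlines) are collapsed to single spaces.
        corpus[doc.doc_id] = {"title": "", "text": " ".join(doc.text.split())}
    return corpus


def load_robust04_corpus():
    """Load Robust04 through ir_datasets.

    Needs `pip install ir_datasets` and a local copy of the TREC disks,
    since the documents cannot be downloaded automatically.
    """
    import ir_datasets  # imported lazily so to_beir_corpus works without it

    dataset = ir_datasets.load("trec-robust04")
    return to_beir_corpus(dataset.docs_iter())
```

`to_beir_corpus` can be tried on a hand-built `TrecDoc` before pointing `load_robust04_corpus` at the real data.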

ji-xin commented 3 years ago

Thanks for the quick response! Just a follow-up question: which field of the topic should we use? For example:

TrecQuery(query_id='301', title='International Organized Crime', description='Identify organizations that participate in international criminal\nactivity, the activity, and, if possible, collaborating organizations\nand the countries involved.', narrative='A relevant document must as a minimum identify the organization and the\ntype of illegal activity (e.g., Columbian cartel exporting cocaine).\nVague references to international drug trade without identification of\nthe organization(s) involved would not be relevant.')

Do we use title, description, or narrative?

thakur-nandan commented 3 years ago

I used the description of the query for the results in the leaderboard.

Kind Regards, Nandan
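
A minimal sketch of turning the topics into queries using the description field, as stated above. The `TrecQuery` namedtuple is a stand-in mirroring the fields shown earlier in this thread; `build_queries` is a hypothetical helper, and collapsing the embedded newlines into spaces is an assumption rather than a confirmed preprocessing step.

```python
from collections import namedtuple

# Stand-in with the same fields as the TrecQuery record shown in this thread.
TrecQuery = namedtuple(
    "TrecQuery", ["query_id", "title", "description", "narrative"]
)

ALLOWED_FIELDS = ("title", "description", "narrative")


def build_queries(topics, field="description"):
    """Return {query_id: query_text}, taking the text from one topic field."""
    if field not in ALLOWED_FIELDS:
        raise ValueError(f"field must be one of {ALLOWED_FIELDS}, got {field!r}")
    # Assumption: line breaks inside the field are collapsed to single spaces.
    return {t.query_id: " ".join(getattr(t, field).split()) for t in topics}


# Topic 301, copied from the example earlier in this thread.
topic = TrecQuery(
    query_id="301",
    title="International Organized Crime",
    description=(
        "Identify organizations that participate in international criminal\n"
        "activity, the activity, and, if possible, collaborating organizations\n"
        "and the countries involved."
    ),
    narrative=(
        "A relevant document must as a minimum identify the organization and the\n"
        "type of illegal activity (e.g., Columbian cartel exporting cocaine).\n"
        "Vague references to international drug trade without identification of\n"
        "the organization(s) involved would not be relevant."
    ),
)

queries = build_queries([topic])  # description is the default field
```

Passing `field="title"` instead gives the short keyword-style queries, which is one possible source of a gap between two evaluations of the same model.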

ji-xin commented 3 years ago

Thanks!