Hi @Ji-Xin,
I specifically followed the AllenAI package ir_datasets for loading Robust04. It parses the relevant HTML content into plain text. For more details, see here: https://ir-datasets.com/trec-robust04.html#trec-robust04
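For reference, a minimal sketch of how that loading looks with ir_datasets (dataset id taken from the page linked above; the document fields shown here are an assumption, so check the dataset page for the exact schema):

```python
import ir_datasets

# Dataset id from ir-datasets.com/trec-robust04.html
dataset = ir_datasets.load("trec-robust04")

# Topics come back as TrecQuery tuples with title/description/narrative fields
for query in dataset.queries_iter():
    print(query.query_id, query.title)
    break

# Documents are parsed from the raw TREC markup; a plain-text `text` field
# is assumed here (verify against the dataset page)
for doc in dataset.docs_iter():
    print(doc.doc_id, doc.text[:200])
    break
```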
Kind Regards, Nandan
Thanks for the quick response! Just a follow-up question: which field of the topic should we use? For example:
TrecQuery(query_id='301', title='International Organized Crime', description='Identify organizations that participate in international criminal\nactivity, the activity, and, if possible, collaborating organizations\nand the countries involved.', narrative='A relevant document must as a minimum identify the organization and the\ntype of illegal activity (e.g., Columbian cartel exporting cocaine).\nVague references to international drug trade without identification of\nthe organization(s) involved would not be relevant.')
Do we use title, description, or narrative?
I used the description field of the topic as the query for the results in the leaderboard.
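Concretely, a small hypothetical sketch of what that means on top of the ir_datasets loader above (not the exact leaderboard code):

```python
import ir_datasets

dataset = ir_datasets.load("trec-robust04")

# Use only the description field as the query text; title and narrative are ignored
queries = {q.query_id: q.description for q in dataset.queries_iter()}
```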
Kind Regards, Nandan
Thanks!
Hi, I saw from https://docs.google.com/spreadsheets/d/1L8aACyPaXrL8iEelJLGqlMqXKPX2oSP_R10pZoy77Ns/edit#gid=0 that Robust04 has been added to the BEIR leaderboard. Thanks for providing the results!

Meanwhile, I was wondering what preprocessing was done on the dataset. For example, which fields of the topics are used for constructing the queries (title, desc, or narr)? Which parts of the documents are included (Headline, Text, Date, etc.)?

I'm asking because I tried to evaluate ANCE (the publicly released checkpoint) on Robust04 with minor preprocessing, but the nDCG@10 score is around 0.33, which is much lower than the 0.39 reported on the leaderboard. Thanks a lot!