allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

TREC Health Misinformation 2020-21 #41

Open seanmacavaney opened 3 years ago

seanmacavaney commented 3 years ago

Dataset Information:

2021 will be the third year of this track. We already have 2019 (clueweb12/b13/trec-misinfo-2019), but should add 2020 (documents from CommonCrawl) and 2021 (TBD).

This is a placeholder for future information.

Links to Resources:

Dataset ID(s):

<propose dataset ID(s), and where they fit in the hierarchy>

Supported Entities

Additional comments/concerns/ideas/etc.

seanmacavaney commented 2 years ago

2021 uses the C4 corpus. Details here:

AIMS AND SCOPE

Misinformation represents a key problem when using search engines to guide any decision-making task: Are users able to discern authoritative from unreliable information and correct from incorrect information? This problem is further exacerbated when the search occurs within uncontrolled data collections, such as the web, where information can be unreliable, misleading, highly technical, and can lead to unfounded escalations. Information from search-engine results can significantly influence decisions, and research shows that increasing the amount of incorrect information about a topic presented in a Search Engine Result Page (SERP) can impel users to make incorrect decisions.

In this context, the TREC 2021 Misinformation track fosters research on retrieval methods that promote reliable and correct information over misinformation. The track offers the following task:

  • Ad-hoc Retrieval Task: The goal is to design a ranking model that promotes credible and correct information over incorrect information;

GUIDELINES

  • Corpus: noclean version of the C4 dataset (https://huggingface.co/datasets/allenai/c4);
  • Topics: about consumer health search (people seeking health advice online);
  • Runs: runs may be either automatic or manual with the standard TREC run format.

Detailed guidelines on: https://trec-health-misinfo.github.io

IMPORTANT DATES

  • Runs due from participants: September 2, 2021
  • Evaluation results returned: End of September 2021
  • Notebook paper due: October 2021
  • TREC 2021 Conference: November 17-19, 2021
  • Final paper due: February 2022

ORGANIZERS

  • Charles Clarke, University of Waterloo
  • Maria Maistro, University of Copenhagen
  • Mark Smucker, University of Waterloo

CONTACT

For more information or to ask questions, join the google group: https://groups.google.com/forum/#!forum/trec-health-misinformation-track

seanmacavaney commented 2 years ago

Added 2021. The 2020 corpus (CC News) will be a bit of a pain to add. I wonder if there's overlap with CC-News-En (#63)?

isspek commented 2 years ago

Hi @seanmacavaney,

Thank you so much for adding TREC Health Misinformation 2021. I assume that 'c4' is the key to load the dataset. However, when I load it, I only get a bare Dataset(), and calling docs_iter() on it raises an AttributeError. How can I inspect the dataset?

Thanks in advance. Ipek

seanmacavaney commented 2 years ago

Hi @isspek,

You'll want to use c4/en-noclean-tr for the document corpus; it's the particular split of C4 used for this track.

You could also use c4/en-noclean-tr/trec-misinfo-2021, which bundles together both the documents and the queries. (And will eventually also include the qrels, once released.)
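A minimal sketch of how these IDs could be inspected from Python, assuming the ir_datasets package is installed and the C4 source files can be downloaded or are already present. The helper names peek and load_misinfo_2021 are hypothetical. ir_datasets.load, has_docs, has_queries, docs_iter, and queries_iter are the library's usual entry points, but check them against your installed version. Checking has_docs() first avoids the AttributeError that appears when a dataset object has no documents attached.

```python
from itertools import islice


def peek(dataset, n=3):
    """Return up to n document IDs and n query IDs from a dataset object.

    Entity types the dataset does not provide come back as None instead of
    raising AttributeError.
    """
    doc_ids = None
    query_ids = None
    if dataset.has_docs():
        doc_ids = [doc.doc_id for doc in islice(dataset.docs_iter(), n)]
    if dataset.has_queries():
        query_ids = [q.query_id for q in islice(dataset.queries_iter(), n)]
    return doc_ids, query_ids


def load_misinfo_2021():
    """Load the bundled documents + queries (hypothetical convenience wrapper)."""
    import ir_datasets  # third-party; imported lazily so peek() works without it

    return ir_datasets.load("c4/en-noclean-tr/trec-misinfo-2021")
```

Calling peek(load_misinfo_2021()) would start the source-file download on first use, so expect it to take a while.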

seanmacavaney commented 2 years ago

Note that it takes some time to download the source files. If you already have them, you can link the directory to ~/.ir_datasets/c4/en.noclean/
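A rough sketch of the linking step, assuming a POSIX-style filesystem where symlinks are allowed. The function name link_c4_sources and its home parameter are hypothetical and exist only for illustration; the target path is the one given above.

```python
import os
from pathlib import Path


def link_c4_sources(existing_dir, home=None):
    """Symlink an existing C4 noclean download into the ir_datasets directory.

    existing_dir: directory that already holds the source files.
    home: base directory to use instead of the user's home (hypothetical
          parameter, mainly useful for trying this out safely).
    """
    home = Path(home) if home is not None else Path.home()
    target = home / ".ir_datasets" / "c4" / "en.noclean"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Refuse to overwrite anything that is already there.
    if target.is_symlink() or target.exists():
        raise FileExistsError(f"{target} already exists; remove it first")
    os.symlink(Path(existing_dir).resolve(), target, target_is_directory=True)
    return target
```

The function refuses to replace an existing directory or link, so a partially downloaded copy is never silently hidden.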

isspek commented 2 years ago

Thank you so much for your help; it worked in my local environment. In Google Colab, however, installing ir_datasets with pip and then loading the misinfo dataset gives a not-found error. I therefore tried the second approach for installing ir_datasets, and this time Colab throws this error: ERROR: ir_datasets-0.4.2-py2-none-any.whl is not a supported wheel on this platform. Do you know how I can install the version containing this dataset on Colab?

seanmacavaney commented 2 years ago

Yup, you can just run this:

!pip install git+https://github.com/allenai/ir_datasets.git

I can add this to the readme.

But you'll probably have issues with the c4 datasets on Colab -- the source files are several TB in size, which will be too large for the runtime.
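One way to check this before starting a download is to compare free disk space against a rough size estimate. This is only a sketch: the helper has_room_for is hypothetical, and the 2 TB threshold is a placeholder, since the exact size of the source files is not stated here beyond "several TB".

```python
import shutil

TB = 1024 ** 4


def has_room_for(path, required_bytes):
    """Return True if the filesystem holding `path` has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


# Placeholder threshold (assumption): adjust once the real corpus size is known.
REQUIRED_BYTES = 2 * TB
print("enough space:", has_room_for(".", REQUIRED_BYTES))
```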