allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0
314 stars, 42 forks

Touché 2020 and 2021 #135

janheinrichmerker closed 2 years ago

janheinrichmerker commented 2 years ago

Finally had time to implement the Touché datasets :partying_face: I'd suggest a different naming scheme, as we now already have the corpora in ir_datasets that were used in Touché 2020 and 2021:

So I think it's unnecessary to start the "paths" with a corpus version; they could instead start with the year component, like this:

Then I'd like to specify the shared tasks like this:

Now some tasks have multiple qrels, so I'd then further split like this:

All different versions are also explained in the YAML documentation and some have specific qrel definitions in the dataset definition.

(BTW this PR fixes #125 :wink:)

janheinrichmerker commented 2 years ago

Could you maybe publish a release with the newly added datasets and the many other changes you have made since the last release?

seanmacavaney commented 2 years ago

Wow, you're right, it has been a bit.

Is it important that these changes are included in the release? I likely don't have the time to review these changes today, but I can pretty easily bump versions and release what's in the main branch now.

janheinrichmerker commented 2 years ago

I'd then rather wait and release Touché together with args.me :smiley:

janheinrichmerker commented 2 years ago

It's not urgent though.

seanmacavaney commented 2 years ago

Let me see if I can squeeze in a review of this today.

seanmacavaney commented 2 years ago

Thanks again for the great work @heinrichreimer! After a quick scan, I propose the following changes:

The datasets are usually organised hierarchically with the corpus at the upper level. E.g., there are versions of TREC Web, TREC Health Misinformation, NTCIR WWW, and CLEF eHealth all under clueweb12. I think we should do the same thing here. I propose:

It looks like we can merge both the quality and relevance assessments into the same qrel records, since it's the same query-doc pairs being judged. clueweb12/b13/trec-misinfo-2019 is an example of a dataset that does something similar.
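To illustrate the idea, here is a minimal sketch of merging the two judgment sets into one record per query-doc pair. It is plain Python and not the code in this PR: the record types `RelevanceQrel`, `QualityQrel`, and `MergedQrel`, the function `merge_qrels`, and the sample IDs are all hypothetical names chosen for the example.

```python
from typing import Iterable, List, NamedTuple


# Hypothetical record types for illustration; not the classes used in ir_datasets.
class RelevanceQrel(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int


class QualityQrel(NamedTuple):
    query_id: str
    doc_id: str
    quality: int


class MergedQrel(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int
    quality: int


def merge_qrels(
    relevance_qrels: Iterable[RelevanceQrel],
    quality_qrels: Iterable[QualityQrel],
) -> List[MergedQrel]:
    """Join both judgment sets on (query_id, doc_id)."""
    # Index the quality judgments by the query-doc pair they belong to.
    quality_by_pair = {(q.query_id, q.doc_id): q.quality for q in quality_qrels}
    merged = []
    for rel in relevance_qrels:
        pair = (rel.query_id, rel.doc_id)
        # Assumption: every judged pair has both labels, so a gap is an error.
        if pair not in quality_by_pair:
            raise KeyError(f"no quality judgment for {pair}")
        merged.append(
            MergedQrel(rel.query_id, rel.doc_id, rel.relevance, quality_by_pair[pair])
        )
    return merged


# Made-up sample judgments.
relevance = [RelevanceQrel("1", "doc-a", 2), RelevanceQrel("1", "doc-b", 0)]
quality = [QualityQrel("1", "doc-b", 1), QualityQrel("1", "doc-a", 3)]
merged = merge_qrels(relevance, quality)
```

Raising on a missing pair is just one choice; if some pairs only have one of the two labels, an optional field would be the alternative.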

janheinrichmerker commented 2 years ago

Very good suggestions! I'll move the datasets and merge the quality and relevance qrels.

janheinrichmerker commented 2 years ago

One thing to keep in mind is that Touché 2022 (which I'm planning to add as well once qrels are available) uses corpora derived from args.me as well as custom corpora. For the extracted passage corpora derived from argsme-2020-04-01, I think we could then have custom documents under the argsme-2020-04-01/touche-2022-task-1 dataset ID, for example. (See task 1 description.) For 2022 task 2, would I then need to add the dataset with the ID webis/touche-2022-task-2? (See task 2 description.) What other "path structure" would be ideal here? (As adding Touché 2022 is not directly related to this PR, we might also discuss this in another issue.)

seanmacavaney commented 2 years ago

Thanks for the heads up!

For '22 Task 1, I'd lean towards:

For '22 Task 2, do we know about the corpus? It looks like they are passages derived from a subset of clueweb12. Do you think it'll be used beyond the '22 task? I think good options would be along the lines of:

But there's obviously no "right" answer for any of these -- I'm open to alternatives.

Some of these IDs are getting pretty long. I wonder if we should shorten touche-YYYY-task-X to toucheYY-taskX (e.g., touche22-task1) to keep them (slightly) more manageable?
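As a rough sketch of what that shortening would do to an ID, assuming the long form always looks like `touche-YYYY-task-X` (the helper name `shorten_touche_id` is made up for this example and is not part of ir_datasets):

```python
import re

# Matches the long form, e.g. "touche-2022-task-1".
_LONG_FORM = re.compile(r"touche-(\d{4})-task-(\d+)")


def shorten_touche_id(dataset_id: str) -> str:
    """Rewrite touche-YYYY-task-X as toucheYY-taskX, leaving the rest of the ID alone."""
    return _LONG_FORM.sub(
        # Keep only the last two digits of the year and drop the inner hyphens.
        lambda m: f"touche{m.group(1)[2:]}-task{m.group(2)}",
        dataset_id,
    )


short_id = shorten_touche_id("argsme-2020-04-01/touche-2022-task-1")
```

Only the Touché component changes, so the corpus prefix stays intact.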

BTW, it can be helpful to add the docs and queries for '22 now, even before the qrels are released, to help folks participate in the task. Qrels can be added once they are released.

janheinrichmerker commented 2 years ago

There we go :smile:

seanmacavaney commented 2 years ago

Awesome work, as always. Thanks, @heinrichreimer!

janheinrichmerker commented 2 years ago

Thanks for the release!