allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

Add args.me corpus #133

Closed · janheinrichmerker closed this 2 years ago

janheinrichmerker commented 2 years ago

Fixes #127.

I'm not sure how to implement the iterator slicing. Therefore, I disabled the check in the integration test (test_iter_split=False). I'd appreciate any help.

I also switched to another ID naming scheme for args.me, as the corpus has 3 different versions on Zenodo. The dataset IDs are now the following:

janheinrichmerker commented 2 years ago

CC @alebondarenko

seanmacavaney commented 2 years ago

This looks great, thanks @heinrichreimer! The code is absolutely beautiful.

For iterator slicing, you can use @ir_datasets.util.use_docstore -- this will build a docstore (which provides slicing operations) if the user slices; otherwise, it uses the simple iterator.
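For reference, a minimal sketch of how the decorator is typically applied, assuming the usual docs-handler pattern (the class name, source handling, and parser are illustrative placeholders, not this PR's actual code):

```python
import ir_datasets

class ArgsMeDocs:
    # Hypothetical docs handler, for illustration only.
    def __init__(self, source_file):
        self._source_file = source_file

    @ir_datasets.util.use_docstore
    def docs_iter(self):
        # Plain sequential iteration over the corpus. When the user slices
        # (e.g., docs_iter()[1000:2000]), the decorator instead builds and
        # queries a docstore, which supports random access.
        yield from self._parse_source(self._source_file)

    def docs_store(self):
        # The sliced code path relies on a docstore being available;
        # implementation details are omitted in this sketch.
        raise NotImplementedError

    def _parse_source(self, path):
        # Hypothetical parser for the args.me JSON source.
        raise NotImplementedError
```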

The naming scheme looks good to me.

Can we simplify the data model a bit? The JSON and the data model have a list of premises (List[ArgsMePremise]), but in practice, it's always exactly one. I think premise_text, premise_stance, and premise_annotations could probably go directly into ArgsMeDoc.
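For concreteness, the flattened type could look roughly like this (only the premise_* field names come from the suggestion above; the types and the conclusion field are assumptions):

```python
from typing import List, NamedTuple

# Sketch of the suggested flattened document type.
class ArgsMeDoc(NamedTuple):
    doc_id: str
    conclusion: str
    premise_text: str
    premise_stance: str
    premise_annotations: List[str]  # annotation structure assumed
```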

Would you like to be added to the list of credits here? https://github.com/allenai/ir_datasets/#credits If so, what affiliation would you like to list?

alebondarenko commented 2 years ago

@johanneskiesel can you please have a look at Sean's suggestions about the args.me data simplifications?

johanneskiesel commented 2 years ago

Thanks for the suggestion! The decision to have a list of premises was made from a theoretical perspective. Our current data indeed does not take advantage of this possibility. I'm torn on this question myself and would have to involve others in this specific decision.

So I would say it is very unlikely that we will change from a list to just one premise anytime soon.

seanmacavaney commented 2 years ago

Thanks for the background-- the motivation for having a list of premises makes sense to me.

This comes down to the trade-off between representing the dataset as closely as possible to its original form and simplifying the data for practicality. These two use cases are often at odds; the tension is detailed in this issue.

In practice, most tools that use ir_datasets handle flat structures more easily than nested ones -- e.g., PyTerrier could index the premise text with indexer.index(dataset.get_corpus_iter(), fields=['premise_text']), but with the list format you'd need special code to loop over all premises and concatenate the text field of each (or some other technique). It's a similar story for other tools like Capreolus, DiffIR, FlexNeuART, and OpenNIR. This just seems like an unnecessary burden for the three versions of the dataset above, given that they all contain exactly one premise per document.
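As a concrete sketch of that difference (the dataset ID is taken from later in this thread; the PyTerrier irds integration, index paths, and dict keys are assumptions for illustration, and the nested variant assumes premises are exposed as dicts):

```python
import pyterrier as pt

pt.init()
dataset = pt.get_dataset('irds:argsme/1.0')

# Flat field: the premise text can be indexed directly.
indexer = pt.IterDictIndexer('./argsme-index-flat')
indexer.index(dataset.get_corpus_iter(), fields=['premise_text'])

# List format: the user would instead need glue code to flatten
# the premises before indexing.
def flatten(docs):
    for doc in docs:
        yield {
            'docno': doc['docno'],
            'text': ' '.join(premise['text'] for premise in doc['premises']),
        }

indexer = pt.IterDictIndexer('./argsme-index-nested')
indexer.index(flatten(dataset.get_corpus_iter()), fields=['text'])
```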

To be clear, I'm not saying the dataset's source JSON needs to change -- I'm just suggesting a simplification for the ir_datasets interface to the dataset. If future versions of the dataset contain multiple premises, they could certainly provide a list of premises instead (because then the user would need to decide how to handle multiple premises).

janheinrichmerker commented 2 years ago

@johanneskiesel I'd also appreciate suggestions on which fields we need to include from the context field.

johanneskiesel commented 2 years ago

If this is a general "best practice" for ir_datasets to simplify like this (which I gather from your comment), then I'm all for the simplification. I think the consistency within ir_datasets is more important here than a close reflection of our dataset and API.

@heinrichreimer I don't think there is a "must-have" context attribute. But I consider the "sourceUrl" to be useful in several applications.

alebondarenko commented 2 years ago

That's a good point for all of us to keep in mind. The BEIR framework, for example, "simplified" the relevance judgments -- which is their call -- but if someone uses the Touché dataset from BEIR and tries to reproduce the results from Touché (e.g., the evaluation), they will fail due to the label mismatch.

seanmacavaney commented 2 years ago

Yeah, I'm opposed to simplification that changes the benchmark.

janheinrichmerker commented 2 years ago

May I suggest keeping both? We could have premises as a list, plus premise, premise_stance, and premise_annotations for the first premise.

seanmacavaney commented 2 years ago

That's an excellent idea; it supports both use cases perfectly.

seanmacavaney commented 2 years ago

It's similar to what's done for the Washington Post dataset, so there's already precedent.

janheinrichmerker commented 2 years ago

OK, then I would go with a premises field that contains a list of all premises with stance and annotations, and a string field premises_texts that contains the concatenated texts of all premises. The second attribute can then be used directly for basic retrieval, while the list might be useful for more sophisticated filtering.
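Concretely, the resulting data model could look something like this sketch (the field types and the conclusion field are assumptions, not the final merged definitions):

```python
from typing import List, NamedTuple

class ArgsMePremise(NamedTuple):
    text: str
    stance: str
    annotations: List[str]  # annotation structure assumed

class ArgsMeDoc(NamedTuple):
    doc_id: str
    conclusion: str
    premises: List[ArgsMePremise]  # full structure, for sophisticated filtering
    premises_texts: str            # concatenated texts, for basic retrieval
```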

janheinrichmerker commented 2 years ago

I'm currently adding some more attributes from the context JSON object. This includes aspects, and I would apply the same principle as with premises: the aspects field contains a list of aspects with name, weight, and rank, and the aspects_names field contains all aspects' names concatenated, to be used for basic retrieval.
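Following the same pattern, a sketch of the aspect structure (field types are assumptions):

```python
from typing import NamedTuple

class ArgsMeAspect(NamedTuple):
    name: str
    weight: float  # type assumed
    rank: int

# On the document itself, mirroring the premises:
#   aspects: List[ArgsMeAspect]  -- full structure
#   aspects_names: str           -- concatenated names, for basic retrieval
```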

janheinrichmerker commented 2 years ago

I've now added all attributes I could find in the JSON files and updated the integration tests accordingly.

janheinrichmerker commented 2 years ago

Just tested python -m ir_datasets export argsme/1.0 docs --format jsonl and it looks good to me.
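For anyone following along, once this is merged the corpus should also be loadable through the Python API (the ID argsme/1.0 is taken from the command above; the premises_texts field name follows the scheme agreed on earlier in this thread):

```python
import ir_datasets

dataset = ir_datasets.load('argsme/1.0')
for doc in dataset.docs_iter():
    # Each doc is an ArgsMeDoc; print its ID and concatenated premise texts.
    print(doc.doc_id, doc.premises_texts)
```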