allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

Change MSMARCO QnA doc_ids to MSMARCO Passages #45

Closed (seanmacavaney closed 3 years ago)

seanmacavaney commented 3 years ago

Right now, msmarco-qna assigns doc_ids sequentially to content as it is encountered. (Documents with the same content are mapped to the same doc_id via a hash.) This works, but has a few downsides:

It should be possible to change msmarco-qna to match the msmarco-passage doc_ids. The key to this is the passage ID -> URL mapping provided by TREC CAsT (http://boston.lti.cs.cmu.edu/vaibhav2/cast/marco_pas_url.tsv). The idea is to find all PIDs that match the document's URL and then pick the one whose content is closest to the document in question. This requires first downloading and processing the msmarco-passage collection, but I don't see that as a big downside, given the upside of having matching doc_ids.
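A minimal sketch of that matching idea, in Python. The function names, the in-memory `(pid, url)` pairs, and the use of `difflib` as the similarity measure are all assumptions made for illustration; they are not the actual ir_datasets implementation, and the column layout of the mapping file is not assumed here.

```python
import difflib
from collections import defaultdict


def build_url_index(pid_url_pairs):
    """Group passage IDs by URL; one URL may map to several PIDs."""
    index = defaultdict(list)
    for pid, url in pid_url_pairs:
        index[url].append(pid)
    return index


def closest_pid(qna_text, qna_url, url_index, passage_text):
    """Return the PID sharing qna_url whose text is most similar to qna_text.

    Returns None when the URL has no candidate PIDs.
    passage_text is a hypothetical dict mapping PID -> passage text.
    """
    candidates = url_index.get(qna_url, [])
    if not candidates:
        return None

    def similarity(pid):
        # Character-level similarity ratio in [0, 1]; an assumed stand-in
        # for whatever "closest content" measure is actually used.
        return difflib.SequenceMatcher(None, qna_text, passage_text[pid]).ratio()

    return max(candidates, key=similarity)


# Toy data standing in for the PID -> URL mapping and the passage collection.
pairs = [("0", "http://a.example/x"), ("1", "http://a.example/x"), ("2", "http://b.example/y")]
passages = {
    "0": "The cat sat on the mat.",
    "1": "Dogs bark loudly at night.",
    "2": "Unrelated text.",
}
index = build_url_index(pairs)
match = closest_pid("The cat sat on a mat.", "http://a.example/x", index, passages)
```

A document whose URL is absent from the mapping gets `None` here, so a real implementation would still need some fallback for those cases.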

One consideration we'll need to make here is how to handle those who have already downloaded/processed the msmarco-qna collection. One possibility is to have some file that flags whether the collection was built with the new doc_id generation technique, and either (1) warn the user that it uses "old" IDs, or (2) automatically clear and re-process the collection. I'm not sure which I prefer, but this will probably set the precedent for any future incompatible changes to datasets.

seanmacavaney commented 3 years ago

Using the PID->URL mapping is not as straightforward as I thought it would be.

The main problem is that documents with the same content (even with different URLs) are collapsed into a single URL mapping in the file. (This probably explains where the 100k new docs come from.) This poses the following challenges:

Many decisions...

seanmacavaney commented 3 years ago

If I take care of #12 at the same time, I could simplify and speed up the implementation by removing the fallback technique.

seanmacavaney commented 3 years ago

All that's left now is a migration plan. Needs to be done for both msmarco-qna and msmarco-passage (because of the encoding fixes).

I think I'm leaning towards automatic upgrades and using a file (e.g., "version.txt") as an indicator.
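A rough sketch of what a version-file check could look like, in Python. Everything here is hypothetical: the `CURRENT_VERSION` value, the function names, the `rebuild` callback, and the choice to clear the whole directory are assumptions for illustration, not the migration code that was actually written.

```python
import shutil
from pathlib import Path

CURRENT_VERSION = "2"  # hypothetical version string for the new doc_id scheme


def read_version(dataset_dir):
    """Return the contents of version.txt, or None if the file is missing."""
    version_file = Path(dataset_dir) / "version.txt"
    if not version_file.exists():
        return None
    return version_file.read_text().strip()


def needs_upgrade(dataset_dir, current_version=CURRENT_VERSION):
    """True if processed data exists but was not built with current_version."""
    dataset_dir = Path(dataset_dir)
    if not dataset_dir.exists():
        return False  # nothing processed yet, so nothing to migrate
    return read_version(dataset_dir) != current_version


def upgrade(dataset_dir, rebuild, current_version=CURRENT_VERSION):
    """Clear out-of-date data, re-process it, then record the version.

    rebuild is a hypothetical callback that re-processes the collection
    into dataset_dir.
    """
    dataset_dir = Path(dataset_dir)
    if needs_upgrade(dataset_dir, current_version):
        shutil.rmtree(dataset_dir)
    if not dataset_dir.exists():
        dataset_dir.mkdir(parents=True)
        rebuild(dataset_dir)
        # Written last, so an interrupted rebuild is not marked as current.
        (dataset_dir / "version.txt").write_text(current_version)
```

A directory with no version.txt is treated as "old" in this sketch, which covers anyone who processed the collection before the indicator file existed. The warn-only alternative would replace the `shutil.rmtree` call with a warning.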