Closed seanmacavaney closed 3 years ago
Using the PID->URL mapping is not as straightforward as I thought it would be.
The main problem is that documents with the same content (even with different URLs) are collapsed into a single URL mapping in the file. (This probably explains where the 100k new docs come from.) This poses some challenges for the doc_id mapping. One option is to assign doc_ids of the form `[pid]-[qnaidx]`, where `qnaidx` is incremented as the same document is encountered with different URLs. Alternatively, the document could keep a list of all URLs it is found in, and the qrel/scoreddoc record could specify the one linked to its appearance in the list. Many decisions...
If I take care of #12 at the same time, it could simplify + speed up the implementation by removing the fallback technique.
All that's left now is a migration plan. Needs to be done for both `msmarco-qna` and `msmarco-passage` (because of the encoding fixes).
I think I'm leaning towards automatic upgrades and using a file (e.g., "version.txt") as an indicator.
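The "version.txt" indicator could work along these lines (an illustrative sketch only, under the assumption that each collection lives in its own directory; this is not the actual ir_datasets implementation):

```python
# Sketch: use a version file to detect a collection built with the old
# doc_id scheme and automatically clear it so it gets re-processed.
# CURRENT_VERSION and the directory layout are assumptions for illustration.
from pathlib import Path
import shutil

CURRENT_VERSION = "2"  # hypothetical version string for the new doc_id scheme

def ensure_current(base_dir: str):
    base = Path(base_dir)
    version_file = base / "version.txt"
    if base.exists():
        # A missing version file implies the collection predates versioning.
        found = version_file.read_text().strip() if version_file.exists() else "1"
        if found != CURRENT_VERSION:
            # Automatic upgrade: clear the stale build so it re-processes.
            shutil.rmtree(base)
    base.mkdir(parents=True, exist_ok=True)
    version_file.write_text(CURRENT_VERSION)
```

The alternative design (option 1 above) would replace the `shutil.rmtree` call with a warning that the collection uses "old" IDs.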
Right now, `msmarco-qna` assigns doc_ids sequentially to content, as it's encountered. (It maps documents with the same content via a hash to the same doc_id.) This works, but has a notable downside: the documents are `msmarco-passage` docs, but with different doc_ids than `msmarco-passage`. Perhaps they did some soft matching when creating that dataset.

It should be possible to change `msmarco-qna` to match the `msmarco-passage` doc_ids. The key to this is the passage ID -> URL mapping provided by TREC CAST (http://boston.lti.cs.cmu.edu/vaibhav2/cast/marco_pas_url.tsv). The idea is to find all PIDs that match the URL and then pick the one that has the closest content to the document in question. This will involve first downloading/processing the `msmarco-passage` collection, but I don't see that as a big downside, given the upside of having matching doc_ids.

One consideration we'll need to make here is how to handle those who have already downloaded/processed the `msmarco-qna` collection. One possibility is to have some file that flags whether it used the new doc_id generation technique, and either (1) warn the user that it uses "old" IDs, or (2) automatically clear and re-process the collection. I'm not sure which I prefer, but this will probably set the precedent for future incompatible changes to datasets.
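The PID -> URL matching idea could look something like the following sketch. The helper names, the in-memory `passage_texts` lookup, and the choice of `difflib` similarity are all illustrative assumptions, not part of ir_datasets:

```python
# Sketch: invert the TREC CAST pid -> URL TSV into url -> [pids], then for a
# given msmarco-qna document pick, among the pids sharing its URL, the
# msmarco-passage text that is closest to the document's content.
import csv
import difflib
from collections import defaultdict

def load_url_to_pids(tsv_path):
    """Parse a pid<TAB>url file into a url -> [pid, ...] mapping."""
    url_to_pids = defaultdict(list)
    with open(tsv_path, encoding="utf-8") as f:
        for pid, url in csv.reader(f, delimiter="\t"):
            url_to_pids[url].append(pid)
    return url_to_pids

def best_matching_pid(doc_text, url, url_to_pids, passage_texts):
    """passage_texts: pid -> msmarco-passage text (assumed pre-loaded)."""
    candidates = url_to_pids.get(url, [])
    if not candidates:
        return None  # caller would fall back to another doc_id scheme
    return max(candidates,
               key=lambda pid: difflib.SequenceMatcher(
                   None, doc_text, passage_texts[pid]).ratio())
```

Returning `None` when no PID shares the URL is where the fallback technique mentioned above would kick in.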