Closed seanmacavaney closed 3 years ago
Using the PID->URL mapping is not as straightforward as I thought it would be.
The main problem is that documents with the same content (even with different URLs) are collapsed into a single URL mapping in the file. (This probably explains where the 100k new docs come from.) This poses some challenges for the doc_id mapping. One option is to assign doc_ids of the form `[pid]-[qnaidx]`, where `qnaidx` is incremented as the same document is encountered with different URLs. Alternatively, the document could keep a list of all URLs it is found in, and the qrel/scoreddoc record could specify the one linked to its appearance in the list. Many decisions...
If I take care of #12 at the same time, it could simplify + speed up the implementation by removing the fallback technique.
All that's left now is a migration plan. Needs to be done for both `msmarco-qna` and `msmarco-passage` (because of the encoding fixes).
I think I'm leaning towards automatic upgrades and using a file (e.g., "version.txt") as an indicator.
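The "version.txt" indicator could work along these lines (an illustrative sketch only, under the assumption that each collection lives in its own directory; this is not the actual ir_datasets implementation):

```python
# Sketch: use a version file to detect a collection built with the old
# doc_id scheme and automatically clear it so it gets re-processed.
# CURRENT_VERSION and the directory layout are assumptions for illustration.
from pathlib import Path
import shutil

CURRENT_VERSION = "2"  # hypothetical version string for the new doc_id scheme

def ensure_current(base_dir: str):
    base = Path(base_dir)
    version_file = base / "version.txt"
    if base.exists():
        # A missing version file implies the collection predates versioning.
        found = version_file.read_text().strip() if version_file.exists() else "1"
        if found != CURRENT_VERSION:
            # Automatic upgrade: clear the stale build so it re-processes.
            shutil.rmtree(base)
    base.mkdir(parents=True, exist_ok=True)
    version_file.write_text(CURRENT_VERSION)
```

The alternative design (option 1 above) would replace the `shutil.rmtree` call with a warning that the collection uses "old" IDs.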
Right now, `msmarco-qna` assigns doc_ids sequentially to content, as it's encountered. (It maps documents with the same content via a hash to the same doc_id.) This works, but has a notable downside: the documents are `msmarco-passage` docs, but with different doc_ids than `msmarco-passage`. Perhaps they did some soft matching when creating that dataset.

It should be possible to change `msmarco-qna` to match the `msmarco-passage` doc_ids. The key to this is the passage ID -> URL mapping provided by TREC CAST (http://boston.lti.cs.cmu.edu/vaibhav2/cast/marco_pas_url.tsv). The idea is to find all PIDs that match the URL and then pick the one that has the closest content to the document in question. This will involve first downloading/processing the `msmarco-passage` collection, but I don't see that as a big downside, given the upside of having matching doc_ids.

One consideration we'll need to make here is how to handle those who have already downloaded/processed the `msmarco-qna` collection. One possibility is to have some file that flags whether it used the new doc_id generation technique, and either (1) warn the user that it uses "old" IDs, or (2) automatically clear and re-process the collection. I'm not sure which I prefer, but this will probably set the precedent for future incompatible changes to datasets.
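The PID -> URL matching idea could look something like the following sketch. The helper names, the in-memory `passage_texts` lookup, and the choice of `difflib` similarity are all illustrative assumptions, not part of ir_datasets:

```python
# Sketch: invert the TREC CAST pid -> URL TSV into url -> [pids], then for a
# given msmarco-qna document pick, among the pids sharing its URL, the
# msmarco-passage text that is closest to the document's content.
import csv
import difflib
from collections import defaultdict

def load_url_to_pids(tsv_path):
    """Parse a pid<TAB>url file into a url -> [pid, ...] mapping."""
    url_to_pids = defaultdict(list)
    with open(tsv_path, encoding="utf-8") as f:
        for pid, url in csv.reader(f, delimiter="\t"):
            url_to_pids[url].append(pid)
    return url_to_pids

def best_matching_pid(doc_text, url, url_to_pids, passage_texts):
    """passage_texts: pid -> msmarco-passage text (assumed pre-loaded)."""
    candidates = url_to_pids.get(url, [])
    if not candidates:
        return None  # caller would fall back to another doc_id scheme
    return max(candidates,
               key=lambda pid: difflib.SequenceMatcher(
                   None, doc_text, passage_texts[pid]).ratio())
```

Returning `None` when no PID shares the URL is where the fallback technique mentioned above would kick in.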