Georgetown-IR-Lab / OpenNIR

An end-to-end neural ad-hoc ranking pipeline.
https://opennir.net
MIT License

Better way to handle external files #11

Open seanmacavaney opened 4 years ago

seanmacavaney commented 4 years ago

Introduced in #8. A file we depend on from MS-MARCO has mysteriously gone missing (https://msmarco.blob.core.windows.net/msmarcoranking/qidpidtriples.train.full.tar.gz). I contacted the dataset owners but got no response, so I'm self-hosting it for now.

This really shows how fragile the system is when a required file disappears. It may be worth mirroring all files somewhere else (at least those with no licensing restriction on hosting elsewhere) and having some way to fall back to the mirrors if the originals go missing. I know this has been an issue for TREC in the past during government shutdowns. It's also all the more reason to include hashes for all the files.

Maybe introduce a new class to manage this. Something that could operate like:

dl = util.Downloadable("http://location/1", "http://mirror/1", "http://mirror/2", expected_md5="...")
for line in dl.download_stream():
    ...

Some download operations (like saving files directly or as tmp files) would work with this paradigm. But how could we handle switching to a mirrored version of streamed content if the hash doesn't match at the end?
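
A rough sketch of how such a class might look, to make the idea concrete. Only the `Downloadable(...)` constructor shape and `download_stream()` come from the proposal above. The `download` method, the `opener` parameter, and the `_fetch_verified` helper are hypothetical names, and none of this is existing OpenNIR code. It sidesteps the streaming question rather than solving it: each source is spooled to a temporary file, hashed, and only streamed once the hash matches, so a bad source is never exposed to the caller.

```python
import hashlib
import os
import tempfile
import urllib.request


class Downloadable:
    """Hypothetical sketch: try each source in order until one matches the expected MD5."""

    def __init__(self, *urls, expected_md5=None, opener=urllib.request.urlopen):
        if not urls:
            raise ValueError("at least one URL is required")
        self.urls = urls
        self.expected_md5 = expected_md5
        self.opener = opener  # assumption: injectable so the logic can be tested offline

    def _fetch_verified(self, url, tmp):
        """Copy url into the open file tmp; return True if the hash matches."""
        md5 = hashlib.md5()
        with self.opener(url) as response:
            for chunk in iter(lambda: response.read(1 << 16), b""):
                md5.update(chunk)
                tmp.write(chunk)
        return self.expected_md5 is None or md5.hexdigest() == self.expected_md5

    def download(self, path):
        """Save the first source that verifies to path; raise if none does."""
        errors = []
        target_dir = os.path.dirname(os.path.abspath(path))
        for url in self.urls:
            fd, tmp_path = tempfile.mkstemp(dir=target_dir)
            try:
                with os.fdopen(fd, "wb") as tmp:
                    ok = self._fetch_verified(url, tmp)
                if ok:
                    os.replace(tmp_path, path)  # atomic move into place
                    return url
                errors.append(f"{url}: hash mismatch")
            except OSError as e:  # urllib's URLError is a subclass of OSError
                errors.append(f"{url}: {e}")
            finally:
                if os.path.exists(tmp_path):
                    os.remove(tmp_path)
        raise OSError("all sources failed: " + "; ".join(errors))

    def download_stream(self):
        """Yield lines (as bytes) only after the whole file has been verified."""
        with tempfile.TemporaryDirectory() as tmp_dir:
            path = os.path.join(tmp_dir, "payload")
            self.download(path)
            with open(path, "rb") as f:
                yield from f
```

The trade-off is that the first line is not available until the full download finishes, and the file needs temporary disk space. True streaming with a mid-stream switch to a mirror would need a different approach, since lines already handed to the caller can't be taken back.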