kaglowka closed this issue 2 years ago
Excellent spot, thanks @kaglowka!
I suppose that the reason to not add this header to all msmarco datasets in the first place
Truth is, I just didn't think of it :). Happy for this addition!
Wow, that was quick :)
Happy to hear!
I propose to add the special header X-Ms-Version: 2019-12-12 to all MSMARCO resources larger than 500MB.
Motivation
As mentioned in the discussion here, MSMARCO source needs a special header X-Ms-Version: 2019-12-12 so it accepts range requests and so the download is stable.
I found that this header is already added in download.json, e.g. for msmarco-document-v2 docs, but not for other large resources from the same URL.
I suppose that the reason to not add this header to all msmarco datasets in the first place was to add them only where it is necessary and not "litter" the config, but in my case all attempts to download msmarco-document documents + msmarco-document/orcas top 100 docs (~10GB) through several different WiFi networks, regardless of 15-minute timeout (either through ir_datasets requests or commandline wget) failed.
My OS: Ubuntu 21.10
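
To make the motivation concrete, here is a minimal sketch of a resumable download that sends the header together with a Range request. It is not the ir_datasets implementation: the names resume_headers and download_with_resume are hypothetical, it uses only the Python standard library, and the caller supplies the URL and the destination path.

```python
import os
import urllib.request
from typing import Dict, Optional

# Header name and value as quoted in this issue.
MS_VERSION_HEADER = {"X-Ms-Version": "2019-12-12"}


def resume_headers(offset: int, extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build request headers that ask the server to continue from a byte offset.

    Hypothetical helper for illustration only.
    """
    headers = dict(MS_VERSION_HEADER)
    if extra:
        headers.update(extra)
    if offset > 0:
        # Standard HTTP range syntax: everything from `offset` to the end.
        headers["Range"] = f"bytes={offset}-"
    return headers


def download_with_resume(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Download `url` to `dest`, continuing a partial file if one exists."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    request = urllib.request.Request(url, headers=resume_headers(offset))
    # Note: if `dest` is already complete, the server may answer 416,
    # which urlopen raises as urllib.error.HTTPError.
    with urllib.request.urlopen(request) as response:
        # 206 means the Range header was honoured; any other success
        # status means the body starts at byte 0, so overwrite the file.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

Calling download_with_resume again after a dropped connection would pick up from the bytes already on disk, provided the server accepts the range request.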
I suggest adding this special header to all MSMARCO resources bigger than, say, 500MB. I suppose this will be a workable fix, but maybe going even lower wouldn't be such an exaggeration.
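
A rough sketch of what applying this rule to a download config could look like. The entry layout is an assumption for illustration (a mapping from a resource name to a dict with "url", "size_hint" in bytes, and optional "headers"), and the function name add_header_to_large_entries is hypothetical; the real download.json schema should be checked before adapting it.

```python
from typing import Any, Dict

# "500MB" from the proposal, read here as decimal megabytes (an assumption).
THRESHOLD_BYTES = 500 * 1000 * 1000
MS_VERSION_HEADER = {"X-Ms-Version": "2019-12-12"}


def add_header_to_large_entries(
    entries: Dict[str, Dict[str, Any]],
    threshold: int = THRESHOLD_BYTES,
) -> Dict[str, Dict[str, Any]]:
    """Return a copy of `entries` with the header added to large resources.

    Assumes each entry may carry a "size_hint" (bytes) and a "headers" dict;
    both field names are assumptions, not a confirmed schema.
    """
    patched: Dict[str, Dict[str, Any]] = {}
    for name, entry in entries.items():
        new_entry = dict(entry)
        if entry.get("size_hint", 0) > threshold:
            # Keep any headers already configured and add the version header.
            merged = dict(entry.get("headers", {}))
            merged.update(MS_VERSION_HEADER)
            new_entry["headers"] = merged
        patched[name] = new_entry
    return patched


# Made-up entries, only to show the behaviour.
example = {
    "docs": {"url": "https://example.invalid/docs.tsv.gz", "size_hint": 10_000_000_000},
    "queries": {"url": "https://example.invalid/queries.tsv", "size_hint": 1_000_000},
}
patched_example = add_header_to_large_entries(example)
```

Lowering the threshold argument would cover the "going even lower" variant without touching the rest of the logic.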