allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

Allow downloads to resume for all MSMARCO dataset resources larger than 500MB #186

Closed. kaglowka closed this issue 2 years ago.

kaglowka commented 2 years ago

I propose to add the special header X-Ms-Version: 2019-12-12 to all MSMARCO resources larger than 500MB.

Motivation: As mentioned in the discussion here, the MSMARCO source needs the special header X-Ms-Version: 2019-12-12 before it accepts range requests. Without range requests an interrupted download cannot resume, so large downloads are unstable.
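To illustrate what the header enables, here is a minimal sketch of a resumable download that sends X-Ms-Version together with an HTTP Range header. The helper names `resume_headers` and `download_with_resume` are made up for this example and are not part of ir_datasets; the sketch only assumes that the server answers a range request with status 206.

```python
import os
import urllib.request

# Header named in this issue; the MSMARCO host needs it to honour range requests.
MS_VERSION_HEADER = {"X-Ms-Version": "2019-12-12"}


def resume_headers(bytes_on_disk):
    """Build request headers that ask the server to continue after bytes_on_disk."""
    headers = dict(MS_VERSION_HEADER)
    if bytes_on_disk > 0:
        # Open-ended range: everything from this byte offset to the end of the file.
        headers["Range"] = f"bytes={bytes_on_disk}-"
    return headers


def download_with_resume(url, dest, chunk_size=1 << 20):
    """Hypothetical helper: continue a partial download of url into dest."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    request = urllib.request.Request(url, headers=resume_headers(offset))
    # Note: if dest is already complete, the server may reply 416,
    # which urlopen raises as urllib.error.HTTPError.
    with urllib.request.urlopen(request) as response:
        # 206 means the range was honoured, so append to the partial file.
        # Any other success status means the full body was sent, so start over.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

Calling `download_with_resume` again after a dropped connection would pick up from the bytes already on disk instead of restarting the whole file.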

I found that this header is already set in download.json, e.g. for the msmarco-document-v2 docs, but not for other large resources served from the same URL.

I suppose that the reason to not add this header to all msmarco datasets in the first place was to add it only where necessary and not "litter" the config. In my case, however, every attempt to download the msmarco-document documents plus the msmarco-document/orcas top 100 docs (~10GB) failed. I tried several different WiFi networks, both through ir_datasets requests and through command-line wget, and the 15-minute timeout did not help.

My OS: Ubuntu 21.10

I suggest adding this special header to all MSMARCO resources bigger than, say, 500MB. I expect this to be a workable fix, though an even lower threshold would not be unreasonable.
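As a rough sketch of the proposed rule, the snippet below adds the header to every sufficiently large entry in a dictionary of download resources. The keys `url`, `size_hint` and `headers`, the `url_marker` parameter, and the example URLs are assumptions made for illustration; they are not confirmed to match the real download.json schema.

```python
SIZE_THRESHOLD = 500 * 1000 * 1000  # 500MB, the threshold suggested above


def add_range_header(resources, url_marker="msmarco", threshold=SIZE_THRESHOLD):
    """Return a copy of resources with X-Ms-Version set on large matching entries.

    The entry layout (url / size_hint / headers) is a hypothetical schema.
    """
    updated = {}
    for name, entry in resources.items():
        entry = dict(entry)  # shallow copy so the input is not mutated
        is_msmarco = url_marker in entry.get("url", "")
        is_large = entry.get("size_hint", 0) > threshold
        if is_msmarco and is_large:
            headers = dict(entry.get("headers", {}))
            # Keep an existing value if the header is already present.
            headers.setdefault("X-Ms-Version", "2019-12-12")
            entry["headers"] = headers
        updated[name] = entry
    return updated


# Made-up entries; the URLs are placeholders, not real resource locations.
example = {
    "docs": {"url": "https://example.com/msmarco/docs.tsv.gz", "size_hint": 8_000_000_000},
    "queries": {"url": "https://example.com/msmarco/queries.tsv.gz", "size_hint": 5_000_000},
}
patched = add_range_header(example)
```

Here only the `docs` entry would gain the header, since `queries` is below the threshold.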

seanmacavaney commented 2 years ago

Excellent spot, thanks @kaglowka!

> I suppose that the reason to not add this header to all msmarco datasets in the first place

Truth is, I just didn't think of it :). Happy for this addition!

kaglowka commented 2 years ago

Wow, that was quick :)

> Truth is, I just didn't think of it :). Happy for this addition!

Happy to hear!