allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

check adequate storage #86

Closed seanmacavaney closed 3 years ago

seanmacavaney commented 3 years ago

Is your feature request related to a problem? Please describe.

Many datasets require substantial storage, and users sometimes hit errors because they run out of disk space.

Describe the solution you'd like

It would be nice if (1) the user were told how much storage an operation will consume, and (2) the operation failed early when there isn't enough space. This applies both to downloads and to other operations that use a lot of storage (e.g., building docstores).
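As a rough illustration of the "fail early" idea, here is a minimal sketch using only the standard library. The names `check_storage` and `format_size` are hypothetical and not part of ir_datasets; the required size is assumed to be known in advance (e.g., from a hard-coded estimate).

```python
import shutil


def format_size(num_bytes):
    """Render a byte count in binary units, e.g. 1536 -> '1.5 KiB'."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


def check_storage(path, required_bytes, free_bytes=None):
    """Raise OSError before starting if `path` lacks room for `required_bytes`.

    `free_bytes` may be passed explicitly (useful for testing); otherwise it
    is read from the filesystem that contains `path`.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    if required_bytes > free_bytes:
        raise OSError(
            f"need {format_size(required_bytes)} at {path!r}, "
            f"but only {format_size(free_bytes)} is free"
        )
    return free_bytes
```

Calling something like `check_storage(".", expected_size)` before a download or docstore build would cover both points: the error message reports the expected size, and the operation stops before any data is written.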

Describe alternatives you've considered

(none)

Additional context

Adding a file size to each item in downloads.json would be nice documentation to have to begin with. Docstore sizes would need to be hard-coded as well. How would we make sure these stay in sync if a change to a dataset affects its docstore size? Maybe it's not a big deal as long as the number isn't far off.
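To make the idea concrete, here is a sketch of what a per-file size entry and a loose sync check could look like. The `size_hint` field name, the JSON layout, the example URL, and the 10% tolerance are all assumptions for illustration, not the actual downloads.json schema.

```python
import json

# Hypothetical downloads.json fragment; "size_hint" (bytes) is an assumed field.
DOWNLOADS_JSON = """
{
  "example-dataset": {
    "docs": {"url": "https://example.com/docs.tsv.gz", "size_hint": 1000000}
  }
}
"""


def total_size_hint(downloads):
    """Sum the size hints of all files; entries without a hint count as 0."""
    return sum(
        entry.get("size_hint", 0)
        for files in downloads.values()
        for entry in files.values()
    )


def hint_is_stale(size_hint, actual_size, tolerance=0.1):
    """Return True if the actual size deviates from the hint by more than
    `tolerance`, measured relative to the hint."""
    if size_hint == 0:
        return actual_size != 0
    return abs(actual_size - size_hint) / size_hint > tolerance
```

A check like `hint_is_stale` could run in a periodic test against the real file or docstore size, so a hard-coded number only needs updating when it drifts noticeably.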