allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

S3 or other file IO backends #217

Open janheinrichmerker opened 1 year ago

janheinrichmerker commented 1 year ago

Is your feature request related to a problem? Please describe.
As datasets grow, "conventional" file storage is not always the best way to store and distribute them. For example, the Common Crawl data is distributed via S3 buckets.

Describe the solution you'd like
To integrate such datasets with ir_datasets, it would be great to be able to swap the IO "backend". For example, instead of opening a file at /path/to/file, we could open it from s3.example.com/bucket/path/to/file.
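
To make the idea concrete, here is a minimal sketch of what a swappable IO backend could look like. It is not part of ir_datasets: `register_backend`, `open_binary`, and the `mem` scheme are hypothetical names made up for illustration, and the in-memory "bucket" only stands in for a remote store.

```python
import io
from urllib.parse import urlparse

# Hypothetical registry: URI scheme -> function returning a binary file object.
_BACKENDS = {}


def register_backend(scheme, opener):
    """Associate a URI scheme (e.g. "s3") with an opener function."""
    _BACKENDS[scheme] = opener


def open_binary(uri):
    """Open `uri` for reading with whichever backend handles its scheme."""
    scheme = urlparse(uri).scheme
    try:
        opener = _BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"no IO backend registered for scheme {scheme!r}") from None
    return opener(uri)


# Plain paths such as /path/to/file have an empty scheme: use the local disk.
register_backend("", lambda uri: open(uri, "rb"))

# Stand-in for a remote object store (assumption: real code would call an
# S3 client here instead of reading from a dict).
_FAKE_BUCKET = {"mem://bucket/path/to/file": b"doc1\ndoc2\n"}
register_backend("mem", lambda uri: io.BytesIO(_FAKE_BUCKET[uri]))
```

With this in place, `open_binary("/path/to/file")` and `open_binary("mem://bucket/path/to/file")` return objects with the same read interface, so the code that parses a dataset would not need to know where the bytes come from.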

Describe alternatives you've considered
Mounting S3 as a file system is possible, but it would probably slow down file access. It is also not officially endorsed by S3 storage providers like Amazon.

Additional context
I think Hadoop, YARN, and Beam all support multiple underlying protocols for file access.

seanmacavaney commented 1 year ago

The current structure actually already supports this! All you'd need to do is build a RequestsDownload object that isn't wrapped in Cache.
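
To illustrate the difference between a wrapped and an unwrapped download, here is a rough standard-library sketch. It does not use ir_datasets' own classes: `DirectStream` and `CachedStream` are hypothetical stand-ins for the roles that RequestsDownload and Cache play, and their constructors and `stream()` method are assumptions for this example only.

```python
import shutil
import urllib.request
from contextlib import contextmanager
from pathlib import Path


class DirectStream:
    """Hypothetical uncached source: every stream() call re-reads the URL."""

    def __init__(self, url):
        self.url = url

    @contextmanager
    def stream(self):
        with urllib.request.urlopen(self.url) as response:
            yield response


class CachedStream:
    """Hypothetical cache wrapper: the first stream() call copies the source
    to a local file, and later calls read only that local copy."""

    def __init__(self, source, cache_path):
        self.source = source
        self.cache_path = Path(cache_path)

    @contextmanager
    def stream(self):
        if not self.cache_path.exists():
            # Note: a failed copy would leave a partial file behind; a real
            # implementation would need to handle that.
            with self.source.stream() as remote, open(self.cache_path, "wb") as local:
                shutil.copyfileobj(remote, local)
        with open(self.cache_path, "rb") as local:
            yield local
```

Leaving out the wrapper means nothing is written to local disk and every read goes back to the remote location, which is the trade-off for datasets that are too large to copy.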