allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0
306 stars, 40 forks

S3 or other file IO backends #217

Open heinrichreimer opened 1 year ago

heinrichreimer commented 1 year ago

Is your feature request related to a problem? Please describe.
As datasets grow larger, "conventional" file storage is sometimes not the ideal way to store them. For example, Common Crawl is distributed via S3 buckets.

Describe the solution you'd like
To be able to integrate such datasets with ir_datasets, it would be great to be able to swap the IO "backend". So for example, instead of opening a file from location /path/to/file, we could open files from s3.example.com/bucket/path/to/file.
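To make the request concrete, here is a minimal sketch of what a swappable IO backend could look like. It is not part of ir_datasets: `BACKENDS`, `register_backend`, `open_any`, and the in-memory `mem://` scheme are all hypothetical names invented for illustration, and POSIX-style local paths are assumed. An `s3` entry could be registered the same way by wrapping whichever S3 client library is in use.

```python
import io
from urllib.parse import urlparse
from urllib.request import urlopen

# Hypothetical registry: URL scheme -> function returning a binary file-like object.
BACKENDS = {}


def register_backend(scheme, opener):
    """Associate a URL scheme ("" means a plain local path) with an opener."""
    BACKENDS[scheme] = opener


def open_any(location):
    """Open `location` for binary reading with the backend matching its scheme."""
    scheme = urlparse(location).scheme  # "" for paths such as /path/to/file
    try:
        opener = BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"no IO backend registered for scheme {scheme!r}")
    return opener(location)


# Local files and HTTP(S) only need the standard library.
register_backend("", lambda loc: open(loc, "rb"))
register_backend("http", urlopen)
register_backend("https", urlopen)

# Illustrative in-memory backend standing in for a remote object store.
_FAKE_BUCKET = {"mem://bucket/path/to/file": b"doc1\ndoc2\n"}
register_backend("mem", lambda loc: io.BytesIO(_FAKE_BUCKET[loc]))
```

With such a registry, calling code would use `open_any(location)` everywhere and stay unaware of where the bytes come from; an unregistered scheme fails early with a `ValueError` instead of a confusing file-not-found error.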

Describe alternatives you've considered
Mounting S3 as a file system is possible, but it would probably slow down file access. It is also not officially endorsed by S3 storage providers such as Amazon.

Additional context
I think that Hadoop, YARN, and Beam all support multiple underlying protocols for file access.

seanmacavaney commented 1 year ago

The current structure actually already supports this! All you'd need to do is build a RequestsDownload object that isn't wrapped in Cache.
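The difference between a cached and an uncached download object can be sketched roughly as follows. `UrlSource` and `LocalCache` are hypothetical stand-ins written for this illustration; they are not the actual `RequestsDownload` and `Cache` classes, whose constructors and methods may differ. The only assumption is that both expose a `stream()` context manager that yields a readable binary file-like object.

```python
import contextlib
import os
import shutil
from urllib.request import urlopen


class UrlSource:
    """Hypothetical stand-in for a download object: streams from the URL on every use."""

    def __init__(self, url):
        self.url = url

    @contextlib.contextmanager
    def stream(self):
        with urlopen(self.url) as response:
            yield response


class LocalCache:
    """Hypothetical stand-in for a cache wrapper: copies the source to disk once."""

    def __init__(self, source, path):
        self.source = source
        self.path = path

    @contextlib.contextmanager
    def stream(self):
        if not os.path.exists(self.path):
            # Write to a temporary name first so a failed copy leaves no partial cache file.
            tmp_path = self.path + ".tmp"
            with self.source.stream() as src, open(tmp_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            os.replace(tmp_path, self.path)
        with open(self.path, "rb") as f:
            yield f


def count_bytes(source, chunk_size=1 << 16):
    """Consume any object with a stream() method, chunk by chunk."""
    total = 0
    with source.stream() as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            total += len(chunk)
    return total
```

Under these assumptions, `count_bytes(UrlSource(url))` reads directly from the remote location each time and never touches local disk, while `count_bytes(LocalCache(UrlSource(url), path))` downloads once and reads the local copy afterwards. Consumers only rely on `stream()`, so dropping the cache wrapper does not require changing them.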