Is your feature request related to a problem? Please describe.
With increasingly large datasets, "conventional" file storage is not always the ideal way to store them.
For example, the Common Crawl datasets are distributed via S3 buckets.
Describe the solution you'd like
To integrate such datasets with ir_datasets, it would be great to be able to swap the IO "backend".
For example, instead of opening a file at /path/to/file, we could open it from s3.example.com/bucket/path/to/file.
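To make the idea concrete, here is a rough sketch of what a swappable backend could look like. None of this is existing ir_datasets API: `register_backend`, `open_stream`, and the `fake://` scheme are hypothetical names made up for illustration, and the in-memory store only stands in for a real remote service such as S3 so that the sketch runs without network access.

```python
import io
from urllib.parse import urlparse

# Hypothetical registry: URL scheme -> callable that returns a binary file object.
_BACKENDS = {}


def register_backend(scheme, opener):
    """Associate a URL scheme (e.g. "file") with an opener callable."""
    _BACKENDS[scheme] = opener


def open_stream(location):
    """Open `location` for binary reading using the backend for its scheme."""
    parsed = urlparse(location)
    # A bare path such as /path/to/file has no scheme; treat it as local.
    scheme = parsed.scheme or "file"
    try:
        opener = _BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"no IO backend registered for scheme {scheme!r}") from None
    return opener(parsed)


def _open_local(parsed):
    # Local backend: a plain open() on the path component.
    return open(parsed.path, "rb")


# Assumption: an in-memory dict standing in for a remote object store.
_FAKE_STORE = {("bucket", "/path/to/file"): b"hello from a fake bucket\n"}


def _open_fake_remote(parsed):
    # "Remote" backend: look up (bucket, key) and wrap the bytes in a stream.
    return io.BytesIO(_FAKE_STORE[(parsed.netloc, parsed.path)])


register_backend("file", _open_local)
register_backend("fake", _open_fake_remote)

with open_stream("fake://bucket/path/to/file") as stream:
    data = stream.read()
```

A real S3 backend would be registered the same way, with an opener built on whatever S3 client is preferred. Callers would keep using `open_stream` and never need to know where the bytes come from.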
Describe alternatives you've considered
Mounting S3 as a file system is possible, but it would probably slow down file access. Also, this approach is not officially endorsed by S3 storage providers like Amazon.
Additional context
I think that Hadoop/YARN/Beam all support multiple underlying protocols for file access.