Many tools can read natively from blob storage. At a high level, operations on blob storage look similar to those of a POSIX filesystem, but at a low level they differ enough to deserve native integration (as opposed to a FUSE-style integration). This ability also allows a single worker to reference data that is larger than memory or the local filesystem.
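One concrete instance of the low-level difference is reading by byte range: a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1, so a native reader can fetch only the tail of an object and then only the footer, rather than the whole file. The following is a minimal sketch of that arithmetic, not a proposed implementation. The class and method names are hypothetical, and it only builds HTTP Range header values of the kind an S3-style GET accepts; it does not call any blob storage API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: computes which byte ranges to request for a Parquet footer.
public class ParquetTailSketch {
    // A Parquet file ends with: [4-byte little-endian footer length]["PAR1"].
    static final int TAIL_BYTES = 8;

    // Suffix range asking for the last 8 bytes of the object.
    static String tailRangeHeader() {
        return "bytes=-" + TAIL_BYTES;
    }

    // Decodes the footer length from the 8-byte tail, validating the magic bytes.
    static int footerLength(byte[] tail) {
        if (tail.length != TAIL_BYTES) {
            throw new IllegalArgumentException("expected exactly " + TAIL_BYTES + " bytes");
        }
        String magic = new String(tail, 4, 4, StandardCharsets.US_ASCII);
        if (!magic.equals("PAR1")) {
            throw new IllegalArgumentException("not a Parquet tail: " + magic);
        }
        return ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    // Inclusive byte range covering the footer metadata, which sits just before the tail.
    static String footerRangeHeader(long objectSize, int footerLength) {
        long end = objectSize - TAIL_BYTES - 1;
        long start = end - footerLength + 1;
        return "bytes=" + start + "-" + end;
    }

    public static void main(String[] args) {
        byte[] tail = {100, 0, 0, 0, 'P', 'A', 'R', '1'};
        int length = footerLength(tail);
        System.out.println(tailRangeHeader());
        System.out.println(footerRangeHeader(1000L, length));
    }
}
```

With a FUSE-style mount, the same access pattern is expressed as seeks and reads, and how those map onto requests is up to the mount layer. A native integration can choose the ranges itself.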
At a minimum, this involves implementing a TableLocationKeyFinder for native blob storage APIs, and likely an equivalent or extension of ParquetSingleFileLayout (with an equivalent or extension of ParquetTableLocationKey).
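As a rough illustration of the shape such a key finder might take, the sketch below lists objects under a prefix and reports one location key per Parquet object. Everything in it is a hypothetical stand-in: BlobLister, BlobLocationKey, and BlobKeyFinder are not the existing TableLocationKeyFinder or ParquetTableLocationKey types, and the listing call is faked in memory rather than backed by a real S3 client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BlobKeyFinderSketch {
    // Hypothetical stand-in for a blob-store listing call (e.g. one backed by an S3 client).
    interface BlobLister {
        List<String> listKeys(String bucket, String prefix);
    }

    // Hypothetical location key: identifies a single Parquet object by bucket and key.
    static final class BlobLocationKey {
        final String bucket;
        final String key;

        BlobLocationKey(String bucket, String key) {
            this.bucket = bucket;
            this.key = key;
        }

        @Override
        public String toString() {
            return "s3://" + bucket + "/" + key;
        }
    }

    // Hypothetical finder: reports one location key per ".parquet" object under a prefix.
    static final class BlobKeyFinder {
        private final BlobLister lister;
        private final String bucket;
        private final String prefix;

        BlobKeyFinder(BlobLister lister, String bucket, String prefix) {
            this.lister = lister;
            this.bucket = bucket;
            this.prefix = prefix;
        }

        void findKeys(Consumer<BlobLocationKey> observer) {
            for (String key : lister.listKeys(bucket, prefix)) {
                // Skip marker objects and other non-Parquet entries.
                if (key.endsWith(".parquet")) {
                    observer.accept(new BlobLocationKey(bucket, key));
                }
            }
        }
    }

    public static void main(String[] args) {
        // In-memory fake listing; a real implementation would page through the store's list API.
        BlobLister fake = (bucket, prefix) ->
                List.of(prefix + "a.parquet", prefix + "_SUCCESS", prefix + "b.parquet");
        List<BlobLocationKey> found = new ArrayList<>();
        new BlobKeyFinder(fake, "my-bucket", "trades/").findKeys(found::add);
        found.forEach(System.out::println);
    }
}
```

Keeping the listing behind a small interface is one way to leave room for other S3-compatible providers without changing the finder itself.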
AWS S3 as a first target is likely the best choice, as most of the other blob storage providers also offer S3-compatible APIs.
https://github.com/awslabs/aws-java-nio-spi-for-s3 could be interesting if it proves hard to use existing blob storage APIs, although the performance probably wouldn't be as good as a direct native integration.