deephaven / deephaven-core

Deephaven Community Core

Native blob storage parquet reading support #4836

Open devinrsmith opened 10 months ago

devinrsmith commented 10 months ago

Many tools can read natively from blob storage. At a high level, operations on blob storage look similar to those of a POSIX filesystem, but at a low level there are enough differences that blob storage deserves native integration (as opposed to a FUSE-style integration). Native access also allows a single worker to reference data that is larger than its memory or local filesystem.
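
To illustrate one low-level difference: a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`, so a reader backed by blob storage can fetch just the footer metadata with ranged reads instead of downloading the whole object. The sketch below only computes the byte range; the function names are hypothetical and are not part of Deephaven or any S3 SDK.

```python
import struct
from typing import Tuple

PARQUET_MAGIC = b"PAR1"
TRAILER_LEN = 8  # 4-byte little-endian footer length + 4-byte magic


def footer_byte_range(object_size: int, trailer: bytes) -> Tuple[int, int]:
    """Return the inclusive (start, end) byte range of the footer metadata.

    `trailer` is the last 8 bytes of the object, fetched separately.
    """
    if len(trailer) != TRAILER_LEN or trailer[4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet trailer")
    (footer_len,) = struct.unpack("<I", trailer[:4])
    end = object_size - TRAILER_LEN - 1  # last byte before the trailer
    start = end - footer_len + 1
    if start < len(PARQUET_MAGIC):  # bytes 0..3 hold the leading magic
        raise ValueError("footer length exceeds object size")
    return start, end


def range_header(start: int, end: int) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"
```

With a 1000-byte object whose trailer declares a 100-byte footer, this yields `bytes=892-991`: two small ranged reads (trailer, then footer) are enough to learn the schema and row-group layout.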

At a minimum, this involves implementing a TableLocationKeyFinder for native blob storage APIs, likely as an equivalent or extension of ParquetSingleFileLayout (with an equivalent or extension of ParquetTableLocationKey).
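
As a rough, language-agnostic sketch of what such a key finder might do (written in Python for brevity, not against the actual Deephaven interfaces): take an object listing, keep the `.parquet` objects, and hand a location key to an observer callback. `BlobParquetLocationKey`, `parse_partitions`, and `find_keys` are hypothetical names, and the callback shape and the `key=value` partition parsing are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class BlobParquetLocationKey:
    """Hypothetical stand-in for a blob-backed Parquet location key."""
    uri: str
    partitions: Tuple[Tuple[str, str], ...]


def parse_partitions(object_key: str) -> Tuple[Tuple[str, str], ...]:
    """Extract `name=value` directory segments from an object key."""
    parts = []
    for segment in object_key.split("/")[:-1]:  # skip the file name itself
        name, sep, value = segment.partition("=")
        if sep and name and value:
            parts.append((name, value))
    return tuple(parts)


def find_keys(
    bucket: str,
    object_keys: Iterable[str],
    observer: Callable[[BlobParquetLocationKey], None],
) -> None:
    """Report one location key per Parquet object in the listing."""
    for object_key in object_keys:
        if object_key.endswith(".parquet"):
            observer(
                BlobParquetLocationKey(
                    uri=f"s3://{bucket}/{object_key}",
                    partitions=parse_partitions(object_key),
                )
            )
```

In a real integration the listing would come from the blob storage API (paginated), and the key would carry whatever the table location needs to issue ranged reads.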

AWS S3 as a first target is likely the best choice, as most of the other blob storage providers also offer S3-compatible APIs.

$ duckdb -s "SELECT COUNT(*) FROM read_parquet('s3://aws-public-blockchain/v1.0/btc/transactions/date=2023-11-13/part-00000-da3a3c27-700d-496d-9c41-81281388eca8-c000.snappy.parquet');" 
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│       428507 │
└──────────────┘

devinrsmith commented 10 months ago

https://github.com/awslabs/aws-java-nio-spi-for-s3 could be interesting if it proves hard to use existing blob storage APIs, although the performance probably wouldn't be as good as a direct native integration.