ankane / ruby-polars

Blazingly fast DataFrames for Ruby
MIT License

Feature: Add support for scanning parquet from cloud storage / S3 #37

Open catkins opened 11 months ago

catkins commented 11 months ago

(let me know if this is already possible with extra existing plumbing)

So in the Python bindings for Polars, it's able to do the more optimised byte-range trickery for working with Parquet files in S3, and it also supports scanning directories of Parquet files in S3.

https://pola-rs.github.io/polars/user-guide/io/cloud-storage/#scanning-from-cloud-storage-with-query-optimisation

I'm also happy to help contribute this at some point if you think it'd be worthwhile.

ankane commented 11 months ago

Hi @catkins, happy to include this if you want to submit a PR (it's essentially porting code from py-polars).

catkins commented 11 months ago

Awesome, I'll see how I go.

I'm looking at the relevant bit of py-polars, so I'll see if I can plumb it through in a similar way.

https://github.com/pola-rs/polars/blob/main/py-polars/src/lazyframe.rs/#L258-L313

ankane commented 11 months ago

Looking at it more, it'd require including a TLS library in Rust (for HTTPS connections), which isn't something I'd like to do right now, so I think it'd be better to try to do this outside of the gem with the Ruby S3 client.

catkins commented 11 months ago

To clarify, I grok how the TLS lib is getting pulled in.

"it'd be better to try to do this outside of the gem with the Ruby S3 client"

So that was actually my first approach, but I got tripped up on LazyFrame#new_from_parquet only accepting a file path, with no way to pass in some kind of IO type.

https://github.com/ankane/polars-ruby/blob/f47f67971ad0af39a8dbd4c8e50d0d6a9f662f25/ext/polars/src/lazyframe.rs#L146

Happy to hear if I was barking up the wrong tree though...
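
For reference, the "outside of the gem" approach discussed above could be sketched roughly as follows, assuming the object is small enough to download in full. The helper name `with_local_parquet` and the `download` callable are hypothetical, introduced only for illustration. This sidesteps the file-path-only limitation by writing to a temp file first, but it does not give the byte-range or directory-scanning optimisations the issue asks for.

```ruby
require "tempfile"

# Hypothetical helper (not part of polars-ruby): downloads one object to a
# temp file, yields the local path, and removes the file afterwards.
# `download` is any callable taking (bucket, key, local_path).
def with_local_parquet(bucket:, key:, download:)
  Tempfile.create(["polars", ".parquet"]) do |file|
    file.close # the downloader writes to the path, not to this handle
    download.call(bucket, key, file.path)
    yield file.path
  end
end

# Assumed usage with the aws-sdk-s3 and polars-df gems (not loaded here):
#
#   s3 = Aws::S3::Client.new
#   download = lambda do |bucket, key, path|
#     s3.get_object(bucket: bucket, key: key, response_target: path)
#   end
#   df = with_local_parquet(bucket: "my-bucket", key: "data/events.parquet",
#                           download: download) do |path|
#     Polars.read_parquet(path) # read eagerly; the file is gone after the block
#   end
```

Reading eagerly inside the block matters here: a lazy scan collected after the block returns would point at a file that has already been deleted.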

catkins commented 11 months ago

I'm most of the way through the first part of plumbing in the aws feature and wiring it through to lazyframe locally, so I can throw up a fork to look at and chat about after I get it more solid.

Failing that, I guess my fallback for doing what I was hoping to do was using the red-arrow gems, but polars-ruby was much more user friendly to use and install.

catkins commented 10 months ago

To satisfy my own curiosity, I did a rough-cut PR implementing it.

If you're not keen on supporting it at this time, that's understandable too.

nicosuave commented 10 months ago

I would also enjoy this addition if @ankane would consider it. (Can you tell I'm spoiled by DuckDB?)

DeflateAwning commented 7 months ago

We're considering doing this exact thing; would be awesome if reliable support for this feature was added!

catkins commented 7 months ago

We've got a few use cases that having this would simplify for us too, so I'd love to have it in the library, leveraging the Rust implementations.

nicosuave commented 7 months ago

Good ol' "reverse ETL"; same situation here