ankane / ruby-polars

Blazingly fast DataFrames for Ruby
MIT License

Feature: Add support for scanning parquet from cloud storage / S3 #37

Open catkins opened 1 year ago

catkins commented 1 year ago

(let me know if this is already possible with extra existing plumbing)

The Python bindings for Polars can do the more optimised byte-range trickery when working with Parquet files in S3, and they also support scanning directories of Parquet files in S3.

https://pola-rs.github.io/polars/user-guide/io/cloud-storage/#scanning-from-cloud-storage-with-query-optimisation

I'm also happy to help contribute this at some point if you think it'd be worthwhile.
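
For readers unfamiliar with the byte-range approach mentioned above: a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`, so a reader can fetch the footer metadata with two small ranged requests instead of downloading the whole object. The following is a minimal Ruby sketch of that arithmetic only. The helper names (`trailer_range`, `footer_length`, `footer_range`) are hypothetical and are not part of polars-ruby or any S3 client; the resulting strings are the kind of value one would pass as an HTTP `Range` header.

```ruby
# A Parquet file ends with: <footer metadata> <4-byte LE footer length> "PAR1".
PARQUET_MAGIC = "PAR1".b
TRAILER_SIZE = 8 # 4-byte footer length + 4-byte magic

# Hypothetical helper: Range header value covering the last 8 bytes of an
# object of the given size (the size would come from e.g. a HEAD request).
def trailer_range(object_size)
  # Smallest possible file: leading magic (4) + trailer (8).
  raise ArgumentError, "object too small to be a Parquet file" if object_size < TRAILER_SIZE + 4

  "bytes=#{object_size - TRAILER_SIZE}-#{object_size - 1}"
end

# Hypothetical helper: parses the 8-byte trailer and returns the footer
# metadata length in bytes.
def footer_length(trailer)
  trailer = trailer.b
  raise ArgumentError, "expected #{TRAILER_SIZE} bytes" unless trailer.bytesize == TRAILER_SIZE
  raise ArgumentError, "missing PAR1 magic" unless trailer.byteslice(4, 4) == PARQUET_MAGIC

  trailer.unpack1("V") # "V" = 32-bit unsigned, little-endian
end

# Hypothetical helper: Range header value covering the footer metadata itself.
def footer_range(object_size, footer_len)
  start = object_size - TRAILER_SIZE - footer_len
  raise ArgumentError, "footer length exceeds object size" if start < 4

  "bytes=#{start}-#{object_size - TRAILER_SIZE - 1}"
end
```

Once the footer is decoded, the same idea applies to individual row groups and column chunks, which is what allows a scan to skip data the query does not need.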

ankane commented 1 year ago

Hi @catkins, happy to include this if you want to submit a PR (it's essentially porting code from py-polars).

catkins commented 1 year ago

Awesome, I'll see how I go.

I'm looking at the relevant bit of py-polars, so I'll see if I can plumb it through in a similar way.

https://github.com/pola-rs/polars/blob/main/py-polars/src/lazyframe.rs/#L258-L313

ankane commented 1 year ago

Looking at it more, it'd require including a TLS library in Rust (for HTTPS connections), which isn't something I'd like to do right now, so think it'd be better to try to do this outside of the gem with the Ruby S3 client.

catkins commented 1 year ago

To clarify, I grok how the TLS lib is getting pulled in.

"it'd be better to try to do this outside of the gem with the Ruby S3 client"

So that was actually my first approach, but I got tripped up on LazyFrame#new_from_parquet only accepting a file path, with no way to pass in some kind of IO type.

https://github.com/ankane/polars-ruby/blob/f47f67971ad0af39a8dbd4c8e50d0d6a9f662f25/ext/polars/src/lazyframe.rs#L146

Happy to hear if I was barking up the wrong tree though...

catkins commented 1 year ago

I'm most of the way through the first part of plumbing in the aws feature and wiring it through to lazyframe locally, so I can throw up a fork to look at and chat about after I get it more solid.

Failing that, I guess my fallback for doing what I was hoping to do was using the red-arrow gems, but polars-ruby was much more user friendly to use and install.

catkins commented 11 months ago

To satisfy my own curiosity, I did a rough-cut PR implementing it.

If you're not keen on supporting it at this time, that's understandable too.

nicosuave commented 11 months ago

I would also enjoy this addition if @ankane would consider it. (Can you tell I'm spoiled by DuckDB?)

DeflateAwning commented 8 months ago

We're considering doing this exact thing; would be awesome if reliable support for this feature was added!

catkins commented 8 months ago

We've got a few use cases that having this would simplify for us too, so I'd love to have it in the library, leveraging the Rust implementation.

nicosuave commented 8 months ago

Good ol' "reverse ETL"; same situation here

jeromepl commented 1 month ago

@ankane at work we also tried to do this outside of this gem but got stuck on the same limitation as @catkins that the scan_parquet method does not accept IO objects.

We can get around it by using a Tempfile, but then there is no way to do partial reads from S3 in order to process larger-than-memory datasets (or just to improve performance by limiting network usage).

I'm of the same opinion as others that this gem should match the python library in terms of functionality and therefore that we should add support for connecting to S3 directly in this gem.
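
For anyone landing here, the Tempfile workaround described above might look roughly like the sketch below. It assumes a client that behaves like `Aws::S3::Client` from the aws-sdk-s3 gem, whose `get_object` streams the body in chunks when given a block. The helper name `with_s3_parquet` and the bucket and key in the usage comment are hypothetical. As noted in the thread, this downloads the whole object, so it does not give you partial reads.

```ruby
require "tempfile"

# Hypothetical helper: streams an S3 object into a temporary file and yields
# the file's path. The file is deleted when the block returns.
# `client` is assumed to respond to get_object(bucket:, key:) and yield the
# body in chunks, as Aws::S3::Client does when called with a block.
def with_s3_parquet(client, bucket:, key:)
  Tempfile.create(["s3-object", ".parquet"]) do |file|
    file.binmode
    client.get_object(bucket: bucket, key: key) { |chunk| file.write(chunk) }
    file.flush # make sure everything is on disk before handing out the path
    yield file.path
  end
end

# Usage (not run here; requires the aws-sdk-s3 and polars-df gems):
#
#   client = Aws::S3::Client.new
#   with_s3_parquet(client, bucket: "my-bucket", key: "events.parquet") do |path|
#     df = Polars.scan_parquet(path).collect
#   end
```

Collecting inside the block matters, because the lazy scan only holds the path and the file is removed once the block exits.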