elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License
1.13k stars 123 forks source link

Bump Polars to v0.41.3 #917

Closed lkarthee closed 4 months ago

philss commented 4 months ago

We are almost there! But a problem regarding the cloud integration is still blocking this. I'm trying to fix it by updating object_store and our custom CloudWriter (Rust side), but I suspect that the problem is more complex due to some Polars' changes. I will keep trying :)

philss commented 4 months ago

There is a good news and a bad news. The good news is that we have only one failing test! 🎉 🥳 But the bad news is that we may need to submit changes to Polars in order to fix the failing test.

In a nutshell, the problem is on the LazyFrame#sink_parquet_cloud function from Polars that is failing probably because of the reasons I wrote in this issue: https://github.com/pola-rs/polars/issues/17172#issuecomment-2243328412

We can either wait for the fix to land on Polars, or deactivate the feature until there. This is going to affect people using the :streaming option of to_parquet/2 (cloud version). WDYT @josevalim @lkarthee @billylanchantin?


EDIT: I'm going to play with an idea to use the sink_parquet and another process to read the file, that can be a fifo/named pipe, and write to the cloud storage. I will let you know by the end of the day.

josevalim commented 4 months ago

What do we for when streaming is not supported? We raise or we just do the best we can? I am worried Polars is not ultimately interested in maintaining these features. We have another related PR that is open for quite some time.

philss commented 4 months ago

What do we for when streaming is not supported? We raise or we just do the best we can?

I think we can raise if it's not possible to support a given feature, but we can try to implement "hacks" when possible, until we need to remove the feature all together. Unfortunately we can't have our patches (via monkey patches) to the Polars codebase, and I don't think it's a good idea to maintain a fork to solve our issues. WDYT?

I am worried Polars is not ultimately interested in maintaining these features. We have another related PR that is open for quite some time.

Yeah, I'm worried too. Maybe it's not their focus to maintain this cloud integration in their Rust codebase.

josevalim commented 4 months ago

Yeah, let's raise for now and add a note to the CHANGELOG. Meanwhile, if you could send a PR to them, it would be great!

philss commented 4 months ago

@josevalim OK, I will add a note to the change log later. My PoC didn't work out, but I was wondering if we could still "enable streaming", but first stream to a temporary file, and them read from that file and send to the cloud. This could keep the convenience of saving to S3 right from Explorer, with the downside of requiring some space in disk, and probably extra time. WDYT?

Meanwhile, if you could send a PR to them, it would be great!

Yeah, I will try that! Something similar to our CloudWriter may work.

josevalim commented 4 months ago

Let’s just disable for now. People can stay in the old version until Polars is updated with a fix.

lkarthee commented 4 months ago

+1 for disabling and we should document it in the changelog.