delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

chore: bump to datafusion 39, arrow 52, pyo3 0.21 #2581

Closed abhiaagarwal closed 2 weeks ago

abhiaagarwal commented 2 weeks ago

Description

Updates the arrow and datafusion dependencies to 52 and 39(-rc1) respectively. This is necessary for updating pyo3.

While most changes were trivial, some required big rewrites. Namely, the logic for the Update operation had to be rewritten (and simplified) to accommodate some new sanity checks inside datafusion (https://github.com/apache/datafusion/pull/10088).

Depends on delta-kernel having its arrow and object-store versions bumped as well. This PR doesn't include any major changes for pyo3; I'll open a separate PR that builds on this one.

Related Issue(s)

Documentation

abhiaagarwal commented 2 weeks ago

This is now ready for review. All Python and Rust tests are passing locally on my machine (the docs seem to have issues building, though). This only enables the gil-refs feature in the Python bindings; it does not migrate to the new API (that will be done in a follow-up PR).

ion-elgreco commented 2 weeks ago

Nice work! @abhiaagarwal

abhiaagarwal commented 2 weeks ago

Thank you @ion-elgreco! DF 39 should be released later this week, and the upstream delta-kernel needs to be resolved as well. I'll open a follow-up PR for moving to the new pyo3 lifetimes, and I'm also interested in implementing an async-native API using pyo3-asyncio since a good chunk of my work code can benefit from it :)

ion-elgreco commented 2 weeks ago

@abhiaagarwal yeah, leaving it unmerged for now. Also checking with the rest on Slack whether they are fine with keeping the git reference to delta-kernel-rs for the time being.

Great, you can ping me once you have the new Bound APIs working!

Just two things regarding pyo3-asyncio. It seems the original maintainer is not very active anymore, so it's still locked to pyo3 0.20 unfortunately.

The other thing is, what do you need it for? :) Most of the async stuff is internal.

abhiaagarwal commented 2 weeks ago

@ion-elgreco yeah, but it appears that the maintainer of pyo3 is interested in taking stewardship over it, so maybe that'll be resolved soon.

I'm mostly interested to see if there are any perf improvements. Most of the code I use with the python deltalake package is already async native (i.e. async main), so having async methods won't "infect" the rest of our codebase. We use it as a pseudo backend with FastAPI, for reference. There appears to be work being done in the main pyo3 repo that'll allow the GIL to be released in async methods, so we could have the best of both worlds.
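For context, here is a minimal sketch of how an async application can keep a blocking call off the event loop today, which is the pattern an async-native API would make unnecessary. It uses only the standard library; `load_table_version` is a hypothetical stand-in for any synchronous call and is not part of the deltalake package.

```python
import asyncio
import time


def load_table_version(path: str) -> int:
    """Hypothetical blocking call; a stand-in, not a deltalake API."""
    time.sleep(0.05)  # simulate blocking I/O
    return len(path)  # dummy result so the example is checkable


async def handler(path: str) -> int:
    # asyncio.to_thread runs the blocking function in a worker thread,
    # so the event loop stays free to serve other requests meanwhile.
    return await asyncio.to_thread(load_table_version, path)


async def main() -> list:
    # Both calls overlap instead of running back to back.
    return await asyncio.gather(handler("a"), handler("bb"))


results = asyncio.run(main())
```

Whether a native async binding would beat this thread-offloading approach is exactly the open perf question; the sketch only shows the baseline it would be compared against.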

ion-elgreco commented 2 weeks ago

@abhiaagarwal can you bump delta-kernel-rs to 0.1.1? It includes your changes now. Could you also bump the version in the Python Cargo.toml to 0.18.1?

abhiaagarwal commented 2 weeks ago

@ion-elgreco sure, I'll do it by EOD. Also will bump DF since they released 39

abhiaagarwal commented 2 weeks ago

Bumped to published versions. Also added a new test that seems to confirm that the decimal-related issues in #1778 et al. are fixed on both engines.

ion-elgreco commented 2 weeks ago

> Bumped to published versions, also added a new test that seemingly confirms that the issues related to decimals in #1778 et al on both engines seem to be fixed

Yes, I had been waiting a while for the arrow bump, since it added scientific notation support for decimals :)

Thanks for adding the test! 🙂
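To illustrate what scientific notation for decimals means in the comment above, here is a small sketch using Python's standard `decimal` module. This is not arrow's parser and not the test added in this PR; it only shows that a string such as `1.5E+3` denotes the same decimal value as `1500`.

```python
from decimal import Decimal

# The same value written in plain and in scientific notation.
plain = Decimal("1500")
scientific = Decimal("1.5E+3")
same_value = plain == scientific

# A small value in scientific notation, rendered in fixed-point form.
small = Decimal("1.23E-2")
as_fixed = format(small, "f")
```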