delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
1.97k stars 365 forks

chore: migrate to pyo3 Bounds API #2596

Closed abhiaagarwal closed 2 weeks ago

abhiaagarwal commented 2 weeks ago

Description

This migrates the Python package to use the new pyo3 bounds-based API, which allows more control over memory management on the library side and theoretical performance improvements (I benchmarked, and didn't notice anything substantial). The old API will be removed in 0.22.

Related Issue(s)

Documentation

abhiaagarwal commented 2 weeks ago

Thanks mate! FYI, pyo3-asyncio has been forked into the main org. I'm going to open a PR later with an (experimental) Python asyncio API. https://github.com/awestlake87/pyo3-asyncio/issues/126#issuecomment-2166729350

ion-elgreco commented 2 weeks ago

Yeah, sounds good. It would be great if you could add some benchmarks with it to see whether there is any perf benefit.

roeap commented 2 weeks ago

@abhiaagarwal, @ion-elgreco - a while ago we decided against using pyo3-asyncio (in case that is what you are planning :)) in favour of managing a long-lived runtime like lancedb does. Unless there is a way to share the runtime that asyncio uses with the runtime we use to drive the main Rust code, I would still be very hesitant to introduce pyo3-asyncio in this project.

That said, there would be a noticeable performance benefit to having a longer-lived runtime in the Python bindings.

see here for more context.
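
To illustrate the long-lived runtime idea, here is a minimal Python sketch. It is only an analogy: the runtime discussed above is a Rust runtime inside the bindings, while this uses an asyncio event loop on a background thread. The names `fake_io`, `LongLivedRuntime`, `block_on`, and `time_calls` are hypothetical and do not come from delta-rs or lancedb.

```python
import asyncio
import threading
import time


async def fake_io(x: int) -> int:
    # Hypothetical stand-in for real async work.
    await asyncio.sleep(0)
    return x * 2


def call_with_fresh_runtime(x: int) -> int:
    # Creates and tears down a new event loop on every call.
    return asyncio.run(fake_io(x))


class LongLivedRuntime:
    """One event loop, started once on a daemon thread and reused by all calls."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def block_on(self, coro):
        # Submit the coroutine to the shared loop and block until it finishes.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()

    def shutdown(self) -> None:
        # Ask the loop to stop, wait for its thread, then release resources.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


def time_calls(fn, n: int = 200) -> float:
    # Wall-clock seconds for n calls of fn; results depend on the machine.
    start = time.perf_counter()
    for i in range(n):
        fn(i)
    return time.perf_counter() - start


if __name__ == "__main__":
    runtime = LongLivedRuntime()
    fresh = time_calls(call_with_fresh_runtime)
    shared = time_calls(lambda i: runtime.block_on(fake_io(i)))
    print(f"fresh loop per call: {fresh:.4f}s, shared loop: {shared:.4f}s")
    runtime.shutdown()
```

The shared loop pays the setup cost once instead of on every call, which is the kind of saving a longer-lived runtime in the bindings would be expected to give. The actual numbers would need to be measured.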

abhiaagarwal commented 2 weeks ago

@roeap absolutely, this is just a POC. I use pyo3 in an async context for most of my code (FastAPI), so I'm interested to see whether there are any performance gains I can get for "free". I already wrote an async version of write_deltalake, and I'm now figuring out the best way to benchmark it.

Indeed, that PR was forced to add an Arc on the internal Deltalake, which I'm trying to avoid, so for now I'm focusing only on free functions, where I can reason about the state more easily.
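
One possible shape for such a benchmark, as a rough sketch only: it swaps in `asyncio.to_thread` around a blocking call rather than a pyo3-asyncio future, and `blocking_write` is a hypothetical stand-in for a free function like write_deltalake (it just sleeps). The table URIs are made up. Any speedup from this pattern depends on the underlying call releasing the GIL.

```python
import asyncio
import time


def blocking_write(table_uri: str, rows: int) -> int:
    # Hypothetical stand-in for a blocking write; sleeps instead of doing I/O.
    time.sleep(0.01)
    return rows


async def async_write(table_uri: str, rows: int) -> int:
    # Offload the blocking call to a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_write, table_uri, rows)


async def write_many(uris, rows: int):
    # Run all writes concurrently; results come back in input order.
    return await asyncio.gather(*(async_write(u, rows) for u in uris))


def bench(n: int = 8, rows: int = 100):
    uris = [f"memory://table_{i}" for i in range(n)]  # made-up URIs

    start = time.perf_counter()
    sequential = [blocking_write(u, rows) for u in uris]
    sequential_s = time.perf_counter() - start

    start = time.perf_counter()
    concurrent = asyncio.run(write_many(uris, rows))
    concurrent_s = time.perf_counter() - start

    return sequential, concurrent, sequential_s, concurrent_s


if __name__ == "__main__":
    seq, conc, seq_s, conc_s = bench()
    print(f"sequential: {seq_s:.3f}s, concurrent: {conc_s:.3f}s")
```

Keeping the wrapped call a free function, as described above, means no shared table state has to cross threads in this sketch.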