delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

deltalake-python: Missing support for adding new factories and logstores #2818

Open MartinKolbAtWork opened 3 months ago

MartinKolbAtWork commented 3 months ago

Description

The Python binding has a hard-coded list of deltalake handlers that are registered: https://github.com/delta-io/delta-rs/blob/fcd62ab10be9545eab296f03ae0e4324fc4be6cb/python/src/lib.rs#L2035-L2039

To add support for another object store (SAP BTP), we have a Rust crate available, but we did not find a way to register its handlers with the existing Python binding. The shared library that ships with deltalake-python does not expose an entry point for adding new object stores. We ended up forking delta-rs and adding the registration call as another line in the list above, but we don’t think a fork is an appropriate way to add an additional object store.

Have we missed something here? Shouldn’t there be a way to add further stores beyond the 5 existing ones?
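
To make the request concrete, here is a minimal sketch of the kind of extension point being asked for: a registry keyed by URL scheme that callers can add to at runtime, instead of a fixed list compiled into the binding. Every name in it (`register_factory`, `open_store`, `InMemoryStore`, the `example` scheme) is hypothetical and chosen for illustration; none of this is an existing deltalake-python API.

```python
from typing import Callable, Dict
from urllib.parse import urlparse

# Hypothetical sketch: none of these names exist in deltalake-python.
# A factory takes the table URL and storage options and returns a store object.
StoreFactory = Callable[[str, Dict[str, str]], object]

_FACTORIES: Dict[str, StoreFactory] = {}


def register_factory(scheme: str, factory: StoreFactory) -> None:
    """Register a factory for a URL scheme, replacing any previous entry."""
    _FACTORIES[scheme.lower()] = factory


def open_store(url: str, options: Dict[str, str]) -> object:
    """Pick the factory by URL scheme and build the store."""
    scheme = urlparse(url).scheme.lower()
    try:
        factory = _FACTORIES[scheme]
    except KeyError:
        raise ValueError(f"no object store registered for scheme {scheme!r}") from None
    return factory(url, options)


class InMemoryStore:
    """Stand-in for one of the built-in stores."""

    def __init__(self, url: str, options: Dict[str, str]) -> None:
        self.url = url
        self.options = options


# Built-in registration, analogous to the fixed list in the binding.
register_factory("memory", InMemoryStore)

# What a third party would like to be able to do without forking:
register_factory("example", lambda url, options: ("example-store", url))
```

The hard part discussed in this issue is not the registry itself but that the real factories live in a separately compiled Rust library.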

ion-elgreco commented 3 months ago

I'll take a look at what an API should look like to expose and register an external handler through Python.

ion-elgreco commented 3 months ago

@MartinKolbAtWork it seems that this might not be possible, or is at least quite complex. I can't find anything in the PyO3 docs on how to achieve this.

Do you have the SAP BTP Object store published somewhere?

MartinKolbAtWork commented 3 months ago

Hi @ion-elgreco, thanks for looking into this. The integration for SAP BTP is currently used internally at SAP and might be published later, but I cannot share the code at the moment. It actually uses “SAP Data Lake Files” (https://help.sap.com/docs/hana-cloud-data-lake/user-guide-for-data-lake-files/understanding-data-lake-files) as object storage, which is accessible via SAP’s Business Technology Platform (BTP, https://www.sap.com/products/technology-platform.html).

I also investigated a possible solution. It is especially challenging because the shared library packaged in the deltalake-python wheel would need to be binary compatible with the shared library packaged with the “add-on”. Ensuring binary compatibility between these libraries (e.g. with respect to the Rust version and the version of the deltalake crate used) would be hard to achieve. An approach that “tunnels” all calls between the two Rust libraries through Python could mitigate the binary compatibility issues but would probably suffer from poor performance.
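
A rough sketch of what the “tunnel” approach would mean in practice, assuming the add-on exposes its store to Python as an object with byte-oriented methods. The interface and class names (`PyObjectStore`, `DictStore`, `copy_object`) are hypothetical and do not correspond to any existing API. Only plain `str` and `bytes` values cross the boundary, so no Rust ABI compatibility is required, but every single operation becomes a call through the Python interpreter.

```python
from typing import Dict, Protocol


class PyObjectStore(Protocol):
    """Hypothetical byte-oriented interface an add-on would expose to Python."""

    def get(self, path: str) -> bytes: ...

    def put(self, path: str, data: bytes) -> None: ...


class DictStore:
    """Stand-in for an add-on store; counts how often the boundary is crossed."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}
        self.calls = 0

    def get(self, path: str) -> bytes:
        self.calls += 1
        return self._objects[path]

    def put(self, path: str, data: bytes) -> None:
        self.calls += 1
        self._objects[path] = data


def copy_object(store: PyObjectStore, src: str, dst: str) -> None:
    """Host-side operation: each step is a separate call into the add-on."""
    store.put(dst, store.get(src))
```

Copying one object already takes two calls across the boundary, and the payload is materialized as a Python `bytes` object on the way. That is the overhead the performance concern refers to.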

Xuanwo commented 2 months ago

Hello, I'm from the OpenDAL community, which aims to provide storage access to various services in multiple languages. Perhaps we can build something extensible to allow us to integrate with more storage services easily.

Tools we have now:

ion-elgreco commented 2 months ago

@Xuanwo hey, I wasn't aware that OpenDAL has an object_store implementation, that's useful!

Any help on this is much appreciated :)