apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] is it possible to read/write hudi files with another programming language? #7446

Closed schlichtanders closed 1 month ago

schlichtanders commented 1 year ago

Hi,

I am curious about the state of Hudi. We are currently using it via Spark, but are thinking about switching to another language.

Is it possible to write Hudi files via C, C++, Rust, or anything else? Or is it completely tied to Spark/Flink?

Thank you very much for your help

codope commented 1 year ago

Not yet, but it's planned for version 1.0.0. https://hudi.apache.org/roadmap/

Currently, one can use Hudi with Python (pyspark), Java and Scala.

schlichtanders commented 1 year ago

Thank you for the pointer to the roadmap. A C/Rust implementation would be nice for the entire LLVM ecosystem. I myself am looking forward to using Julia together with Hudi some day in the future. (Julia also compiles via LLVM, so a C binding would be optimal.)

As 1.0.0 may still be far in the future, is the Java API also accessible outside of Apache Spark? I mean as a pure Java library that could be loaded from other languages?

yihua commented 1 year ago

Hi @schlichtanders, Hudi has a pure Java API for writing tables through HoodieJavaWriteClient. You can check the examples in HoodieJavaWriteClientExample.

I'll close this issue for now. Feel free to reopen the issue if you have more questions.

schlichtanders commented 1 year ago

@yihua is there also a ReadClient? An example would also be great.

schlichtanders commented 1 year ago

@yihua

cheunhong commented 7 months ago

Hudi is certainly lagging behind in native support for other languages. Iceberg and Delta already have some pretty nice libraries, such as delta-rs and pyiceberg, for reading and writing files without a JVM.

schlichtanders commented 7 months ago

Thank you @cheunhong. I agree and it is a pity. Hudi's support for streaming is super attractive to me. Neither delta-rs nor Iceberg has it, as far as I know...

yihua commented 6 months ago

@schlichtanders @cheunhong I missed this discussion. We are considering support for other languages. If you have a use case, I'd love to chat with you about it and see how it can be better supported.

We have an experimental PR on read support in Python: #8768. We have also introduced a Hudi file group reader to make read integration in engines easier.

schlichtanders commented 6 months ago

For me Python is actually not the problem - via Spark and Flink it is pretty well supported.

My use case is to use the modern programming language Julia directly, without the JVM in between, because the language itself is highly performant and has distributed computing support. That makes it a perfect match for working with Hudi, for big data as well as streaming. Hence it would be great if Hudi were also accessible without Spark and Flink, i.e. without a JVM.

rubenatterbury commented 5 months ago

I was looking into a Rust implementation because of the work happening on pg_analytics by ParadeDB, where they had to choose delta-rs purely because they depend on Rust tooling to create the Postgres extension. The use case here is that, in theory, if you integrate Hudi (or, as they are doing, Delta Lake) as a Postgres extension, you can offload data directly onto your data lake, which makes the transition to a lakehouse architecture much easier and avoids external ETL tooling.

A lot of the OSS work being done by Materialize.com, Neon.tech, and DataBend is happening in Rust, so if Hudi could integrate with the modern development happening in Rust, it could be a big win for the ecosystem, I imagine.

vinothchandar commented 4 months ago

@xushiyan do you want to share the budding hudi-rs and Python bindings here, to see if anyone wants to chip in with contributions?

vinothchandar commented 4 months ago

https://github.com/xushiyan/hudi-rs has some basic reads with datafusion?

xushiyan commented 4 months ago

@vinothchandar yes. gonna take care of repo logistics and dev setup to make the repo ready for new contributors. Also preparing issues to work on.

xushiyan commented 1 month ago

@rubenatterbury @schlichtanders @cheunhong we have officially released hudi-rs 0.1.0! https://github.com/apache/hudi-rs