apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

How can I use Iceberg in C++ #5122

Open THUMarkLau opened 2 years ago

THUMarkLau commented 2 years ago

I looked into how to use Iceberg and only found the Java and Python APIs. If I want to query Iceberg data from C++, do I have to write my own code to parse Iceberg files?

Thanks!

samredai commented 2 years ago

Hi @THUMarkLau, currently there is no C++ Iceberg client but it's been recently discussed on the Iceberg dev list and a number of people are in support of developing one (as well as a Rust client). It's an effort that would take some time and significant community support, as well as a compatibility framework as @emkornfield suggested.
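To give a rough sense of what "writing your own code to parse Iceberg files" involves, here is a minimal sketch of only the first step: reading a table metadata JSON document and locating the manifest list of the current snapshot. It is written in Python for brevity rather than C++. The field names follow the Iceberg table spec; the sample document, its paths, and the helper name `current_manifest_list` are made up for illustration. Reading the Avro manifest list, the manifests, and the data files is not covered.

```python
import json

# Hypothetical, heavily trimmed table metadata document. Real metadata files
# carry many more fields (schemas, partition specs, properties, ...).
SAMPLE_METADATA = """
{
  "format-version": 2,
  "table-uuid": "00000000-0000-0000-0000-000000000000",
  "location": "s3://example-bucket/warehouse/db/events",
  "current-snapshot-id": 2,
  "snapshots": [
    {"snapshot-id": 1,
     "manifest-list": "s3://example-bucket/warehouse/db/events/metadata/snap-1.avro"},
    {"snapshot-id": 2,
     "manifest-list": "s3://example-bucket/warehouse/db/events/metadata/snap-2.avro"}
  ]
}
"""


def current_manifest_list(metadata_json):
    """Return the manifest-list path of the current snapshot, or None if there is none."""
    metadata = json.loads(metadata_json)
    current_id = metadata.get("current-snapshot-id")
    # Assumption: a missing id, or -1, means the table has no snapshot yet.
    if current_id is None or current_id == -1:
        return None
    for snapshot in metadata.get("snapshots", []):
        if snapshot["snapshot-id"] == current_id:
            return snapshot["manifest-list"]
    raise ValueError(f"current snapshot {current_id} not found in snapshots")


print(current_manifest_list(SAMPLE_METADATA))
```

Everything after this step (Avro decoding, partition and schema evolution, scan planning) is where most of the effort of a native client would go.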

a49a commented 2 years ago

In the future, we can support a Rust SDK.

kbendick commented 2 years ago

I am also +1 on a Rust SDK.

mrendi29 commented 2 years ago

+1 for rust sdk

JanKaul commented 2 years ago

+1 for rust SDK

snth commented 2 years ago

+1 for Rust SDK

openinx commented 2 years ago

For more discussion about introducing a C++ SDK and a Rust SDK, please see the original mailing list thread: https://lists.apache.org/thread/lf8gw4yk9c6l580o6k7mobg2y91rpjvp

snth commented 2 years ago

@openinx Thanks for the link but that discussion is also a few months old and I didn't really see a conclusion.

My mostly uninformed opinion is that there currently is a battle in this space for the predominant table format between Iceberg and Delta. From what I've seen Iceberg looks like the better format but is less well supported. Delta has good Rust support. I'd like to set up a small scale Data Lakehouse with Iceberg and DuckDB but I don't want to go near a JVM. Having a good Rust implementation with a nice Python wrapper would do wonders here.

JanKaul commented 2 years ago

I'm currently trying to extend the existing https://github.com/oliverdaff/iceberg-rs repo to get a working rust iceberg implementation. If anyone wants to contribute, it would be awesome. Just getting some input on the architecture would be helpful.

openinx commented 2 years ago

> I'd like to set up a small scale Data Lakehouse with Iceberg and DuckDB but I don't want to go near a JVM

@snth I'd like to understand the scenario in which you want to access Iceberg tables with non-JVM tools (if you don't mind sharing).

I agree that it would be great to support Iceberg in Rust or other native languages. From the current discussion, I see that RisingWave and StarRocks need to integrate with Iceberg data lakes, so they would be very happy to see Rust support.

purefunctions commented 2 years ago

@JanKaul Just got linked to this conversation from the Apache Arrow Discord chat! I just (today) created my own project https://github.com/trust-in-rust/rustberg with the modest goal of supporting a small subset of Iceberg operations through the DataFusion project. The iceberg-rs library that you linked was the first one that I saw, but it only supports metadata files for now, as you discovered. I also found https://github.com/joshuarobinson/rust_iceberg which actually reads more than metadata and supports queries through DataFusion. However, it doesn't support the partition/schema-evolution-based split planning that one would have to do when using Iceberg with evolving tables.

@openinx Just like @snth, we also have a very small-scale data lake based on Iceberg, running on-premises. We use Iceberg to make it easier to migrate the lake to a cloud later. We use Hive Metastore to store the latest metadata files and NFS for table storage. We use Spark (a single node is sufficient for our needs) to ingest data, and we provide a Python library for end users to query the data. The query needs are modest: the partitioning scheme handles most of the targeting, and the resulting data that needs further querying easily fits on one node. Currently we use PySpark in the user library, and we'd like to reduce the JVM/Spark startup time and the general slowness that PySpark introduces. Hence we are looking at Rust for an Iceberg-lite kind of query API based on DataFusion.

JanKaul commented 2 years ago

Hey @purefunctions, it's cool to hear that you are also working on a rust iceberg implementation. In the meantime I extended the iceberg-rs repo with a catalog trait, a postgres catalog, reading of manifest and manifest_list files. And I'm currently also working on a reader for datafusion. You can find the source at my fork of iceberg-rs: https://github.com/JanKaul/iceberg-rs. I created a couple of PRs for https://github.com/oliverdaff/iceberg-rs but I haven't received any response.

To avoid duplicate efforts I would like for all of us to focus on one repo. I don't mind which one.

purefunctions commented 2 years ago

Hi @JanKaul I like the idea of pooling resources and focussing on one implementation. I work for a company that's new to open source and wants me to get approval for personal contributions to open source as well. I've started that process for my repo. So, I'd have a slight preference to use that repo. This, however, is not a deal breaker.

I'd like to understand where our goals align so that the combined effort is beneficial for more than one person! How do people discuss the project goals in the open source world? Do they meet online in a chat (I'm in US Pacific Time Zone)? Sorry for newbie questions. Like I mentioned, I'm new to open source (as well as, to some extent, Rust)!

JanKaul commented 2 years ago

Sounds great. I guess the best places to discuss further topics are the Iceberg or Datafusion slack channels.

Regarding which repo to use, I would use the most feature complete repo and start from there. But I don't really mind. I hope that eventually the repo will become governed by apache.

zhjwpku commented 2 years ago

+1 for C++ SDK

alexey-milovidov commented 2 years ago

We need a C++ SDK for ClickHouse. Alternatively, we can use a Rust SDK if one becomes available.

ucasfl commented 1 year ago

+1 for C++ SDK

denniean commented 1 year ago

+1 for Rust SDK. Are there already any plans for it on the Iceberg roadmap? I could join someone and put some effort into working on it.

JanKaul commented 1 year ago

I have created a new repo https://github.com/JanKaul/iceberg-rust that I'm actively working on. It supports reading and writing of metadata, manifest and data files as well as preliminary read support for Datafusion.

For those interested in a C++ SDK, I created C bindings for the rust SDK at https://github.com/JanKaul/iceberg-rust/blob/main/iceberg-c/README.md.

denniean commented 1 year ago

@JanKaul Thanks for your efforts 🙏🏻 If you need additional help, please feel free to reach out to me.

JanKaul commented 1 year ago

If I may ask: what query engine would you like to use with iceberg?

snth commented 1 year ago

I personally think DataFusion is great. I'd also like to see DuckDB (I believe this may be coming from their side).

@JanKaul what's still missing from your project? If I start with a bunch of csv files, what do I need in addition to iceberg-rust to get a small iceberg service up and running? I'm guessing an object store like Minio, and apart from that?

JanKaul commented 1 year ago

The biggest missing piece in iceberg-rust is writing to Iceberg. I will try to work on this in the next few days.

Regarding your storage backend you can use an object-store like S3, minio or use a local filesystem.

If you want to use an object-store like S3 or minio you should use an iceberg catalog to register your tables. At the moment there is preliminary support for the rest catalog.

I will try to get an example working and will share it here.
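As background on why a catalog is recommended for object stores: a catalog maps a table identifier to the location of the table's current metadata file and lets writers swap that pointer atomically. The following is a toy, in-memory sketch of that idea only. It is not the iceberg-rust API and not the REST catalog protocol; the class name, method names, and paths are all hypothetical.

```python
import threading


class InMemoryCatalog:
    """Toy catalog: table identifier -> location of the current metadata file."""

    def __init__(self):
        self._tables = {}
        self._lock = threading.Lock()

    def register_table(self, identifier, metadata_location):
        """Register a new table; fail if the identifier is already taken."""
        with self._lock:
            if identifier in self._tables:
                raise ValueError(f"table already exists: {identifier}")
            self._tables[identifier] = metadata_location

    def load_table(self, identifier):
        """Return the current metadata location (raises KeyError if unknown)."""
        with self._lock:
            return self._tables[identifier]

    def commit(self, identifier, expected_location, new_location):
        """Swap the metadata pointer only if it still equals expected_location.

        Returns True on success and False if another writer committed first.
        """
        with self._lock:
            if self._tables.get(identifier) != expected_location:
                return False
            self._tables[identifier] = new_location
            return True


catalog = InMemoryCatalog()
catalog.register_table("db.events", "s3://example-bucket/db/events/metadata/v1.json")

base = catalog.load_table("db.events")
first = catalog.commit("db.events", base, "s3://example-bucket/db/events/metadata/v2.json")
# A second writer that still holds the stale base location is rejected.
second = catalog.commit("db.events", base, "s3://example-bucket/db/events/metadata/v2b.json")
print(first, second)
```

A rejected writer would reload the table, reapply its changes on top of the new metadata, and try the commit again.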

alexey-milovidov commented 1 year ago

We have added support for Iceberg to ClickHouse without an SDK, but it is quite limited: no updates, and the metadata support is sketchy. An SDK, either C++ or Rust, would help a lot.

JanKaul commented 1 year ago

You probably wrote the iceberg engine in C++, right? It would be great if we could somehow combine the effort.

To be honest, iceberg-rust is also missing a lot of features. Updates, compaction, and more are not supported at the moment.

denniean commented 1 year ago

I think it would be great to bring the topic to community sync to align on the strategy about Rust/C++ SDK.

Some contributors participated in the discussion already. So I'll tag you if you don't mind.

@samredai @rdblue do you think it would be possible to get attention to the topic during the upcoming Iceberg community sync? Thanks

samredai commented 1 year ago

@denniean there's a community sync this Wednesday, feel free to add this as an agenda topic in the Iceberg Community Sync doc. To really get either a Rust or C++ SDK off the ground, my thoughts are that it's important to get someone invested to volunteer to shepherd the effort. A dedicated sync on some cadence would probably be useful as well.

JanKaul commented 1 year ago

Yeah, I'll gladly join the community sync.

JanKaul commented 1 year ago

The iceberg community is having a meeting to discuss the strategy for a C++/Rust SDK on April 19th 2023 at 16:00 UTC. You can join the meeting with the following link: https://meet.google.com/ueb-vmvx-hdw

denniean commented 1 year ago

@JanKaul thank you for the initiative. Unfortunately, I could not join the last community sync, but I will participate in the discussion this Wednesday.

JanKaul commented 1 year ago

For those who missed the meeting, here is the recording: https://drive.google.com/file/d/15Ak3y0LrnnadmeSt9jKa9B7lA-GDttfY/view?usp=sharing

tbragin commented 1 year ago

+1 on c++ client library. Have there been any recent conversations on this topic?

MRocholl commented 5 months ago

Just stumbled across delta-kernel-rs. In my opinion, this is also what Iceberg needs: one project to enable queries from Polars, DuckDB, WASM, and so on. A central library that can be included almost everywhere. Has this been discussed yet?