apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Question] What are the risks associated with the Java API? #3420

Open jk47 opened 4 months ago

jk47 commented 4 months ago

Motivation

https://paimon.apache.org/docs/0.8/program-api/java-api/ comes with a warning at the top:

We do not recommend using the Paimon API naked, unless you are a professional downstream ecosystem developer, and even if you do, there will be significant difficulties.

If you are only using Paimon, we strongly recommend using computing engines such as Flink SQL or Spark SQL.

The following documents are not detailed and are for reference only.

Can you elaborate on the difficulties that will be encountered?

tsreaper commented 3 months ago

The main difficulty is deciding where each class should be used and where each method should be called.

For example, consider a distributed system with one master node and several worker nodes. TableScan should only be used on the master, while TableRead and TableWrite should only be used on the workers. You also need to design how the Splits generated by TableScan are distributed to the workers, and you need to be careful with TableCommit, because it must run with a parallelism of 1 (otherwise the consistency guarantee is broken).
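To make that division of roles concrete, here is a minimal, self-contained sketch in plain Java. It does not call Paimon at all: `Split`, `CommitMessage`, `planSplits`, `processSplits`, and `commit` are hypothetical stand-ins for the roles played by TableScan, TableRead/TableWrite, and TableCommit, and the round-robin assignment is just one assumed way to distribute splits. A thread pool stands in for the worker nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CoordinatorSketch {

    // Hypothetical stand-ins for illustration only; these are NOT Paimon classes.
    static final class Split {
        final int id;
        Split(int id) { this.id = id; }
    }

    static final class CommitMessage {
        final int splitId;
        CommitMessage(int splitId) { this.splitId = splitId; }
    }

    // Master only: plan the units of work (the role TableScan plays).
    static List<Split> planSplits(int count) {
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            splits.add(new Split(i));
        }
        return splits;
    }

    // Worker only: process the assigned splits (the role TableRead/TableWrite play)
    // and return one commit message per split instead of committing locally.
    static List<CommitMessage> processSplits(List<Split> assigned) {
        List<CommitMessage> messages = new ArrayList<>();
        for (Split split : assigned) {
            messages.add(new CommitMessage(split.id));
        }
        return messages;
    }

    // Master only, called from a single thread (the role TableCommit plays).
    static List<Integer> commit(List<CommitMessage> messages) {
        List<Integer> committed = new ArrayList<>();
        for (CommitMessage message : messages) {
            committed.add(message.splitId);
        }
        return committed;
    }

    public static List<Integer> run(int splitCount, int workerCount) {
        List<Split> splits = planSplits(splitCount);

        // Assumed distribution strategy: round-robin over the workers.
        List<List<Split>> assignments = new ArrayList<>();
        for (int w = 0; w < workerCount; w++) {
            assignments.add(new ArrayList<>());
        }
        for (int i = 0; i < splits.size(); i++) {
            assignments.get(i % workerCount).add(splits.get(i));
        }

        ExecutorService workers = Executors.newFixedThreadPool(workerCount);
        try {
            List<Future<List<CommitMessage>>> futures = new ArrayList<>();
            for (List<Split> assigned : assignments) {
                Callable<List<CommitMessage>> task = () -> processSplits(assigned);
                futures.add(workers.submit(task));
            }
            // Gather every worker's messages, then commit exactly once.
            List<CommitMessage> all = new ArrayList<>();
            for (Future<List<CommitMessage>> future : futures) {
                all.addAll(future.get());
            }
            return commit(all);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("committed split ids: " + run(6, 3));
    }
}
```

The point of the sketch is the shape, not the details: planning and committing each happen in exactly one place, while the parallel part only produces messages. In a real deployment you would additionally have to serialize the splits and commit messages across the network and handle worker failures, which the sketch ignores.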

All in all, these are exactly the things you need to consider when designing a distributed system.