apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0
2.42k stars, 951 forks

[Feature] Check in advance whether any fields have changed; if none have, skip fetching the latest schema information. #4521

Open GangYang-HX opened 2 days ago

GangYang-HX commented 2 days ago

Search before asking

Motivation

Optimize the logic of org.apache.paimon.flink.sink.cdc.UpdatedDataFieldsProcessFunctionBase#extractSchemaChanges: check first whether updatedDataFields is empty, so that the latest schema information is not fetched on every call.

Solution

Check whether updatedDataFields is empty before doing anything else, and return early if it is, so that the latest schema information is fetched only when there are fields to compare.
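The early-exit check could look roughly like the sketch below. It is not the actual Paimon code: the class SchemaChangeSketch, the Field type, and the latestSchemaLoader parameter are hypothetical stand-ins, with a Supplier playing the role of whatever reads the latest schema from the file system, and the schema reduced to a map from field name to type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch; names and types are illustrative, not Paimon's API.
final class SchemaChangeSketch {

    // Minimal stand-in for a data field: a name and a type string.
    static final class Field {
        final String name;
        final String type;

        Field(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    // Returns the fields that are new or whose type differs from the latest schema.
    // latestSchemaLoader is assumed to be the expensive call (file system access).
    static List<Field> extractSchemaChanges(
            Supplier<Map<String, String>> latestSchemaLoader,
            List<Field> updatedDataFields) {
        // Cheap check first: with nothing to compare, skip the schema load entirely.
        if (updatedDataFields.isEmpty()) {
            return Collections.emptyList();
        }
        Map<String, String> latest = latestSchemaLoader.get();
        List<Field> changes = new ArrayList<>();
        for (Field field : updatedDataFields) {
            String existingType = latest.get(field.name);
            if (existingType == null || !existingType.equals(field.type)) {
                changes.add(field);
            }
        }
        return changes;
    }
}
```

With this ordering, a call with an empty updatedDataFields list never touches the loader, and a non-empty list loads the schema exactly once per call.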

Anything else?

No response

Are you willing to submit a PR?

GangYang-HX commented 1 day ago

Background

Each thread maintains its own set of field information. If different threads process the same field one after another, each of them triggers the schema change check. The latest schema information is then fetched frequently, which reduces the overall throughput of the task and can lead to follow-on failures such as checkpoint failures.

Solution

Maintain a state cache of the latest schema information to avoid accessing the file system directly on every check.
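One way to read this proposal is a small cache in front of the schema lookup, sketched below. Everything here is an assumption for illustration: LatestSchemaCache and its loader are hypothetical names, the schema is reduced to a map from field name to type, and the cache lives in plain memory rather than in Flink state. The cache is reloaded only when an incoming field is missing from it or has a different type, since only then is a fresh read needed to decide whether the schema must change.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of a latest-schema cache; not Paimon's actual API.
final class LatestSchemaCache {

    private final Supplier<Map<String, String>> loader; // expensive: reads the file system
    private Map<String, String> cached;                 // field name -> type, null until first load
    private int loadCount;                              // kept only to make the effect visible

    LatestSchemaCache(Supplier<Map<String, String>> loader) {
        this.loader = loader;
    }

    // Returns true if the field is present in the schema with the same type.
    // The loader is called only when the cache cannot confirm that.
    boolean isKnown(String name, String type) {
        if (cached != null && type.equals(cached.get(name))) {
            return true; // cache hit: no file system access
        }
        reload();
        return type.equals(cached.get(name));
    }

    private void reload() {
        cached = new HashMap<>(loader.get());
        loadCount++;
    }

    int loadCount() {
        return loadCount;
    }
}
```

A cache hit can be stale if another writer changed the field's type in the meantime. Whether that is acceptable depends on the schema evolution rules, which this sketch does not model.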

GangYang-HX commented 1 day ago

For example, suppose the Paimon table has 1500 fields, the parallelism of the Write operator is 500, and the task is restarted. In the extreme case this triggers 1500 × 500 = 750,000 calls to fetch the latest schema information. If each call takes 30 ms, the total time is 750,000 × 30 ms = 22,500 s = 6.25 h. This greatly reduces the throughput of the task.