Open GangYang-HX opened 2 days ago
Background Here, each thread maintains its own set of field information. If different threads process the same field one after another, each of them triggers the schema change check. As a result, the latest schema information is fetched frequently, which lowers the overall throughput of the task and can cause follow-on exceptions such as checkpoint failures.
Solution Maintain a state cache of the latest schema information to avoid accessing the file system directly on every check.
For example, suppose the Paimon table has 1500 fields, the parallelism of the write operator is 500, and the task is restarted. In the extreme case, this triggers 1500 × 500 = 750,000 calls to fetch the latest schema information. If each call takes 30 ms, the total time is 750,000 × 30 ms = 22,500 s = 6.25 h. This greatly reduces the throughput of the task.
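The caching idea in the Solution above could look roughly like the sketch below. This is only an illustration: `CachedSchemaSupplier` is a hypothetical helper, not an existing Paimon class, and the `Supplier` stands in for whatever call actually reads the latest schema from the file system.

```java
import java.util.function.Supplier;

// Hypothetical helper (not part of Paimon): memoizes an expensive
// "latest schema" lookup so repeated checks do not hit the file system.
final class CachedSchemaSupplier<T> implements Supplier<T> {
    private final Supplier<T> loader; // the expensive lookup (assumed never to return null)
    private volatile T cached;        // null means "not loaded yet"

    CachedSchemaSupplier(Supplier<T> loader) {
        this.loader = loader;
    }

    @Override
    public T get() {
        T local = cached;
        if (local == null) {
            synchronized (this) {
                local = cached;
                if (local == null) {
                    // Only the first caller (or the first after invalidate()) pays the cost.
                    local = loader.get();
                    cached = local;
                }
            }
        }
        return local;
    }

    // Drop the cached value so the next get() reloads it,
    // for example after a schema change has been applied.
    void invalidate() {
        cached = null;
    }
}
```

With a cache like this, each writer subtask would pay for one lookup after a restart instead of one per field, provided the cache is invalidated whenever a schema change is actually applied.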
Search before asking
Motivation
Optimize the logic of org.apache.paimon.flink.sink.cdc.UpdatedDataFieldsProcessFunctionBase#extractSchemaChanges: check whether updatedDataFields is empty first, so that the latest schema information is not read on every call.
Solution
Check whether updatedDataFields is empty before anything else and return early if it is, so that the latest schema information is read only when there are fields to compare.
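A minimal sketch of the early-return idea, under simplifying assumptions: field names are plain strings rather than Paimon `DataField` objects, the `Supplier` is a stand-in for the real schema lookup, and `ExtractSchemaChangesSketch` and `extractNewFields` are hypothetical names, not the actual method signature.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for extractSchemaChanges, reduced to field names.
final class ExtractSchemaChangesSketch {

    static List<String> extractNewFields(
            List<String> updatedDataFields,
            Supplier<List<String>> latestSchemaFields) {
        // Check the cheap condition first: with nothing to compare,
        // the expensive schema lookup is skipped entirely.
        if (updatedDataFields.isEmpty()) {
            return Collections.emptyList();
        }
        // Only now read the latest schema (assumed to be the costly step).
        List<String> latest = latestSchemaFields.get();
        List<String> added = new ArrayList<>();
        for (String field : updatedDataFields) {
            if (!latest.contains(field)) {
                added.add(field);
            }
        }
        return added;
    }
}
```

The point is only the ordering: the emptiness check costs nothing, so doing it before the lookup removes the file system access from the common case where no fields changed.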
Anything else?
No response
Are you willing to submit a PR?