apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0
2.43k stars 954 forks source link

[paimon-flink-cdc] Add the latest_schema state at schema evolution operator ,Reduce the latest schema access frequency #4535

Open GangYang-HX opened 6 days ago

GangYang-HX commented 6 days ago

Purpose

Linked issue: Issue-4521

In scenarios where the number of Paimon table fields is large and the Write concurrency is high, reduce the Latest-Schema access frequency to improve the throughput of job cold start

Tests

Case-1: Observe whether the checkpoint time of schema evolution changes image image image Conclusion: After optimization, Schema Evolution is basically completed in seconds, or even milliseconds.

Case-2: Observe the log to see if there are still a large number of read schema behaviors image Conclusion: From hundreds of thousands to 115 times

API and Format

org.apache.paimon.flink.sink.cdc.UpdatedDataFieldsProcessFunction#processElement

Documentation

Before the Schema Evolution operator calls org.apache.paimon.flink.sink.cdc.UpdatedDataFieldsProcessFunctionBase#extractSchemaChanges, add a judgment to confirm whether the field update really needs to be triggered.

  1. Add a List variable to determine whether it is an updated column: List latestSchemaList
  2. Add a state ListState. When the task is restored from the state, it is directly restored from here: ListState latestSchemaListState