Before the Schema Evolution operator calls org.apache.paimon.flink.sink.cdc.UpdatedDataFieldsProcessFunctionBase#extractSchemaChanges, add a check that confirms whether a field update actually needs to be triggered.
Add a List variable, latestSchemaList, that is used to determine whether an incoming column is actually an updated column.
Add a ListState, latestSchemaListState. When the task is restored from state, latestSchemaList is restored directly from it.
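
A minimal sketch of the check described above, in plain Java without Flink or Paimon dependencies. Only latestSchemaList is a name taken from this change. Field, SchemaStore, SchemaCheckSketch, readLatest, apply and snapshot are hypothetical stand-ins, and the real operator would presumably keep the list in latestSchemaListState rather than pass it through a constructor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Illustrative only: models the "check before extracting schema changes" idea. */
class SchemaCheckSketch {

    /** Hypothetical stand-in for a table field (name plus type). */
    static final class Field {
        final String name;
        final String type;

        Field(String name, String type) {
            this.name = name;
            this.type = type;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Field)) {
                return false;
            }
            Field other = (Field) o;
            return name.equals(other.name) && type.equals(other.type);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, type);
        }
    }

    /** Hypothetical stand-in for the expensive latest-schema lookup; counts reads. */
    static final class SchemaStore {
        private final List<Field> fields = new ArrayList<>();
        int reads = 0;

        List<Field> readLatest() {
            reads++;
            return new ArrayList<>(fields);
        }

        void apply(List<Field> updated) {
            for (Field f : updated) {
                // Replace a field with the same name, or append a new one.
                fields.removeIf(existing -> existing.name.equals(f.name));
                fields.add(f);
            }
        }
    }

    private final SchemaStore store;
    // Cached copy of the latest schema, consulted before any schema read.
    private List<Field> latestSchemaList;

    /** Pass a restored list to skip the initial read, or null for a fresh start. */
    SchemaCheckSketch(SchemaStore store, List<Field> restored) {
        this.store = store;
        this.latestSchemaList =
                restored != null ? new ArrayList<>(restored) : store.readLatest();
    }

    /** Returns true only when the incoming fields actually required a schema update. */
    boolean processElement(List<Field> incoming) {
        if (latestSchemaList.containsAll(incoming)) {
            return false; // every field is already known: no schema read, no update
        }
        store.apply(incoming);
        latestSchemaList = store.readLatest(); // refresh the cache once per real change
        return true;
    }

    /** What a checkpoint would persist (latestSchemaListState in the real operator). */
    List<Field> snapshot() {
        return new ArrayList<>(latestSchemaList);
    }

    public static void main(String[] args) {
        SchemaStore store = new SchemaStore();
        SchemaCheckSketch fn = new SchemaCheckSketch(store, null);
        List<Field> row = Arrays.asList(new Field("id", "INT"));
        for (int i = 0; i < 1000; i++) {
            fn.processElement(row);
        }
        System.out.println("schema reads: " + store.reads);
    }
}
```

In this sketch the schema is read once at start-up and once per real change, so repeated records with already-known fields never touch the store. An instance built from snapshot() starts without any read at all.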
Purpose
Linked issue: Issue-4521
When a Paimon table has a large number of fields and write concurrency is high, reduce how often the latest schema is read in order to improve throughput during job cold start.
Tests
Case-1: Observe whether the checkpoint time of schema evolution changes.
Conclusion: After the optimization, schema evolution generally completes within seconds, or even milliseconds.
Case-2: Check the log to see whether a large number of schema reads still occur.
Conclusion: The number of schema reads dropped from hundreds of thousands to 115.
API and Format
org.apache.paimon.flink.sink.cdc.UpdatedDataFieldsProcessFunction#processElement
Documentation