a8356555 opened 4 days ago
What are the records generated by the MySQL CDC connector?
You are using upsert mode in FlinkSink.
In upsert mode, when an update happens, Flink expects the primary key to stay unchanged: it removes the old values for the given primary key and inserts a new record.
When a record is updated in a way that changes the primary key, it is not really an update in upsert mode; it should be a delete and an insert instead. It is the responsibility of the input stream to generate the correct records.
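To make the failure mode concrete, here is a minimal sketch (plain Python, not the Flink/Iceberg APIs) of upsert semantics keyed on equality fields. When a mutable column such as `status` is part of the key, the upsert for the updated row targets the *new* key, so the old row is never deleted and a duplicate survives:

```python
def upsert(table, key_cols, row):
    """Apply an upsert: replace any row that has the same key, then insert."""
    key = tuple(row[c] for c in key_cols)
    table[key] = row

def run(key_cols):
    table = {}
    upsert(table, key_cols, {"id": 1, "status": "NEW"})
    # The MySQL update changes `status`; CDC emits an upsert for the new image.
    upsert(table, key_cols, {"id": 1, "status": "DONE"})
    return list(table.values())

print(run(("id",)))           # key excludes status -> old row is replaced
print(run(("id", "status")))  # key includes status -> both rows survive (duplicate)
```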
You can use `write.upsert.enabled` set to `false` if the MySQL connector is able to generate a retract stream.
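For reference, a hedged sketch of how that Iceberg table property could be set in the Flink SQL `WITH` clause at table-creation time (table and catalog names are taken from this thread; the `[..]` column list stays elided as in the original):

```sql
CREATE TABLE IF NOT EXISTS glue_catalog.my_db.my_table (
  [..]
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  -- Disable Iceberg upsert writes; the sink then relies on the
  -- retraction records (delete + insert) coming from the CDC source.
  'write.upsert.enabled' = 'false'
);
```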
But my use case requires upsert, so in this scenario, using status as the partition key is not suitable, right?
The problem is not with the partitioning. The problem is that you added `status` to your PRIMARY KEY.
```sql
CREATE TABLE mysql_cdc_source (
  [..]
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
```

```sql
CREATE TABLE IF NOT EXISTS glue_catalog.my_db.my_table (
  [..]
  PRIMARY KEY (id, status) NOT ENFORCED
) PARTITIONED BY (
```
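One possible direction, sketched here as an assumption rather than a verified fix: keep the primary key limited to the stable `id` column and avoid partitioning by the mutable `status` column, since with Flink upsert writes the partition column generally has to be part of the identifier fields, which is exactly what reintroduces the duplicate when `status` changes:

```sql
-- Hedged sketch: stable primary key, no partitioning on a mutable column.
CREATE TABLE IF NOT EXISTS glue_catalog.my_db.my_table (
  [..]
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'write.upsert.enabled' = 'true'
);
```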
Apache Iceberg version
1.5.2
Query engine
Athena
Please describe the bug
Hi,
I'm using MySQL Flink CDC with Iceberg 1.5.2 and Flink 1.16. I have a table partitioned by the status column, but this column is subject to updates. When an update occurs, I encounter duplicate records in the Iceberg table, which is not the desired behavior.
Is there a way to properly handle updates on a partition column in Iceberg to avoid duplicates?
Here is the SQL of my Flink CDC job:
Data before updating MySQL:
Data after updating MySQL (the duplicated row showed up):
I get the same results when I query `glue_catalog.my_db.my_table.files` using Spark 3.4.

Thanks!
Willingness to contribute