databricks / iceberg-kafka-connect

Apache License 2.0

Coordinator behaviour when tasks.max > 1 and the producer uses a round-robin strategy due to null keys #290

Open anmol opened 2 months ago

anmol commented 2 months ago

Hi,

I am implementing a CDC pipeline from Oracle in which some tables have no explicit primary key. We specify the id columns in the sink connector based on our knowledge of the data (there is no constraint on the source, though), and the sink connector works fine.

However, my concern is that the lack of a primary key on the source means null keys in Kafka, so mutations on a source record (multiple updates) are not guaranteed any ordering in Kafka, because the producer spreads null-key records across partitions. If we then set tasks.max > 1 in the sink connector properties, updates to the same record may be processed by different tasks (workers) and in a different order.

Could this result in inconsistent behaviour during commit, for example the update order changing because the coordinator commits the second update in the first batch and the first update in a subsequent commit?
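
To make the concern concrete, here is a small simulation of what I mean. It does not use the Kafka client at all: `round_robin_partitioner`, `keyed_partitioner`, `NUM_PARTITIONS`, and the sample events are made up for illustration, and `zlib.crc32` is only a stand-in for the producer's real key hash. It is a rough sketch under those assumptions, not a description of what the producer or the connector actually does internally.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for the CDC topic


def round_robin_partitioner():
    """Simplified stand-in for null-key records: cycle through partitions."""
    counter = 0

    def assign(_record_id):
        nonlocal counter
        partition = counter % NUM_PARTITIONS
        counter += 1
        return partition

    return assign


def keyed_partitioner(record_id):
    """Simplified stand-in for keyed records: same key, same partition."""
    return zlib.crc32(record_id.encode("utf-8")) % NUM_PARTITIONS


def route(events, assign):
    """Group (record_id, version) events by the partition they are sent to."""
    partitions = defaultdict(list)
    for record_id, version in events:
        partitions[assign(record_id)].append((record_id, version))
    return dict(partitions)


def partitions_holding(layout, record_id):
    """Return the sorted partition numbers that contain events for record_id."""
    return sorted(
        p for p, recs in layout.items() if any(r == record_id for r, _ in recs)
    )


# Three successive updates to row-42, interleaved with one event for row-7.
events = [("row-42", 1), ("row-7", 1), ("row-42", 2), ("row-42", 3)]

null_key_layout = route(events, round_robin_partitioner())
keyed_layout = route(events, keyed_partitioner)

print("null keys:", partitions_holding(null_key_layout, "row-42"))
print("with keys:", partitions_holding(keyed_layout, "row-42"))
```

With null keys, the updates to `row-42` end up in more than one partition, so with tasks.max > 1 they could be read by different tasks, and Kafka only guarantees ordering within a single partition. With a key, all updates to `row-42` stay in one partition.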

cc @bryanck

Thanks in advance.