delta-io / kafka-delta-ingest

A highly efficient daemon for streaming data from Kafka into Delta Lake
Apache License 2.0
337 stars 72 forks source link

Add seek-offsets config to set the starting point for the kafka ingestion #92

Closed mosyp closed 2 years ago

mosyp commented 2 years ago

The current approach with relying on kafka seeks is unreliable since consumer could ingest topics from earliest before the actual seek and as such corrupt the delta store. Just adding the filtering on the consumed offsets/stored offsets as with txn actions is not enough since there's more places where it could crash, e.g. the rebalance event with writers reseek and state clearance.

As such, we're moving towards the well tested and bullet proof usage of txn actions which are guarded by delta protocol and optimistic concurrency loop with dynamodb locking.

The introducing changes will write first the delta log version with txn actions for each partition that are given by startingOffsets param. This is done only once by the very first succeeding writer. Once this is achieved, then the delta table is protected from duplicates and kafka seek inconsistency by delta protocol.

mosyp commented 2 years ago

Example logs from the unit tests

running 1 test
[INFO] Writing starting offsets [0:5,1:10] to delta table ./tests/gen/table-44a3f59f-962d-4194-919c-49644244ac7d
[INFO] Delta version 1 completed with starting offsets 0:5,1:10.
[INFO] Writing starting offsets [0:5,1:10] to delta table ./tests/gen/table-44a3f59f-962d-4194-919c-49644244ac7d
[INFO] The starting offsets are already applied for table ./tests/gen/table-44a3f59f-962d-4194-919c-49644244ac7d
[INFO] Writing starting offsets [0:15] to delta table ./tests/gen/table-44a3f59f-962d-4194-919c-49644244ac7d
[ERROR] Stored offsets for partitions [0] are lower than given starting offsets in table ./tests/gen/table-44a3f59f-962d-4194-919c-49644244ac7d
test starting_offsets::tests::write_starting_offsets_test ... ok
mosyp commented 2 years ago

Leaving it as a draft for earlier approach review and will finish the integration a little bit later