datastax / cdc-apache-cassandra

Datastax CDC for Apache Cassandra
Apache License 2.0

Release all semaphore permits when the current task instance is to be retried #158

Closed aymkhalil closed 1 year ago

aymkhalil commented 1 year ago

This patch fixes a problem in the agent: when Pulsar is temporarily unavailable and then recovers, a task with partial failures could hang forever. Here is a sample timeline:
[T1] The Pulsar endpoint is not reachable due to a network problem.
[T2] A CDC-enabled table continues to receive mutations. All mutations are sent to the cdc_raw directory for async processing.
[T3] The agent picks up the *.idx files from cdc_raw and submits them to the pendingTasks queue for execution. Although there is a logical 1:1 mapping between a Task and a segment (represented by a pair of *.idx and *.log files in the commit log dir), the segment file is mutable. With each segment mutation, a new task is submitted for processing.
[T4] A task starts; it reads the log files and starts sending mutations one by one, up to the available permits in the inflightMessagesSemaphore.
[T5] Because Pulsar is not available, each task will wait forever until Pulsar is back; this is controlled by the finish() logic.
[T6] Pulsar becomes reachable again.
[T7] There are two possibilities, depending on what happened to the in-flight Pulsar requests:

A thread dump was taken that is in line with the above findings. Here is the interesting part: (image: Thread dump)

A unit test is added to mimic what happened during the thread dump as closely as possible.
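As a rough illustration of the idea in the PR title, the sketch below shows how permits held by an abandoned attempt can starve the retry, and how releasing all of them before retrying avoids that. This is not the agent's actual code: the class `RetryableTaskSketch`, its methods, and the `MAX_INFLIGHT` value are hypothetical names made up for this example; only the `inflightMessagesSemaphore` name comes from the description above.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical sketch of a task that bounds in-flight sends with a semaphore. */
final class RetryableTaskSketch {
    // Assumed limit on in-flight messages; the real agent's value may differ.
    static final int MAX_INFLIGHT = 3;

    private final Semaphore inflightMessagesSemaphore = new Semaphore(MAX_INFLIGHT);
    // Permits taken by the current attempt and not yet given back.
    private int heldByCurrentAttempt = 0;

    /** Simulates sending one mutation: succeeds only if a permit is free. */
    boolean trySend() {
        if (inflightMessagesSemaphore.tryAcquire()) {
            heldByCurrentAttempt++;
            return true;
        }
        return false;
    }

    /** Simulates a completed send giving its permit back. */
    void onSendCompleted() {
        heldByCurrentAttempt--;
        inflightMessagesSemaphore.release();
    }

    /**
     * Abandons the current attempt so the task can be retried.
     * With releaseAll == false the held permits are leaked, so the retry
     * can never acquire one. With releaseAll == true they are all returned.
     */
    void prepareRetry(boolean releaseAll) {
        if (releaseAll) {
            inflightMessagesSemaphore.release(heldByCurrentAttempt);
        }
        heldByCurrentAttempt = 0;
    }

    int availablePermits() {
        return inflightMessagesSemaphore.availablePermits();
    }
}
```

In this sketch, an attempt that took every permit and is then retried without releasing them leaves `availablePermits()` at zero, so the next `trySend()` fails; a blocking `acquire()` in the same position would wait forever, which is the kind of hang the timeline describes.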

Full thread dump: https://jstack.review/?https://gist.github.com/aymkhalil/237de69bb919d13413df30c77570fd93