Hi @miretskiy, please add branch-* labels to identify which branch(es) this release-blocker affects.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
cc @cockroachdb/cdc
From @samiskin:
Yeah, looking at async_producer.go it seems like Sarama will keep track of what's errored and block emitting it while it's retrying, but once it's failed it'll just fail the broker producer; later a new one gets created for that partition, and it loses the map of what has failed.
I don't know if there's even a way to deal with this rigorously using Sarama, as it looks like it's still possible that between the time Sarama does "run out of retries for k@t1, mark the broker as failed, and put the error in Errors()" and we do "see the error on Errors() and set flushErr", we've already done "insert k@t2 into Input -> a new broker producer is made with no knowledge of the failure -> k@t2 is emitted".
Seems like Sarama only guarantees "Order [within partitions] is preserved even when there are network hiccups and certain messages have to be retried."[1], but doesn't guarantee ordering under messages failing.
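To make that window concrete, here is a minimal, hypothetical sketch against Sarama's AsyncProducer -- this is not the changefeed sink code, and the broker address, topic, and key are made up. It only shows that nothing orders "we observe the k@t1 failure on Errors()" before "we push k@t2 into Input()":

```go
// Illustrative only: the window between Sarama exhausting retries for k@t1
// and the caller observing that failure, during which k@t2 can already be
// handed to a freshly created broker producer.
package main

import (
	"fmt"
	"sync/atomic"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Return.Errors = true // failures surface asynchronously on Errors()
	cfg.Producer.Retry.Max = 3

	producer, err := sarama.NewAsyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		panic(err)
	}
	defer producer.AsyncClose()

	var flushErr atomic.Value // set once an error is observed on Errors()

	go func() {
		for perr := range producer.Errors() {
			// By the time we get here, k@t1 has already run out of retries
			// inside Sarama; the producer that knew about it is gone.
			flushErr.Store(perr.Err)
			fmt.Println("observed failure:", perr.Err)
		}
	}()

	// k@t1: suppose this message exhausts its retries inside Sarama.
	producer.Input() <- &sarama.ProducerMessage{
		Topic: "t", Key: sarama.StringEncoder("k"), Value: sarama.StringEncoder("v@t1"),
	}

	// Race: this check can run before the goroutine above has seen the error
	// for k@t1, so k@t2 goes out even though k@t1 was never delivered.
	if flushErr.Load() == nil {
		producer.Input() <- &sarama.ProducerMessage{
			Topic: "t", Key: sarama.StringEncoder("k"), Value: sarama.StringEncoder("v@t2"),
		}
	}
}
```

Whether the second send happens before or after the error is observed is purely a matter of goroutine scheduling, which is exactly the ordering hazard described above.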
Removing GA blockers now that the backports have merged.
While testing kafka batch retries, the following behavior was observed:
- 1 topic created with max message size 128K
- Changefeed created with a somewhat insane configuration:
Each row in tpcc.history is about 216 bytes -- thus, only ~600 messages could ever fit in a 128K batch (131072 / 216 ≈ 606). The following error messages were observed in the logs:
The last message
kafka sink abandoning internal retry due to error: ‹kafka: Failed to deliver 5641 messages.›
is particularly disconcerting. If I'm reading the code correctly, when that happens the error is returned up the stack: https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/changefeedccl/sink_kafka.go#L518-L518
So, the good news is that we do set flushErr; the bad news is that we clear out retryBuf (in endInternalRetry) and we clear out the retry error:
So, at this point, we have swallowed 5641 messages and set flushErr; now, clearing out retryErr above causes the isRetrying function to return false:
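A hypothetical, heavily simplified model of that state transition -- not the actual sink_kafka.go code; fakeKafkaSink and its fields are stand-ins -- shows why this reopens the sink:

```go
// Simplified model of the behavior described above: the internal retry is
// abandoned, the failure is parked in flushErr, and retryErr is cleared,
// which flips isRetrying() back to false.
package main

import (
	"errors"
	"fmt"
)

type fakeKafkaSink struct {
	flushErr error    // surfaced only when Flush() is eventually called
	retryErr error    // non-nil while an internal (batch-reduction) retry is active
	retryBuf []string // rows queued for the internal retry
}

func (s *fakeKafkaSink) isRetrying() bool { return s.retryErr != nil }

// endInternalRetry abandons the retry: buffered rows are dropped, the error is
// stashed in flushErr, and retryErr is cleared.
func (s *fakeKafkaSink) endInternalRetry(err error) {
	s.flushErr = err
	s.retryBuf = nil // the buffered messages are gone
	s.retryErr = nil // new emits are accepted from here on
}

func (s *fakeKafkaSink) EmitRow(row string) error {
	if s.isRetrying() {
		return errors.New("still retrying; caller must wait")
	}
	// Nothing here knows that an older version of this key may have just been
	// dropped by endInternalRetry.
	return nil
}

func main() {
	s := &fakeKafkaSink{
		retryErr: errors.New("kafka: message too large"),
		retryBuf: []string{"k@t1"},
	}
	s.endInternalRetry(errors.New("kafka: Failed to deliver 5641 messages."))
	fmt.Println(s.isRetrying())    // false: the sink looks healthy again
	fmt.Println(s.EmitRow("k@t2")) // <nil>: k@t2 goes out, k@t1 never did
	fmt.Println(s.flushErr)        // only seen if and when Flush checks it
}
```

Once retryErr is nil, nothing stops a newer version of a key from going out while flushErr sits unread until the next flush.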
This is very bad: new messages can now come in, and that is already fatally broken, since we might emit a newer version of a key without ever having emitted the previous version (e.g. it was among those 5641 dropped messages).
The changefeed not exiting with an error is highly problematic. My best guess -- and it's just a guess -- is that the change aggregator does attempt to flush the sink, and that flush probably does return flushErr, but the error probably gets swallowed here:
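Purely for illustration -- this is not the change aggregator code, and drainAndStop/errSink are invented names -- the suspected failure mode is the usual pattern of logging a flush error instead of returning it:

```go
// Illustration of a flush error being logged and dropped rather than
// propagated, so the caller (and ultimately the changefeed job) never sees it.
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// errSink always fails to flush, standing in for a sink whose flushErr is set.
type errSink struct{}

func (errSink) Flush(ctx context.Context) error {
	return errors.New("kafka: Failed to deliver 5641 messages.")
}

// drainAndStop models the anti-pattern: the flush error is only logged, and
// nil is returned to the caller.
func drainAndStop(ctx context.Context, s interface{ Flush(context.Context) error }) error {
	if err := s.Flush(ctx); err != nil {
		log.Printf("flush failed: %v", err) // swallowed here
	}
	return nil // no error propagated
}

func main() {
	fmt.Println(drainAndStop(context.Background(), errSink{})) // <nil>
}
```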
It appears this issue is very severe, and that batching retries are entirely broken.
Unfortunately, it appears that disabling this feature
changefeed.batch_reduction_retry_enabled=false
does not work either -- for the same reason as above: it seems we rely on flushErr, which is wrong. I'm probably missing some details here (and I hope I'm wrong!), but I think at this point we have to consider the kafka sink entirely broken.
Jira issue: CRDB-20557
Epic CRDB-11783