[Open] daniel-crlabs opened this issue 3 years ago
It appears that this was happening during the backfill? @daniel-crlabs, can you confirm?
My apologies, I just saw this message. How would I be able to tell when this happened?
> It appears that this was happening during the backfill? @daniel-crlabs, can you confirm?
This is what the customer stated when they originally created the ticket, so I assume yes:
> We have a cluster that is not currently receiving any traffic. There is a table with 4,536,254,464 rows. We enabled a CDC changefeed on the table on 4/21. We are seeing approximately 400,000 messages per second emitted by the CRDB cluster to Kafka. At that rate, we calculate the entire backfill should be emitted in roughly 11,340 seconds (about 3.15 hours, significantly less than one day). We are, however, still seeing messages emitted to Kafka at the same rate over 10 days later.
We are hitting this too. We have a table with ~15M rows; the backfill never completes, and the Kafka topic has accumulated over 15 billion messages, with many duplicates.
What is the recommendation for this?
In my particular case, the error from Kafka is that the message size is too big. I increased the maximum message size, but since I never know how large a row could be, I don't know what value to set it to.
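For what it's worth, the size limit is enforced on both sides of the connection: the producer client rejects messages above its own cap, and the broker independently rejects anything above its message.max.bytes (or the topic-level max.message.bytes override), so both have to be raised together. Below is a minimal sketch of the client side, assuming a Sarama-based producer (the Kafka library CockroachDB's sink is built on); the broker address and the 10 MiB figure are placeholders:

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Return.Successes = true // required by SyncProducer

	// Client-side cap: messages larger than this fail with a "message too
	// large" style error before they ever reach the broker.
	cfg.Producer.MaxMessageBytes = 10 << 20 // 10 MiB, placeholder value

	// Raising the client cap alone is not enough; the broker's
	// message.max.bytes (or the topic's max.message.bytes) must also allow it.
	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()
}
```

Even with both limits raised, there is no client-side way to know the largest possible row in advance, which is the crux of the question above.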
Hi guys - wanted to follow up on this. We're looking into product-level ways to address invalid messages. I'd love to get your feedback on some features we are considering.
Do you think we can close this issue w/ recent dynamic batching?
@miretskiy Is this particular issue fixed though? Does CDC still continually retry and never finish if there's a Kafka error?
22.2.0 has dynamic batching. CDC still continuously retries, but it automatically reduces batch size if necessary.
@miretskiy, what is the new behavior? Will it retry the failed batch, or will it restart the table scan with a smaller batch?
It will retry the failed batch, with fewer messages.
@kmanamcheri It still retries (at this point, forever), since we can't know whether the error is transient. This will change in the near future.
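To make the retry-with-smaller-batch behavior concrete, here is an illustrative Go sketch of the idea: on a rejected batch, the same messages are retried with the batch size halved, rather than restarting the backfill. The function name and the toy sink are made up for this example; this is not the actual changefeed code.

```go
package main

import (
	"fmt"
	"log"
)

// Message stands in for an encoded changefeed row (illustrative only).
type Message []byte

// flushWithBackoff sends msgs in batches, halving the batch size whenever
// the sink rejects a batch, instead of abandoning or restarting the whole
// backfill.
func flushWithBackoff(send func([]Message) error, msgs []Message) error {
	batchSize := len(msgs)
	for i := 0; i < len(msgs); {
		n := batchSize
		if rem := len(msgs) - i; n > rem {
			n = rem
		}
		if err := send(msgs[i : i+n]); err != nil {
			if n == 1 {
				// Even a single message fails: surface the error rather
				// than retrying forever.
				return err
			}
			batchSize = n / 2 // retry the same span with fewer messages
			continue
		}
		i += n
	}
	return nil
}

func main() {
	// Toy sink that rejects any batch of more than 2 messages, to show the
	// halving behavior.
	send := func(batch []Message) error {
		if len(batch) > 2 {
			return fmt.Errorf("batch of %d messages too large", len(batch))
		}
		return nil
	}
	if err := flushWithBackoff(send, make([]Message, 10)); err != nil {
		log.Fatal(err)
	}
	fmt.Println("flushed all messages")
}
```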
Describe the problem
The client enabled a CDC changefeed into Kafka for a table with more than 4 billion rows. The job never completes. It appears that some rows are too large for the sink to accept; CRDB does not proceed past such a row and keeps retrying the backfill nonstop.
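For context, enabling a changefeed like this one is a single SQL statement. A minimal sketch of doing it from Go follows; the connection string, table name (big_table), and Kafka address are placeholders:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Enabling the changefeed kicks off the initial backfill of every row,
	// followed by a stream of ongoing changes.
	_, err = db.Exec(`CREATE CHANGEFEED FOR TABLE big_table INTO 'kafka://localhost:9092'`)
	if err != nil {
		log.Fatal(err)
	}
}
```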
To Reproduce
We analyzed the logs and noticed a recurring pattern of error messages indicating the problem.
Expected behavior
The job should either complete or fail, instead of being stuck in an endless retry loop.
Additional data
Job status: note that high_water_timestamp has been 0 since April 21st, which means this job has not yet started sending data to Kafka.
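For anyone checking this on their own cluster, the watermark is visible in crdb_internal.jobs. A sketch of the query from Go, reusing an open db handle like the one in the earlier sketch (imports: database/sql, fmt):

```go
// checkProgress lists changefeed jobs and their watermarks. A zero or NULL
// high_water_timestamp means the initial scan (backfill) has not completed,
// which is exactly the symptom here.
func checkProgress(db *sql.DB) error {
	rows, err := db.Query(`SELECT job_id, status, high_water_timestamp
FROM crdb_internal.jobs
WHERE job_type = 'CHANGEFEED'`)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var id int64
		var status string
		var hwm sql.NullString // DECIMAL column; NULL until progress is made
		if err := rows.Scan(&id, &status, &hwm); err != nil {
			return err
		}
		fmt.Printf("job %d: %s, high_water_timestamp=%s\n", id, status, hwm.String)
	}
	return rows.Err()
}
```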
CRDB errors from Kafka:
After a few minutes, the operations seem to fail:
Eventually the same process repeats itself; it seems to start up but eventually dies again:
This only goes on for a few seconds, then the following starts showing up again:
And then a few minutes later:
Environment:
Additional context
What was the impact?
Can't send data from CRDB to Kafka
gz#8388
Jira issue: CRDB-7494