I'm trying to troubleshoot an issue I'm having with the MySQL JDBC sink connectors that I seem to only be having when trying to sink to our Galera cluster. I've tried adjusting batch.size and max.poll.records up and down, but it doesn't seem to prevent these regular timeouts I'm seeing in the logs, which I think are causing the tasks to be inefficient. I see log entries like this regularly:
2023-07-07 23:39:18,464 INFO [jdbc.my.jdbc-sink|task-0] [Consumer clientId=connector-consumer-jdbc.my.jdbc-sink-0, groupId=connect-jdbc.my.jdbc-sink] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 2: (org.apache.kafka.clients.FetchSessionHandler) [kafka-coordinator-heartbeat-thread | connect-jdbc.my.jdbc-sink]
org.apache.kafka.common.errors.DisconnectException
2023-07-07 23:39:22,058 WARN [jdbc.my.jdbc-sink|task-0] WorkerSinkTask{id=jdbc.my.jdbc-sink-0} Commit of offsets timed out (org.apache.kafka.connect.runtime.WorkerSinkTask) [task-thread-jdbc.my.jdbc-sink-0]
30,000ms seems like a reasonable timeout setting to me, but I don't completely understand all of the mechanics here. Should I try increasing a timeout setting?
The database server cluster is hardware, on servers with 64 cores and 512GB RAM, and should be capable of handling a large volume of writes, so it doesn't make sense to me that these sink connectors are performing so much worse than they do when I use them on the relatively small RDS instances.
I tried increasing batch.size and max.poll.records as high as 12,000 each, but that makes latency go way up: more events are processed per batch when a batch does go through, yet the overall number of events processed doesn't change and consumer lag never improves. Reducing them makes event processing more consistent, but again with the same overall throughput, and always the timeout exceptions above. There is very little documentation for me to scour, so I'm grasping at straws.
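In case it helps to be concrete, this is the shape of the timeout overrides I'm considering trying next (the values are placeholders, not what's currently deployed, and I may be misreading which timeout is actually firing). If I understand right, `consumer.override.*` settings in the connector config only take effect when the worker allows client overrides:

```properties
# Worker config (e.g. connect-distributed.properties)
# Required for per-connector consumer.override.* settings to be honored
connector.client.config.override.policy=All
# How long a sink task's offset commit may take before the
# "Commit of offsets timed out" warning (default is 5000 ms)
offset.flush.timeout.ms=10000

# Connector config (the JDBC sink itself) -- placeholder values
# Consumer request timeout; the default 30000 ms may be what I'm hitting
consumer.override.request.timeout.ms=60000
consumer.override.max.poll.records=500
batch.size=500
```

I'd welcome guidance on whether raising these is the right direction, or whether the DisconnectException points at something on the broker/network side instead.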