confluentinc / kafka-connect-jdbc

Kafka Connect connector for JDBC-compatible databases

MySQL JDBC connector timeouts: Cancelled in-flight FETCH request errors #1352

Open jslusher opened 12 months ago

jslusher commented 12 months ago

I'm trying to troubleshoot an issue with the MySQL JDBC sink connectors that only seems to occur when sinking to our Galera cluster. I've tried adjusting batch.size and max.poll.records up and down, but that doesn't prevent the regular timeouts I'm seeing in the logs, which I think are making the tasks inefficient. I see log entries like this regularly:

    2023-07-07 23:39:18,464 INFO [jdbc.my.jdbc-sink|task-0] [Consumer clientId=connector-consumer-jdbc.my.jdbc-sink-0, groupId=connect-jdbc.my.jdbc-sink] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 2: (org.apache.kafka.clients.FetchSessionHandler) [kafka-coordinator-heartbeat-thread | connect-jdbc.my.jdbc-sink]
    org.apache.kafka.common.errors.DisconnectException
    2023-07-07 23:39:22,058 WARN [jdbc.my.jdbc-sink|task-0] WorkerSinkTask{id=jdbc.my.jdbc-sink-0} Commit of offsets timed out (org.apache.kafka.connect.runtime.WorkerSinkTask) [task-thread-jdbc.my.jdbc-sink-0]

A 30,000 ms timeout seems reasonable to me, but I don't completely understand all of the mechanics here. Should I try increasing one of the timeout settings?
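
For reference, this is the kind of adjustment I've been considering but haven't confirmed is correct (values are illustrative only): raising the consumer request timeout per connector via the consumer.override. prefix, and raising the worker-level offset.flush.timeout.ms, which I believe is what the "Commit of offsets timed out" warning is tied to.

    # connector config (consumer overrides; example values, not recommendations)
    consumer.override.request.timeout.ms: 60000
    consumer.override.session.timeout.ms: 45000

    # worker config, not a connector property (example value)
    offset.flush.timeout.ms: 10000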

The database cluster runs on bare-metal hardware with 64 cores and 512 GB of RAM per server and should be capable of handling a large volume of writes, so it doesn't make sense to me that these sink connectors perform so much worse here than they do against our relatively small RDS instances.

Here are the settings currently:

    ...
    consumer.override.max.poll.records: 1000
    consumer.override.max.partition.fetch.bytes: 17000000
    batch.size: 1000
    auto.create: false
    auto.evolve: false
    delete.enabled: true
    pk.mode: record_key
    pk.fields: retrace_id,red_id
    insert.mode: upsert
    table.name.format: retrace_rec_sink
    tasks.max: 12
    topics: debezium.service.database.retrace_rec
    transforms: unwrap
    transforms.unwrap.type: io.debezium.transforms.ExtractNewRecordState
    transforms.unwrap.drop.tombstones: false
    transforms.unwrap.delete.handling.mode: none
    ...

I tried increasing batch.size and max.poll.records as high as 12,000 each, but that makes latency go way up: more events are processed per batch when processing does happen, yet the overall number of events processed doesn't change and consumer lag doesn't improve. Reducing them gives more consistent event processing, but again the overall number is the same, and the timeout exceptions above are always present. There is very little documentation for me to scour, so I'm grasping at straws.
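
For completeness, this is the sort of middle-ground variation I've been experimenting with (values are illustrative, and I'm not confident these are the right levers): keeping batch.size and max.poll.records aligned, since my understanding is that the sink can only batch as many records as the consumer hands it per poll, and giving each fetch a bit more room to accumulate data before returning.

    # illustrative tuning attempt, not a recommendation
    consumer.override.max.poll.records: 3000
    consumer.override.fetch.min.bytes: 1048576
    consumer.override.fetch.max.wait.ms: 1000
    batch.size: 3000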