Altinity / clickhouse-sink-connector

Replicate data from MySQL, Postgres and MongoDB to ClickHouse
https://www.altinity.com
Apache License 2.0
203 stars 48 forks source link

The sink-connector lightweight stops replicating #380

Open aadant opened 8 months ago

aadant commented 8 months ago

last log before it stops :

154629410 2023-11-16 22:27:30.332 [pool-15502-thread-5] INFO com.altinity.clickhouse.sink.connector.db.DbWriter  - *************** EXECUTED BATCH Successfully Records: 12************** task(0) Thread ID: pool-15502-thread-5
Nov 16, 2023 10:29:15 PM com.github.shyiko.mysql.binlog.BinaryLogClient$5 run
INFO: Keepalive: Trying to restore lost connection to dbl1:3306

Restarting the container fixes the issue.

subkanthi commented 8 months ago

Steps to reproduce: Stop MySQL server when sink connector is already running and replicating, The exception is not handled.

aadant commented 8 months ago

The issue comes from here https://github.com/shyiko/mysql-binlog-connector-java/blob/dd710a5466381faa57442977b24fceff56a0820e/src/main/java/com/github/shyiko/mysql/binlog/BinaryLogClient.java#L830

aadant commented 8 months ago

I suspect that there is a deadlock and that the connection waits forever to close. This can be confirmed by a java thread dump.

The solution found for such a situation is to restart debezium when no changes are detected for a while : https://github.com/airbytehq/airbyte/pull/24166/commits/e48492e477d71bea88d9315fb21013698d8891ff

aadant commented 8 months ago

see also https://github.com/osheroff/mysql-binlog-connector-java/issues/86

subkanthi commented 8 months ago

Store the timestamp of the last record and if its more than 5/10 mins then restart Debezium Engine.

aadant commented 8 months ago

root cause traced to this issue https://github.com/osheroff/mysql-binlog-connector-java/issues/133 (reporter)

aadant commented 7 months ago

looks like a OpenJDK 11 bug, @subkanthi let us upgrade !

aadant commented 7 months ago

Reproduced with OpenJDK 17 see https://github.com/osheroff/mysql-binlog-connector-java/issues/133

aadant commented 7 months ago

The problem continues after the fix ...

subkanthi commented 6 months ago

From analyzing the thread dump provided, the cause seems to be because of high number of GC threads

 There are 179 Garbage Collection(GC) Threads. This is a very high number of GC Threads. It may affect the application's performance. You may consider reducing GC threads.

https://blog.fastthread.io/2015/09/02/thread-dump-analysis-pattern-several-scavengers/

As shown in the image there are only 15 threads that are created by sink connector image image

subkanthi commented 6 months ago

There are also 19 threads that are blocked in TIMED_WAITING state

main
PRIORITY : 5

THREAD ID : 0X00007FFFF0024C40

NATIVE ID : 0X7

NATIVE ID (DECIMAL) : 7

STATE : TIMED_WAITING

stackTrace:
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.2/Native Method)
- parking to wait for <0x00007fdfd03b7508> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.2/LockSupport.java:252)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@17.0.2/AbstractQueuedSynchronizer.java:1757)
at io.debezium.connector.base.ChangeEventQueue.poll(ChangeEventQueue.java:275)
at io.debezium.connector.mysql.MySqlConnectorTask.doPoll(MySqlConnectorTask.java:239)
at io.debezium.connector.common.BaseSourceTask.poll(BaseSourceTask.java:180)
at io.debezium.embedded.EmbeddedEngine.pollRecords(EmbeddedEngine.java:695)
at io.debezium.embedded.EmbeddedEngine.run(EmbeddedEngine.java:475)
at io.debezium.embedded.ConvertingEngineBuilder$2.run(ConvertingEngineBuilder.java:229)
at com.altinity.clickhouse.debezium.embedded.cdc.DebeziumChangeEventCapture.setupDebeziumEventCapture(DebeziumChangeEventCapture.java:483)
at com.altinity.clickhouse.debezium.embedded.cdc.DebeziumChangeEventCapture.setup(DebeziumChangeEventCapture.java:512)
at com.altinity.clickhouse.debezium.embedded.ClickHouseDebeziumEmbeddedApplication.start(ClickHouseDebeziumEmbeddedApplication.java:212)
at com.altinity.clickhouse.debezium.embedded.ClickHouseDebeziumEmbeddedApplication.main(ClickHouseDebeziumEmbeddedApplication.java:88)
subkanthi commented 6 months ago

https://docs.oracle.com/en/java/javase/17/gctuning/parallel-collector1.html#GUID-5A7866BE-59DF-44AD-B51A-274DE3F2BF59

subkanthi commented 5 months ago

Workaround Solution: These two configuration variables restart.event.loop and restart.event.loop.timeout.period.secs define the restart logic. The debezium event loop will be restarted when the last timestamp of the record received exceeds the restart.event.loop.timeout.period.secs

Also the debezium event loop is restarted when the timestamp of the record stored in altinity_sink_connector.replica_source_info exceeds the threshold.

Selfeer commented 5 months ago

The issue was manually verified by the Altinity QA team and marked as qa-verified.

Build used for testing: altinityinfra/clickhouse-sink-connector:464-c979a8c2583ff701dd5b63efb8f36f7fff8111e7-lt

We've verified that setting restart.event.loop to true and specifying seconds values in restart.event.loop.timeout.period.secs configuration results in a CDC event loop restart with a corresponding log message being shown when the timeout is reached.

Values specified in restart.event.loop.timeout.period.secs configuration are respected. The usage of these configurations doesn't seem to break any other clickhouse-sink-connector-lightweight functionality.

aadant commented 3 months ago

@subkanthi a workaround is in place but this still impacts the sink-connector latency (up to restart.event.loop.timeout.period.secs), IMHO the root cause should be chased down with Debezium folks and possibly the binary log library they use.

subkanthi commented 3 months ago

Analyzing the latest thread dump, all the 8 threads in change-event-source-coordinator are in TIMED_WAITING state.

debezium-mysqlconnector-embeddedconnector-change-event-source-coordinator
PRIORITY : 5

THREAD ID : 0X00007FDDF07899B0

NATIVE ID : 0X10A

NATIVE ID (DECIMAL) : 266

STATE : TIMED_WAITING

stackTrace:
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.2/Native Method)
- parking to wait for <0x00007fdfd04475f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.2/LockSupport.java:252)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@17.0.2/AbstractQueuedSynchronizer.java:1672)
at java.util.concurrent.ThreadPoolExecutor.awaitTermination(java.base@17.0.2/ThreadPoolExecutor.java:1464)
at java.util.concurrent.Executors$DelegatedExecutorService.awaitTermination(java.base@17.0.2/Executors.java:743)
at com.github.shyiko.mysql.binlog.BinaryLogClient.awaitTerminationInterruptibly(BinaryLogClient.java:1337)
at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateKeepAliveThread(BinaryLogClient.java:1329)
at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnect(BinaryLogClient.java:1314)
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.execute(MySqlStreamingChangeEventSource.java:1001)
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.execute(MySqlStreamingChangeEventSource.java:74)
at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:272)
at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:197)
at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:137)
at io.debezium.pipeline.ChangeEventSourceCoordinator$$Lambda$727/0x00000008010f77b8.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@17.0.2/Executors.java:539)
at java.util.concurrent.FutureTask.run(java.base@17.0.2/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.2/ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.2/ThreadPoolExecutor.java:635)
at java.lang.Thread.run(java.base@17.0.2/Thread.java:833)
subkanthi commented 3 months ago

Similar issue in debezium zulipchat https://debezium.zulipchat.com/#narrow/stream/348104-community-mysql/topic/2.2E5.2E1.2C.20stuck.20indefinitely.20until.20connector.20restart

Alexandru Cristu
3:46 AM
seems to happen after we see this: com.github.shyiko.mysql.binlog.BinaryLogClient$5 run
INFO: Keepalive: Trying to restore lost connection to ...

Task thread is waiting in doPoll:

"task-thread-source_65afd9cf169951285b70c4a1-0" #79816 prio=5 os_prio=0 cpu=2977135.32ms elapsed=79038.87s tid=0x00007f69b4205540 nid=0x13922 waiting on condition  [0x00007f6958504000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@17.0.9/Native Method)
    - parking to wait for  <0x00000005ad829900> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.9/LockSupport.java:252)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@17.0.9/AbstractQueuedSynchronizer.java:1757)
    at io.debezium.connector.base.ChangeEventQueue.poll(ChangeEventQueue.java:275)
    at io.debezium.connector.mysql.MySqlConnectorTask.doPoll(MySqlConnectorTask.java:252)
    at io.debezium.connector.common.BaseSourceTask.poll(BaseSourceTask.java:182)
    at org.apache.kafka.connect.runtime.AbstractWorkerSourceTask.poll(AbstractWorkerSourceTask.java:469)
    at org.apache.kafka.connect.runtime.AbstractWorkerSourceTask.execute(AbstractWorkerSourceTask.java:357)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:204)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:259)
    at org.apache.kafka.connect.runtime.AbstractWorkerSourceTask.run(AbstractWorkerSourceTask.java:77)
    at org.apache.kafka.connect.runtime.isolation.Plugins.lambda$withClassLoader$1(Plugins.java:181)
    at org.apache.kafka.connect.runtime.isolation.Plugins$$Lambda$953/0x00007f6a65b1cc50.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(java.base@17.0.9/Executors.java:539)
    at java.util.concurrent.FutureTask.run(java.base@17.0.9/FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.9/ThreadPoolExecutor.java:1136)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.9/ThreadPoolExecutor.java:635)
    at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)
The "change-event-source-coordinator" thread is stuck in BinaryLogClient.disconnect, which in turn is stuck waiting for the keepalive thread to finish:

"debezium-mysqlconnector-source_65afd9cf169951285b70c4a1-change-event-source-coordinator" #397 prio=5 os_prio=0 cpu=63726.27ms elapsed=482672.61s tid=0x00007f6a54458510 nid=0x1fd waiting on condition  [0x00007f6961e9c000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.9/Native Method)
        - parking to wait for  <0x000000050c001a90> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.9/LockSupport.java:252)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@17.0.9/AbstractQueuedSynchronizer.java:1672)
        at java.util.concurrent.ThreadPoolExecutor.awaitTermination(java.base@17.0.9/ThreadPoolExecutor.java:1464)
        at java.util.concurrent.Executors$DelegatedExecutorService.awaitTermination(java.base@17.0.9/Executors.java:743)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.awaitTerminationInterruptibly(BinaryLogClient.java:1337)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateKeepAliveThread(BinaryLogClient.java:1329)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnect(BinaryLogClient.java:1314)
        at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.execute(MySqlStreamingChangeEventSource.java:1001)
        at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.execute(MySqlStreamingChangeEventSource.java:74)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:272)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:197)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:137)
        at io.debezium.pipeline.ChangeEventSourceCoordinator$$Lambda$1777/0x00007f6a660beac0.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(java.base@17.0.9/Executors.java:539)
        at java.util.concurrent.FutureTask.run(java.base@17.0.9/FutureTask.java:264)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.9/ThreadPoolExecutor.java:1136)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.9/ThreadPoolExecutor.java:635)
        at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)
The keepalive thread is stuck waiting for SSLSocket to close:

"blc-keepalive-production-aurora-80-cluster.cluster-cfqdxwvdz9bt.us-east-1.rds.amazonaws.com:3306" #401 prio=5 os_prio=0 cpu=56.99ms elapsed=569119.58s tid=0x00007f69d8079dc0 nid=0x201 waiting on condition  [0x00007f69624a3000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@17.0.9/Native Method)
    - parking to wait for  <0x000000050c032238> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(java.base@17.0.9/LockSupport.java:211)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.9/AbstractQueuedSynchronizer.java:715)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.9/AbstractQueuedSynchronizer.java:938)
    at java.util.concurrent.locks.ReentrantLock$Sync.lock(java.base@17.0.9/ReentrantLock.java:153)
    at java.util.concurrent.locks.ReentrantLock.lock(java.base@17.0.9/ReentrantLock.java:322)
    at sun.security.ssl.SSLSocketImpl.closeNotify(java.base@17.0.9/SSLSocketImpl.java:735)
    at sun.security.ssl.TransportContext.closeNotify(java.base@17.0.9/TransportContext.java:287)
    at sun.security.ssl.TransportContext.initiateOutboundClose(java.base@17.0.9/TransportContext.java:598)
    at sun.security.ssl.TransportContext.closeOutbound(java.base@17.0.9/TransportContext.java:582)
    at sun.security.ssl.SSLSocketImpl.shutdownOutput(java.base@17.0.9/SSLSocketImpl.java:875)
    at com.github.shyiko.mysql.binlog.network.protocol.PacketChannel.close(PacketChannel.java:125)
    at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnectChannel(BinaryLogClient.java:1361)
    at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect(BinaryLogClient.java:1345)
    at com.github.shyiko.mysql.binlog.BinaryLogClient.access$800(BinaryLogClient.java:96)
    at com.github.shyiko.mysql.binlog.BinaryLogClient$5.run(BinaryLogClient.java:894)
    at java.util.concurrent.Executors$RunnableAdapter.call(java.base@17.0.9/Executors.java:539)
    at java.util.concurrent.FutureTask.run(java.base@17.0.9/FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.9/ThreadPoolExecutor.java:1136)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.9/ThreadPoolExecutor.java:635)
    at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)
Restarting the connector caused streaming to resume, the above threads remain stuck. Restart is currently needed every few hours, but it seems to become more frequent, at some point will need to restart the JVM to clean up these leaked threads.

Investigating potentially related JDK issue https://bugs.openjdk.org/browse/JDK-8241239?focusedId=14646711&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14646711

Any suggestions how to troubleshoot this further? Have you seen this before?
subkanthi commented 3 months ago

Debezium issue - https://issues.redhat.com/browse/DBZ-7570

subkanthi commented 3 months ago

said:

It's on my radar for 2.7.0.Alpha1 Thanks Chris, is there any workarounds for now, like mysql version change maybe as discussed in the other thread?

Chris Cranford (Naros): said:

Thanks Chris, is there any workarounds for now, like mysql version change maybe as discussed in the other thread?

Unfortunately nothing really at this time.

subkanthi commented 2 months ago

Even though debezium team addressed the issue in 2.7.0 https://issues.redhat.com/browse/DBZ-7570, they have introduced another bug https://issues.redhat.com/browse/DBZ-7873