Altinity / clickhouse-sink-connector

Replicate data from MySQL, Postgres and MongoDB to ClickHouse
https://www.altinity.com
Apache License 2.0

MySQL Snapshot fails at 50m rows and keeps starting over and over again #812

Open · poweroftrue opened 1 week ago

poweroftrue commented 1 week ago

Hello there,

The lightweight version is true magic: one container and you have near-real-time replication from MySQL to ClickHouse, with all the columns mapped correctly for you. A piece of art!

For a while now I've been debugging an issue where java.lang.InterruptedException: null is thrown at around 50 million rows; the snapshot crashes and then starts over, again and again.

java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1638) ~[?:?]
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435) ~[?:?]
    at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:200) ~[?:?]
    at io.debezium.relational.RelationalSnapshotChangeEventSource.createDataEvents(RelationalSnapshotChangeEventSource.java:515) ~[app.jar:?]
    at io.debezium.relational.RelationalSnapshotChangeEventSource.doExecute(RelationalSnapshotChangeEventSource.java:171) ~[app.jar:?]
    at io.debezium.pipeline.source.AbstractSnapshotChangeEventSource.execute(AbstractSnapshotChangeEventSource.java:96) ~[app.jar:?]
    at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:285) ~[app.jar:?]
    at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:269) ~[app.jar:?]
    at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:192) ~[app.jar:?]
    at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:143) ~[app.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:831) [?:?]

What I have tried so far:

  1. Increased all the timeout in MySQL to the max:

    SELECT variable_name, variable_value
    FROM performance_schema.global_variables
    WHERE variable_name IN ('interactive_timeout', 'wait_timeout',
        'connect_timeout',
        'net_read_timeout',
        'net_write_timeout',
        'lock_wait_timeout',
        'innodb_lock_wait_timeout',
        'max_execution_time')
    ORDER BY variable_name;

    [screenshot: the timeout variable values after raising them]

  2. Increased memory and watched the used memory using jcmd 1 GC.heap_info inside the container:

    [screenshot: jcmd 1 GC.heap_info output showing heap usage]

So it's not a memory usage problem.

  3. Compiled different versions (branches) and deployed them to k8s, but still the same error.
  4. Tried different large tables; the same thing happens, but it breaks at a different row count (the other tables were smaller).

I'm using no snapshot locks: snapshot.locking.mode: "none"
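(For context, a minimal sketch of the relevant snapshot keys in the connector's YAML config; snapshot.locking.mode is the setting from this thread, and snapshot.mode is a standard Debezium MySQL property shown only for completeness.)

    # Hedged sketch: only snapshot.locking.mode is taken from this setup,
    # the other value is an illustrative default.
    snapshot.mode: "initial"            # snapshot existing rows, then stream the binlog
    snapshot.locking.mode: "none"       # do not take table locks during the snapshot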

Related issue https://github.com/Altinity/clickhouse-sink-connector/issues/722

What I'm trying now is attaching a local IntelliJ remote debugger, but I'm having a hard time stopping on the right exception.
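(For anyone attempting the same, a minimal sketch of the standard JDWP options for attaching a remote debugger to the container's JVM; the port is an arbitrary example and must also be exposed by the container.)

    # Hedged example: enable the JDWP agent so IntelliJ's "Remote JVM Debug"
    # run configuration can attach; 5005 is an arbitrary port choice.
    JAVA_TOOL_OPTIONS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005"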

aadant commented 1 week ago

@poweroftrue for MySQL, you can take a snapshot via MySQL Shell (see the python utilities in this project). It takes a consistent snapshot; you can then restore it to ClickHouse and just start replicating from the snapshot position, or even before it.

You can dump TBs of data this way.
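(A minimal sketch of a consistent dump with MySQL Shell's util.dumpSchemas; the schema name, output path, and thread count are illustrative. The dump metadata records the binlog coordinates, which is what lets replication start from the snapshot position.)

    // Hedged example, MySQL Shell JS mode: dump one schema consistently.
    // Schema name, output path, and thread count are illustrative.
    util.dumpSchemas(['mydb'], '/backups/mydb-dump', { threads: 8, consistent: true });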

poweroftrue commented 1 week ago

Thanks @aadant!

Yes, I checked the tool, it's amazing; it dumps TBs of data very fast. I will use it, but only after spending a couple more weeks on this issue, because the built-in snapshot fits my use case better.

Having the Debezium snapshot working opens the door to simpler automation: running this as a sidecar container for a large number of MySQL services.

poweroftrue commented 1 week ago

Hello @aadant, I finally found what causes this issue!

I was debugging whether certain rows cause the interruption, but that wasn't it. I enabled all logs (DEBUG) in log4j.xml and found that Debezium was being asked to stop.

It turned out to be restart.event.loop kicking in and restarting Debezium because no events had arrived (which is normal during this kind of snapshot):

2024-09-12 03:03:11.146 INFO  - Stopping the embedded engine
2024-09-12 03:03:11.146 INFO  - Waiting for PT5M for connector to stop
2024-09-12 03:03:11.162 INFO  - Stopping the task and engine
2024-09-12 03:03:11.162 INFO  - Stopping down connector
2024-09-12 03:04:41.162 WARN  - Coordinator didn't stop in the expected time, shutting down executor now

I set it to false and now it's snapshotting rows beyond 100 million.
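(For anyone hitting the same thing, a minimal sketch of that override, assuming the lightweight connector's flat YAML config format.)

    # Hedged sketch: keep the automatic restart-on-idle logic off so a long,
    # event-free initial snapshot is not interrupted midway.
    restart.event.loop: "false"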

Should I add a note to the docs that this needs to be disabled during the snapshot?

aadant commented 1 week ago

Good catch! I think it is a bug. We should fix this, as it is very misleading. The connector should only restart if it is idle (this is a workaround for a Debezium bug that should be fixed in the latest releases).

aadant commented 1 week ago

See https://github.com/Altinity/clickhouse-sink-connector/issues/380. It is recommended to replicate with a heartbeat table so you can tell whether the connector is lagging. If there are no updates, the connector will not update the replica_source_info table.
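(A minimal sketch of the heartbeat setup referenced above, using standard Debezium MySQL connector properties; the table name, schema, and interval are illustrative.)

    -- Hedged example: a tiny table on the MySQL source that the heartbeat
    -- query touches on a schedule, so the connector keeps emitting events
    -- and advancing its position even when the monitored tables are idle.
    CREATE TABLE IF NOT EXISTS replication.heartbeat (id INT PRIMARY KEY, ts TIMESTAMP(6));
    INSERT INTO replication.heartbeat (id, ts) VALUES (1, NOW(6))
        ON DUPLICATE KEY UPDATE ts = NOW(6);

and the matching connector properties:

    heartbeat.interval.ms: "10000"
    heartbeat.action.query: "UPDATE replication.heartbeat SET ts = NOW(6) WHERE id = 1"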

@subkanthi let us fix this even if there is a workaround.