[Bug] [MasterFailoverService] syncMasterNodes does not make current slot correctly after zookeeper reconnect

minyk commented 1 year ago

Search before asking

[X] I had searched in the issues and found no similar issues.

What happened

We use Dolphinscheduler on K8S and deploy 3 masters. Version is 3.0.5

After zookeeper disconnect, registry could reconnect successfully. but current slot is miscalculated. below is our logs(edited):

master0:

19:50:00.961 MasterSchedulerBootstrap:[233] - Master schedule bootstrap loop command success, command size: 6, current slot: 2, total slot size: 3
19:50:21.305 ServerNodeManager:[302] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 down.
19:50:21.316 ServerNodeManager:[364] - update master nodes, master size: 3, slot: 0, addr: dolphinscheduler-master-0.dolphinscheduler-master-headless:5678
19:50:21.323 ServerNodeManager:[298] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 added.
19:50:21.330 ServerNodeManager:[364] - update master nodes, master size: 3, slot: 0, addr: dolphinscheduler-master-0.dolphinscheduler-master-headless:5678
19:50:22.019 MasterSchedulerBootstrap:[233] - Master schedule bootstrap loop command success, command size: 1, current slot: 0, total slot size: 3

master1:

19:50:21.304 ServerNodeManager:[302] - [WorkflowInstance-0][TaskInstance-0] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 down.
19:50:21.312 ServerNodeManager:[364] - [WorkflowInstance-0][TaskInstance-0] - update master nodes, master size: 3, slot: 1, addr: dolphinscheduler-master-1.dolphinscheduler-master-headless:5678
19:50:21.319 ServerNodeManager:[298] - [WorkflowInstance-0][TaskInstance-0] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 added.
19:50:21.327 ServerNodeManager:[364] - [WorkflowInstance-0][TaskInstance-0] - update master nodes, master size: 3, slot: 1, addr: dolphinscheduler-master-1.dolphinscheduler-master-headless:5678

master2:

19:50:01.097 MasterSchedulerBootstrap:[233] - Master schedule bootstrap loop command success, command size: 7, current slot: 0, total slot size: 3
19:50:21.305 ServerNodeManager:[302] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 down.
19:50:21.324 ServerNodeManager:[364] - update master nodes, master size: 0, slot: 0, addr: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
19:50:21.428 ServerNodeManager:[298] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 added.
19:50:21.434 ServerNodeManager:[364] - update master nodes, master size: 0, slot: 0, addr: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678

master2 is briefly disconnected and reconnected. after disconnect master0 changed slot from 2 to 0. then reconnect, master2 is still slot 0, not 2 causing our 1/3 requests are not handled by any master.

master2 detailed logs are here:

[INFO] 2023-04-09 19:50:20.701 +0900 org.apache.zookeeper.ClientCnxn:[1158] - [WorkflowInstance-0][TaskInstance-0] - Unable to read additional data from server sessionid 0x3032c34cb87185b, likely server has closed socket, closing socket connection and attempting reconnect
[WARN] 2023-04-09 19:50:20.719 +0900 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThreadPool:[109] - [WorkflowInstance-577314][TaskInstance-0] - The workflow has been executed by another thread
[INFO] 2023-04-09 19:50:20.804 +0900 org.apache.curator.framework.state.ConnectionStateManager:[251] - [WorkflowInstance-0][TaskInstance-0] - State change: SUSPENDED
[WARN] 2023-04-09 19:50:20.808 +0900 org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener:[52] - [WorkflowInstance-0][TaskInstance-0] - Registry suspended
[WARN] 2023-04-09 19:50:20.811 +0900 org.apache.dolphinscheduler.server.master.registry.MasterConnectionStateListener:[48] - [WorkflowInstance-0][TaskInstance-0] - registry connection state is SUSPENDED, ready to retry connection
[INFO] 2023-04-09 19:50:21.292 +0900 org.apache.zookeeper.ClientCnxn:[1025] - [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server zookeeper-1.zookeeper-headless/10.233.125.25:2181. Will not attempt to authenticate using SASL (unknown error)
[INFO] 2023-04-09 19:50:21.295 +0900 org.apache.zookeeper.ClientCnxn:[879] - [WorkflowInstance-0][TaskInstance-0] - Socket connection established to zookeeper-1.zookeeper-headless/10.233.125.25:2181, initiating session
[INFO] 2023-04-09 19:50:21.299 +0900 org.apache.zookeeper.ClientCnxn:[1299] - [WorkflowInstance-0][TaskInstance-0] - Session establishment complete on server zookeeper-1.zookeeper-headless/10.233.125.25:2181, sessionid = 0x3032c34cb87185b, negotiated timeout = 30000
[INFO] 2023-04-09 19:50:21.300 +0900 org.apache.curator.framework.state.ConnectionStateManager:[251] - [WorkflowInstance-0][TaskInstance-0] - State change: RECONNECTED
[INFO] 2023-04-09 19:50:21.305 +0900 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[302] - [WorkflowInstance-0][TaskInstance-0] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 down.
[INFO] 2023-04-09 19:50:21.304 +0900 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[124] - [WorkflowInstance-0][TaskInstance-0] - MASTER node deleted : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
[INFO] 2023-04-09 19:50:21.308 +0900 org.apache.dolphinscheduler.server.master.service.FailoverService:[53] - [WorkflowInstance-0][TaskInstance-0] - Master failover starting, masterServer: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
[WARN] 2023-04-09 19:50:21.323 +0900 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[362] - [WorkflowInstance-0][TaskInstance-0] - current addr:dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 is not in active master list
[INFO] 2023-04-09 19:50:21.324 +0900 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[364] - [WorkflowInstance-0][TaskInstance-0] - update master nodes, master size: 0, slot: 0, addr: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
[INFO] 2023-04-09 19:50:21.428 +0900 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[298] - [WorkflowInstance-0][TaskInstance-0] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 added.
[WARN] 2023-04-09 19:50:21.434 +0900 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[362] - [WorkflowInstance-0][TaskInstance-0] - current addr:dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 is not in active master list
[INFO] 2023-04-09 19:50:21.434 +0900 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[364] - [WorkflowInstance-0][TaskInstance-0] - update master nodes, master size: 0, slot: 0, addr: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
[WARN] 2023-04-09 19:50:22.186 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:23.187 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:24.188 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:25.189 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:26.190 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:27.191 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:28.192 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:29.192 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[INFO] 2023-04-09 19:50:29.353 +0900 org.apache.dolphinscheduler.server.master.service.FailoverService:[55] - [WorkflowInstance-0][TaskInstance-0] - Master failover finished, masterServer: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
[INFO] 2023-04-09 19:50:29.353 +0900 org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener:[63] - [WorkflowInstance-0][TaskInstance-0] - master node added : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
[WARN] 2023-04-09 19:50:30.193 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:31.194 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:32.195 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:33.196 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:34.197 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:35.197 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:36.198 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:37.199 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:38.199 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:39.200 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:40.201 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:41.201 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0

What you expected to happen

The reconnected master2 should calculate current slot 2 not 0.

How to reproduce

I'm not sure how to reproduce this problem. But in our envs, this happens once every 3 - 4 days.

Anything else

No response

Version

3.0.x

Are you willing to submit PR?

[ ] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

github-actions[bot] commented 1 year ago

Thank you for your feedback, we have received your issue, Please wait patiently for a reply.

In order for us to understand your request as soon as possible, please provide detailed information, version or pictures.
If you haven't received a reply for a long time, you can join our slack and send your question to channel #troubleshooting

minyk commented 1 year ago

I think master2's current slot miscalculated at this point:

[INFO] 2023-04-09 19:50:21.428 +0900 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[298] - [WorkflowInstance-0][TaskInstance-0] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 added.
[INFO] 2023-04-09 19:50:21.434 +0900 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[364] - [WorkflowInstance-0][TaskInstance-0] - update master nodes, master size: 0, slot: 0, addr: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678

ADD notification process at ServerNodeManager:298

minyk commented 1 year ago

I don't know why yet, but I think during updateMasterNodes

            Collection<String> currentNodes = registryClient.getMasterNodesDirectly();
            List<Server> masterNodes = registryClient.getServerList(NodeType.MASTER);

currentNodes is empty but masterNodes has some items(at least one).

minyk commented 1 year ago

https://github.com/apache/dolphinscheduler/blob/119f08d2bafd7d280dd9d323dae21ead2e08147f/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/runner/MasterSchedulerBootstrap.java#L267-L270

Just logging invalid. We could not try to recover this state?

Radeity commented 1 year ago

Hi, @minyk , in MasterConnectionStateListener of version 3.0.x, when the connection state change to RECONNECTED, master node will be removed and create new one. https://github.com/apache/dolphinscheduler/blob/565bc978eac5a72a073848b440d75b6367b4ad0e/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/registry/MasterConnectionStateListener.java#L50-L54 However, when creating new ephemeral node, we don't set heartBeat json as its initial value like

registryClient.persistEphemeral(masterRegistryPath, JSONUtils.toJsonString(masterHeartBeatTask.getHeartBeat()));

Information of master nodes will only be updated when handling node add and remove event in ServerNodeManager https://github.com/apache/dolphinscheduler/blob/565bc978eac5a72a073848b440d75b6367b4ad0e/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/registry/ServerNodeManager.java#L313-L329

In getServerList of 3.0.x version, if we don't get heartBeat info, we will skip this node. https://github.com/apache/dolphinscheduler/blob/565bc978eac5a72a073848b440d75b6367b4ad0e/dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/registry/RegistryClient.java#L94-L103

Thus, when master2 execute syncMasterNodes, it can not find itself in masterPriorityQueue. Information of master node will not be updated any more, so it will keep writing warning message in master2. https://github.com/apache/dolphinscheduler/blob/565bc978eac5a72a073848b440d75b6367b4ad0e/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/registry/ServerNodeManager.java#L356-L363

You can try to update your DS version to 3.1.x, we provide stop/waiting strategy, this bug doesn't exist :D

Radeity commented 1 year ago

I don't know why yet, but I think during updateMasterNodes
            Collection<String> currentNodes = registryClient.getMasterNodesDirectly();
            List<Server> masterNodes = registryClient.getServerList(NodeType.MASTER);
currentNodes is empty but masterNodes has some items(at least one).

Actually, currentNodes is [master0, master1, master2], masterNodes is [master0, master1].

minyk commented 1 year ago

@Radeity Thank you for the answer. We're using Dolphinscheduler through API, so it's hard to update to 3.1.x.

We are currently testing custom code changes; retry updateMasterNodes() when findCommands() is empty & ServerNodeManager. MASTER_SIZE <= 0 in MasterSchedulerBootstrap. Then after 0 masters 3 times in a row, restart master through registryClient.getStoppable().stop().

If you want to see changes, this is our working branch: https://github.com/nexr/dolphinscheduler/commits/3.0.5-nr

Thank you again.

Radeity commented 1 year ago

We are currently testing custom code changes; retry updateMasterNodes() when findCommands() is empty & ServerNodeManager. MASTER_SIZE <= 0 in MasterSchedulerBootstrap.

That can work, however, sometimes there's no commands need to be consumed, you still make some unnecessary communication to zookeeper. I think it's better to write an initial heartbeat info during reconnection :D

Radeity commented 1 year ago

I'll try to figure out the most suitable fix in next 3.0.x version. You can assign me @SbloodyS

Niko-Zeng commented 1 year ago

Hi minyk， I‘m the advocate of the Apache DolphinScheduler community， you can call me niko,Can I talk to you sometime about your use of DolphinScheduler? I will call the PMC of the community to see if I can help you solve some problems, I think the meeting may be more efficient

zhongjiajie commented 1 year ago

I created a new branch https://github.com/apache/dolphinscheduler/tree/3.0.6-prepare to track this issue, we should try to fix it and then ask @minyk to test whether if work or not, if the patch it is work we will try to release version 3.0.6 to fix this issue. cc @Radeity

Radeity commented 1 year ago

Hi, @minyk , i've linked a pr to fix this bug, it's a better fix which can solve the problem from the origin and will not incur high overhead, can you test this solution in your production environment? If it works, you can replace your previous fix in https://github.com/nexr/dolphinscheduler/commits/3.0.5-nr, either.

minyk commented 1 year ago

@Radeity the prod env is our customer's, not our's. So the testing is not easy. I'll try to first on our QA env and try to make a case. Thank you!

Radeity commented 1 year ago

@minyk Thanks, hope it helps, looking forward to your feedback :D

Niko-Zeng commented 1 year ago

I would appreciate it if you could let me know how engineers in Korea use Apache DolphinScheduler @minyk If you don't mind, we can arrange a ZOOM meeting, and I will arrange for the PMC of the community to communicate with you

minyk commented 1 year ago

@Niko-Zeng 2022 Nov to 2023 Feb, we adopt DolphinScheduler to our own product to replace Apache Oozie. But sadly this feature https://github.com/apache/dolphinscheduler/pull/13184 is too late for us. We're moving to the another scheduler framework now.

zhongjiajie commented 1 year ago

@Niko-Zeng 2022 Nov to 2023 Feb, we adopt DolphinScheduler to our own product to replace Apache Oozie. But sadly this feature #13184 is too late for us. We're moving to the another scheduler framework now.

So sorry to hear about that, but thanks for your attention to dolphinscheduler. We will speed up the release step and release new features ASAP. Hope that one day in the future you will pay attention and choose dolphinscheduler one again

xiaolailong commented 10 months ago

@Radeity Hi. As you mentioned in the above, when reconnected happen, the mater can not find its self because its heartbeat information is set to empty in zk. I can not reproduce this bug, and as I see, in the MasterHeartBeatTask.java, the heartbeat information will update every 10s, so it is not keep empty all the time. I also get this bug in production environment, so I try to reproduce but I failed. can you give me some help, Thanks!

Hi, @minyk , in MasterConnectionStateListener of version 3.0.x, when the connection state change to RECONNECTED, master node will be removed and create new one.

https://github.com/apache/dolphinscheduler/blob/565bc978eac5a72a073848b440d75b6367b4ad0e/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/registry/MasterConnectionStateListener.java#L50-L54

However, when creating new ephemeral node, we don't set heartBeat json as its initial value like
registryClient.persistEphemeral(masterRegistryPath, JSONUtils.toJsonString(masterHeartBeatTask.getHeartBeat()));
Information of master nodes will only be updated when handling node add and remove event in ServerNodeManager

https://github.com/apache/dolphinscheduler/blob/565bc978eac5a72a073848b440d75b6367b4ad0e/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/registry/ServerNodeManager.java#L313-L329

In getServerList of 3.0.x version, if we don't get heartBeat info, we will skip this node.

https://github.com/apache/dolphinscheduler/blob/565bc978eac5a72a073848b440d75b6367b4ad0e/dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/registry/RegistryClient.java#L94-L103

Thus, when master2 execute syncMasterNodes, it can not find itself in masterPriorityQueue. Information of master node will not be updated any more, so it will keep writing warning message in master2.

https://github.com/apache/dolphinscheduler/blob/565bc978eac5a72a073848b440d75b6367b4ad0e/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/registry/ServerNodeManager.java#L356-L363

You can try to update your DS version to 3.1.x, we provide stop/waiting strategy, this bug doesn't exist :D

Radeity commented 10 months ago

Hi @xiaolailong , you can upgrade to 3.0.6, the bug is already fixed in https://github.com/apache/dolphinscheduler/pull/14014

apache / dolphinscheduler