Closed minyk closed 1 year ago
#troubleshooting
I think master2's current slot miscalculated at this point:
[INFO] 2023-04-09 19:50:21.428 +0900 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[298] - [WorkflowInstance-0][TaskInstance-0] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 added.
[INFO] 2023-04-09 19:50:21.434 +0900 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[364] - [WorkflowInstance-0][TaskInstance-0] - update master nodes, master size: 0, slot: 0, addr: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
ADD notification process at ServerNodeManager:298.
I don't know why yet, but I think that during updateMasterNodes:
Collection<String> currentNodes = registryClient.getMasterNodesDirectly();
List<Server> masterNodes = registryClient.getServerList(NodeType.MASTER);
currentNodes is empty but masterNodes has some items (at least one).
Right now this invalid state is only logged. Couldn't we try to recover from it?
Hi @minyk, in MasterConnectionStateListener of version 3.0.x, when the connection state changes to RECONNECTED, the master node is removed and a new one is created.
https://github.com/apache/dolphinscheduler/blob/565bc978eac5a72a073848b440d75b6367b4ad0e/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/registry/MasterConnectionStateListener.java#L50-L54
However, when creating the new ephemeral node, we don't set the heartBeat JSON as its initial value, which would look like:
registryClient.persistEphemeral(masterRegistryPath, JSONUtils.toJsonString(masterHeartBeatTask.getHeartBeat()));
Information about master nodes is only updated when handling node add and remove events in ServerNodeManager:
https://github.com/apache/dolphinscheduler/blob/565bc978eac5a72a073848b440d75b6367b4ad0e/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/registry/ServerNodeManager.java#L313-L329
In getServerList of version 3.0.x, if we can't get heartBeat info for a node, we skip that node.
https://github.com/apache/dolphinscheduler/blob/565bc978eac5a72a073848b440d75b6367b4ad0e/dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/registry/RegistryClient.java#L94-L103
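To make the skipping behaviour concrete, here is a minimal sketch, not the actual RegistryClient code. It assumes the registry can be modelled as a map from node name to heartbeat payload; the class and method names (HeartbeatFilter, nodesWithHeartbeat) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of "skip nodes that have no heartbeat data".
final class HeartbeatFilter {
    private HeartbeatFilter() {}

    /** Returns the node names whose heartbeat payload is non-empty, in map iteration order. */
    static List<String> nodesWithHeartbeat(Map<String, String> heartbeatByNode) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : heartbeatByNode.entrySet()) {
            String heartbeat = entry.getValue();
            // A freshly re-created ephemeral node with no initial value is dropped here.
            if (heartbeat == null || heartbeat.trim().isEmpty()) {
                continue;
            }
            result.add(entry.getKey());
        }
        return result;
    }
}
```

With master2 re-registered but without an initial heartbeat value, it would be absent from the returned list even though its node exists in the registry.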
Thus, when master2 executes syncMasterNodes, it cannot find itself in masterPriorityQueue. The master node information is never updated again, so master2 keeps writing the warning message.
https://github.com/apache/dolphinscheduler/blob/565bc978eac5a72a073848b440d75b6367b4ad0e/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/registry/ServerNodeManager.java#L356-L363
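A rough sketch of that slot calculation, under stated assumptions: the slot is taken to be this master's index in an ordered list of visible masters, and sorting by address is only a stand-in for whatever ordering masterPriorityQueue really uses. SlotCalc and SlotState are hypothetical names. When the master is not in the list, the sketch reports size 0 and slot 0, matching the "master size: 0, slot: 0" log line above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of slot assignment; not the ServerNodeManager implementation.
final class SlotCalc {
    private SlotCalc() {}

    static final class SlotState {
        final int masterSize;
        final int slot;

        SlotState(int masterSize, int slot) {
            this.masterSize = masterSize;
            this.slot = slot;
        }
    }

    /** Computes (masterSize, slot) for selfAddr given the masters it can currently see. */
    static SlotState compute(List<String> visibleMasters, String selfAddr) {
        List<String> ordered = new ArrayList<>(visibleMasters);
        Collections.sort(ordered); // assumption: ordering by address
        int index = ordered.indexOf(selfAddr);
        if (index < 0) {
            // Self is missing from the list: the invalid state seen on master2.
            return new SlotState(0, 0);
        }
        return new SlotState(ordered.size(), index);
    }
}
```

If master2 sees [master0, master1, master2] it gets size 3 and slot 2; if it sees only [master0, master1] it falls into the invalid branch.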
You can try upgrading your DS version to 3.1.x, where we provide a stop/waiting strategy and this bug doesn't exist :D
> I don't know why yet, but I think during updateMasterNodes
> Collection<String> currentNodes = registryClient.getMasterNodesDirectly();
> List<Server> masterNodes = registryClient.getServerList(NodeType.MASTER);
> currentNodes is empty but masterNodes has some items (at least one).
Actually, currentNodes is [master0, master1, master2], and masterNodes is [master0, master1].
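One way to spot this state is to diff the two collections. The sketch below is only an illustration: it assumes both sides have already been reduced to comparable address strings, and MasterNodeDiff / missingFromServerList are hypothetical names, not part of DolphinScheduler.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical diagnostic helper: which registered masters are missing from the parsed list?
final class MasterNodeDiff {
    private MasterNodeDiff() {}

    /** Returns addresses present in the registry listing but absent from the parsed server list. */
    static Set<String> missingFromServerList(Collection<String> registeredAddrs,
                                             Collection<String> parsedAddrs) {
        Set<String> missing = new LinkedHashSet<>(registeredAddrs);
        missing.removeAll(parsedAddrs);
        return missing;
    }
}
```

For the values above, the difference would contain only master2, which is the node that was re-created without heartbeat data.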
@Radeity Thank you for the answer. We're using Dolphinscheduler through API, so it's hard to update to 3.1.x.
We are currently testing custom code changes: retry updateMasterNodes() when findCommands() is empty and ServerNodeManager.MASTER_SIZE <= 0 in MasterSchedulerBootstrap. Then, after seeing 0 masters 3 times in a row, restart the master through registryClient.getStoppable().stop().
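The idea can be sketched roughly as follows. This is not the code from our branch: ZeroMasterWatchdog and its callbacks are hypothetical, with resync standing in for updateMasterNodes() and restart standing in for registryClient.getStoppable().stop().

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of "resync on empty poll, restart after N consecutive zero-master results".
final class ZeroMasterWatchdog {
    private final int threshold;
    private final Runnable resync;
    private final Runnable restart;
    private int consecutiveZero;

    ZeroMasterWatchdog(int threshold, Runnable resync, Runnable restart) {
        this.threshold = threshold;
        this.resync = resync;
        this.restart = restart;
    }

    /** Call this when a poll for commands returned nothing. */
    void onEmptyPoll(IntSupplier masterSize) {
        if (masterSize.getAsInt() > 0) {
            consecutiveZero = 0; // healthy: nothing to do
            return;
        }
        resync.run(); // try to refresh the master list first
        if (masterSize.getAsInt() > 0) {
            consecutiveZero = 0; // the resync recovered the state
            return;
        }
        consecutiveZero++;
        if (consecutiveZero >= threshold) {
            restart.run(); // give up and restart this master
            consecutiveZero = 0;
        }
    }
}
```

A threshold of 3 matches the "3 times in a row" rule; the counter resets as soon as a non-zero master size is observed.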
If you want to see changes, this is our working branch: https://github.com/nexr/dolphinscheduler/commits/3.0.5-nr
Thank you again.
> We are currently testing custom code changes; retry updateMasterNodes() when findCommands() is empty & ServerNodeManager.MASTER_SIZE <= 0 in MasterSchedulerBootstrap.
That can work. However, sometimes there are simply no commands to consume, and you would still make unnecessary calls to ZooKeeper. I think it's better to write initial heartbeat info during reconnection :D
I'll try to figure out the most suitable fix for the next 3.0.x version. You can assign this to me, @SbloodyS.
Hi minyk, I'm the advocate of the Apache DolphinScheduler community; you can call me niko. Can I talk to you sometime about your use of DolphinScheduler? I will ask the PMC of the community to join to see if we can help you solve some problems. I think a meeting may be more efficient.
I created a new branch https://github.com/apache/dolphinscheduler/tree/3.0.6-prepare to track this issue. We should try to fix it and then ask @minyk to test whether it works. If the patch works, we will try to release version 3.0.6 to fix this issue. cc @Radeity
Hi @minyk, I've linked a PR to fix this bug. It's a better fix that solves the problem at its root and does not incur high overhead. Can you test this solution in your production environment? If it works, you can also use it to replace your previous fix in https://github.com/nexr/dolphinscheduler/commits/3.0.5-nr.
@Radeity The prod env is our customer's, not ours, so testing is not easy. I'll try it first on our QA env and try to build a test case. Thank you!
@minyk Thanks, hope it helps, looking forward to your feedback :D
@minyk I would appreciate it if you could let me know how engineers in Korea use Apache DolphinScheduler. If you don't mind, we can arrange a Zoom meeting, and I will ask the PMC of the community to join and talk with you.
@Niko-Zeng From Nov 2022 to Feb 2023, we adopted DolphinScheduler in our own product to replace Apache Oozie. But sadly this feature https://github.com/apache/dolphinscheduler/pull/13184 came too late for us. We're moving to another scheduler framework now.
So sorry to hear that, but thanks for your attention to DolphinScheduler. We will speed up the release process and ship new features ASAP. We hope that one day you will take another look and choose DolphinScheduler once again.
@Radeity Hi. As you mentioned above, when a reconnect happens, the master cannot find itself because its heartbeat information is empty in ZK. I cannot reproduce this bug, and as far as I can see in MasterHeartBeatTask.java, the heartbeat information is updated every 10s, so it does not stay empty the whole time. I also hit this bug in a production environment, so I tried to reproduce it but failed. Can you give me some help? Thanks!
Hi @xiaolailong , you can upgrade to 3.0.6, the bug is already fixed in https://github.com/apache/dolphinscheduler/pull/14014
What happened
We use DolphinScheduler on K8s and deploy 3 masters. The version is 3.0.5.
After a ZooKeeper disconnect, the registry reconnected successfully, but the current slot was miscalculated. Below are our logs (edited):
master0:
master1:
master2:
master2 was briefly disconnected and then reconnected. After the disconnect, master0 changed slot from 2 to 0. After the reconnect, master2 is still at slot 0, not 2, so 1/3 of our requests are not handled by any master.
master2 detailed logs are here:
What you expected to happen
The reconnected master2 should calculate its current slot as 2, not 0.
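To illustrate why a wrong slot leaves a third of the requests unhandled, here is a small sketch. It assumes that each master only picks up commands whose id modulo the master count equals its own slot, and that a master with size 0 picks up nothing; that rule and the names SlotCoverage, claims and unclaimed are assumptions made for this illustration, not taken from the DolphinScheduler source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of slot-based command assignment across masters.
final class SlotCoverage {
    private SlotCoverage() {}

    /** True if a master with the given (masterSize, slot) view would pick up this command id. */
    static boolean claims(int masterSize, int slot, long commandId) {
        return masterSize > 0 && commandId % masterSize == slot;
    }

    /** Lists command ids in [fromId, toId] that no master claims; each view is {masterSize, slot}. */
    static List<Long> unclaimed(int[][] views, long fromId, long toId) {
        List<Long> result = new ArrayList<>();
        for (long id = fromId; id <= toId; id++) {
            boolean claimed = false;
            for (int[] view : views) {
                if (claims(view[0], view[1], id)) {
                    claimed = true;
                    break;
                }
            }
            if (!claimed) {
                result.add(id);
            }
        }
        return result;
    }
}
```

With master0 at {3, 0}, master1 at {3, 1} and master2 stuck at {0, 0}, every id with remainder 2 modulo 3 stays unclaimed, which matches the 1/3 figure in the report.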
How to reproduce
I'm not sure how to reproduce this problem. But in our envs, this happens once every 3 - 4 days.
Version
3.0.x