Open niyanchun opened 2 hours ago
Our DolphinScheduler tasks sometimes fail with the error below:
[WI-0][TI-0] - [ERROR] 2024-10-23 18:00:15.480 +0800 o.a.d.s.m.r.BaseTaskDispatcher:[58] - Dispatch task: 看板推送任务实例同步 failed, worker group not found.
org.apache.dolphinscheduler.server.master.dispatch.exceptions.WorkerGroupNotFoundException: Cannot find worker group: Can not find worker group 数仓
    at org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager.getWorkerHostWeights(LowerWeightHostManager.java:157)
    at org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager.select(LowerWeightHostManager.java:74)
    at org.apache.dolphinscheduler.server.master.runner.dispatcher.WorkerTaskDispatcher.getTaskInstanceDispatchHost(WorkerTaskDispatcher.java:78)
    at org.apache.dolphinscheduler.server.master.runner.BaseTaskDispatcher.dispatchTask(BaseTaskDispatcher.java:55)
    at org.apache.dolphinscheduler.server.master.runner.GlobalTaskDispatchWaitingQueueLooper.run(GlobalTaskDispatchWaitingQueueLooper.java:80)
The worker group definitely exists: 800+ tasks use it, and only some of them fail, at random. Sometimes the error is "Cannot find worker group: Can not find worker group default" instead. The failed tasks always succeed on a retry/rerun. I looked into the source code, and I wonder whether workerHostWeightsMap can change while running. If yes, how? Thank you.
private Set<HostWeight> getWorkerHostWeights(String workerGroup) throws WorkerGroupNotFoundException {
    workerGroupReadLock.lock();
    try {
        Set<HostWeight> hostWeights = workerHostWeightsMap.get(workerGroup);
        if (hostWeights == null) {
            throw new WorkerGroupNotFoundException("Can not find worker group " + workerGroup);
        }
        return hostWeights;
    } finally {
        workerGroupReadLock.unlock();
    }
}
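To make the question concrete, here is a minimal sketch of the scenario I suspect. It is not DolphinScheduler code: `WorkerGroupCacheSketch`, `refresh`, and `lookup` are hypothetical names, and the idea that the map is periodically replaced from a snapshot of registered workers is only my assumption. If something like that happens and one snapshot is missing a group, a lookup made before the next refresh would fail even though the group exists, and a retry would succeed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a worker-group cache; NOT the DolphinScheduler class.
public class WorkerGroupCacheSketch {
    private final Map<String, Set<String>> hostsByGroup = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Assumed refresh path: replaces the whole map with the latest snapshot.
    public void refresh(Map<String, Set<String>> snapshot) {
        lock.writeLock().lock();
        try {
            hostsByGroup.clear();
            hostsByGroup.putAll(snapshot);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Same shape as getWorkerHostWeights: take the read lock, fail on a missing key.
    public Set<String> lookup(String group) {
        lock.readLock().lock();
        try {
            Set<String> hosts = hostsByGroup.get(group);
            if (hosts == null) {
                throw new IllegalStateException("Can not find worker group " + group);
            }
            return hosts;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        WorkerGroupCacheSketch cache = new WorkerGroupCacheSketch();

        // Normal state: the group is present, so the lookup succeeds.
        cache.refresh(Map.of("default", Set.of("worker-1:1234")));
        System.out.println(cache.lookup("default"));

        // One snapshot that lacks the group: the lookup now throws.
        cache.refresh(Map.of());
        try {
            cache.lookup("default");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }

        // The next snapshot has the group again, so a retry succeeds.
        cache.refresh(Map.of("default", Set.of("worker-1:1234")));
        System.out.println(cache.lookup("default"));
    }
}
```

The read lock only guarantees that a lookup never sees a half-written map. It does not guarantee that the map written by the last refresh contains every configured group, which is why I am asking how the map gets updated.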
What you expected to happen
The worker group should always be found.
How to reproduce
I cannot reproduce it for a specific task, and a rerun always succeeds, but it happens every day for random tasks.
Anything else
No response
Version
3.2.x