apache / dolphinscheduler

Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
https://dolphinscheduler.apache.org/
Apache License 2.0
12.83k stars 4.61k forks source link

[Bug] [Module Name]Process Instance is in RUNNING status for a long time. #11227

Closed MarsByteso0 closed 1 year ago

MarsByteso0 commented 2 years ago

Search before asking

What happened

Process Instance is in RUNNING status for a long time.

What you expected to happen

Process Instance should be SUCCESS

How to reproduce

I create a process use python . python code is "print(1)" .

Anything else

No response

Version

2.0.6

Are you willing to submit PR?

Code of Conduct

github-actions[bot] commented 2 years ago

Thank you for your feedback, we have received your issue, Please wait patiently for a reply.

MarsByteso0 commented 2 years ago

image image

SbloodyS commented 2 years ago

Could you provide the steps to reproduce this issue?

CainGao commented 2 years ago

I don't know how to reproduce this... It appears occasionally...

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

danielfree commented 2 years ago

@SbloodyS I've seen similar issues in 2.0.6, from the log it will show PROCESS_TIMEOUT. I suspect it might be related with un-necessary failover info appeared in the log.

instance id 153:

[INFO] 2022-09-08 23:30:00.543 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[208] - find one command: id: 154, type: SCHEDULER [INFO] 2022-09-08 23:30:00.546 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[221] - handle command end, command 154 process 153 start... [INFO] 2022-09-08 23:30:00.551 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1360] - add task to stand by list, task name:ssh-check, task id:0, task code:6565890718688 [INFO] 2022-09-08 23:30:00.553 org.apache.dolphinscheduler.service.process.ProcessService:[1088] - start submit task : ssh-check, instance id:153, state: RUNNING_EXECUTION .... ... [INFO] 2022-09-08 23:31:47.712 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[308] - process event: State Event :key: null type: TASK_STATE_CHANGE executeStatus: SUCCESS task instance id: 507 process instance id: 153 context: null [INFO] 2022-09-08 23:31:47.713 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[422] - work flow 153 task 507 state:SUCCESS [INFO] 2022-09-08 23:40:27.862 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[68] - failover execute started [INFO] 2022-09-08 23:40:27.865 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[74] - need failover hosts:[dolphin-master-1.dolphin-master-headless:5678] [INFO] 2022-09-08 23:40:27.869 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[424] - start master[dolphin-master-1.dolphin-master-headless:5678] failover, process list size:2 [INFO] 2022-09-08 23:40:27.871 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[456] - master[dolphin-master-1.dolphin-master-headless:5678] failover end, useTime:4ms [INFO] 2022-09-08 23:45:04.780 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[127] - handle process instance : 153 , events count:1 [INFO] 2022-09-08 23:45:04.780 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[130] - already exists handler process size:0 [INFO] 2022-09-08 23:45:04.780 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[308] - process event: State Event : key : null type: PROCESS_TIMEOUT executeStatus: null task instance id: 0 process instance id: 153 context: null [INFO] 2022-09-08 23:50:27.872 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[68] - failover execute started [INFO] 2022-09-08 23:50:27.875 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[74] - need failover hosts:[dolphin-master-1.dolphin-master-headless:5678] [INFO] 2022-09-08 23:50:27.879 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[424] - start master[dolphin-master-1.dolphin-master-headless:5678] failover, process list size:2 [INFO] 2022-09-08 23:50:27.881 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[456] - master[dolphin-master-1.dolphin-master-headless:5678] failover end, useTime:4ms

I'm deploying ds in k8s with two masters, my question is why this 'need failover' happens a lot? maybe related with some configs in zookeeper?

danielfree commented 2 years ago

@SbloodyS correct me if I'm wrong, from https://github.com/apache/dolphinscheduler/blob/2.0.6-release/dolphinscheduler-server/src/main/java/org/apache/dolphinscheduler/server/master/runner/FailoverExecuteThread.java#L95 seems the master server will always failover itself ? why is that?

brave-lee commented 2 years ago

I think this has nothing to do with the master failover. Can you provide the logs of the instance id 153 of the worker

danielfree commented 2 years ago

I think this has nothing to do with the master failover. Can you provide the logs of the instance id 153 of the worker

Here's another case with my running installation:

instance id: 211 task instance id: 672 logs from worker:

[INFO] 2022-09-16 07:50:01.165 org.apache.dolphinscheduler.server.worker.processor.TaskExecuteProcessor:[110] - received command : TaskExecuteRequestCommand{taskExecutionContext='{"taskInstanceId":672,"taskName":"flightradar_top_routes","firstSubmitTime":"2022-09-16 07:50:01","startTime":null,"endTime":null,"taskType":"SHELL","host":null,"executePath":null,"logPath":null,"taskJson":null,"processId":0,"processDefineCode":6629036562401,"processDefineVersion":3,"appIds":null,"processInstanceId":211,"scheduleTime":"2022-09-16 07:50:00","globalParams":null,"executorId":2,"cmdTypeIfComplement":6,"tenantCode":"aspex","queue":"default","projectCode":6565237764192,"taskParams":"***","envFile":null,"environmentConfig":null,"definedParams":null,"taskAppId":null,"taskTimeoutStrategy":null,"taskTimeout":2147483647,"workerGroup":"default","delayTime":0,"currentExecutionStatus":null,"sqlTaskExecutionContext":{"warningGroupId":0,"connectionParams":null,"udfFuncTenantCodeMap":null,"defaultFS":null},"dataxTaskExecutionContext":{"dataSourceId":0,"sourcetype":0,"sourceConnectionParams":null,"dataTargetId":0,"targetType":0,"targetConnectionParams":null},"dependenceTaskExecutionContext":null,"sqoopTaskExecutionContext":{"dataSourceId":0,"sourcetype":0,"sourceConnectionParams":null,"dataTargetId":0,"targetType":0,"targetConnectionParams":null},"varPool":null,"dryRun":0,"paramsMap":null,"procedureTaskExecutionContext":{"connectionParams":null}}'}
[INFO] 2022-09-16 07:50:01.167 org.apache.dolphinscheduler.common.utils.FileUtils:[121] - create dir success /tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672
[INFO] 2022-09-16 07:50:01.167 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[128] - task instance local execute path : /tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672
[INFO] 2022-09-16 07:50:01.167 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[146] - the task begins to execute. task instance id: 672
[INFO] 2022-09-16 07:50:01.265 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[60] - FINALIZE_SESSION
...
[INFO] 2022-09-16 07:50:01.287 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[138] - task execute path : /tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672
[INFO] 2022-09-16 07:50:01.291 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[178] - process start, process id is: 24435
07:50:42.439 [pool-2-thread-4] INFO org.apache.dolphinscheduler.server.log.LoggerRequestProcessor - received command : Command [type=VIEW_WHOLE_LOG_REQUEST, opaque=695701, genCommandTimeMillis=0, bodyLen=82]
[INFO] 2022-09-16 07:53:58.443 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[202] - process has exited, execute path:/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672, processId:24435 ,exitStatusCode:0 ,processWaitForStatus:true ,processExitValue:0
[INFO] 2022-09-16 07:53:58.444 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[206] - task instance id : 672,task final status : SUCCESS
[INFO] 2022-09-16 07:53:58.444 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[264] - develop mode is: false
[INFO] 2022-09-16 07:53:58.445 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[282] - exec local path: /tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672 cleared.
[INFO] 2022-09-16 07:53:59.196 org.apache.dolphinscheduler.common.utils.FileUtils:[121] - create dir success /tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672
[INFO] 2022-09-16 07:53:59.196 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[128] - task instance local execute path : /tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672
[INFO] 2022-09-16 07:53:59.196 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[146] - the task begins to execute. task instance id: 672

Note that at [INFO] 2022-09-16 07:53:59.196 the task instance 672 is executed again after a SUCCESS run

master logs:

[INFO] 2022-09-16 07:50:00.012 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[255] - find command 212, slot:0 :
[INFO] 2022-09-16 07:50:00.013 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[208] - find one command: id: 212, type: SCHEDULER
[INFO] 2022-09-16 07:50:00.016 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[221] - handle command end, command 212 process 211 start...
[INFO] 2022-09-16 07:50:00.025 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1360] - add task to stand by list, task name:ssh-check, task id:0, task code:6629036562272
[INFO] 2022-09-16 07:50:00.027 org.apache.dolphinscheduler.service.process.ProcessService:[1088] - start submit task : ssh-check, instance id:211, state: RUNNING_EXECUTION
[INFO] 2022-09-16 07:50:00.031 org.apache.dolphinscheduler.service.process.ProcessService:[1101] - end submit task to db successfully:671 ssh-check state:SUBMITTED_SUCCESS complete, instance id:211 state: RUNNING_EXECUTION  
[INFO] 2022-09-16 07:50:00.037 org.apache.dolphinscheduler.server.master.runner.task.CommonTaskProcessor:[164] - master submit success, task id:671, task name:ssh-check, process id:211
[INFO] 2022-09-16 07:50:00.039 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1374] - remove task from stand by list, id: 671 name:ssh-check
[INFO] 2022-09-16 07:50:00.042 org.apache.dolphinscheduler.server.master.processor.TaskAckProcessor:[71] - taskAckCommand : TaskExecuteAckCommand{taskInstanceId=671, startTime=Fri Sep 16 07:50:00 CST 2022, host='dolphin-worker-1.dolphin-worker-headless:1234', status=1, logPath='/opt/apache-dolphinscheduler-2.0.6-bin/logs/6629036562401_3/211/671.log', executePath='/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/671', processInstanceId='0'}
[INFO] 2022-09-16 07:50:00.042 org.apache.dolphinscheduler.server.master.processor.TaskAckProcessor:[71] - taskAckCommand : TaskExecuteAckCommand{taskInstanceId=671, startTime=Fri Sep 16 07:50:00 CST 2022, host='dolphin-worker-1.dolphin-worker-headless:1234', status=1, logPath='/opt/apache-dolphinscheduler-2.0.6-bin/logs/6629036562401_3/211/671.log', executePath='/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/671', processInstanceId='211'}
[INFO] 2022-09-16 07:50:00.964 org.apache.dolphinscheduler.server.master.processor.TaskResponseProcessor:[72] - received command : TaskExecuteResponseCommand{taskInstanceId=671, processInstanceId=211, status=7, startTime=null, endTime=Fri Sep 16 07:50:00 CST 2022, host=null, logPath=null, executePath=null, processId=24426, appIds='', varPool=[]}
[INFO] 2022-09-16 07:50:01.022 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[223] - already exists handler process size:0
[INFO] 2022-09-16 07:50:01.026 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[230] - persist events 211 succeeded.
[INFO] 2022-09-16 07:50:01.144 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[127] - handle process instance : 211 , events count:1
[INFO] 2022-09-16 07:50:01.145 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[130] - already exists handler process size:0
[INFO] 2022-09-16 07:50:01.145 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[308] - process event: State Event :key: null type: TASK_STATE_CHANGE executeStatus: SUCCESS task instance id: 671 process instance id: 211 context: null
[INFO] 2022-09-16 07:50:01.146 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[422] - work flow 211 task 671 state:SUCCESS 
[INFO] 2022-09-16 07:50:01.150 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1360] - add task to stand by list, task name:flightradar_top_routes, task id:0, task code:6629036562400
[INFO] 2022-09-16 07:50:01.150 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1033] - taskCode: 6629036562400 completeDependTaskList: [6629036562272]
[INFO] 2022-09-16 07:50:01.151 org.apache.dolphinscheduler.service.process.ProcessService:[1088] - start submit task : flightradar_top_routes, instance id:211, state: RUNNING_EXECUTION
[INFO] 2022-09-16 07:50:01.153 org.apache.dolphinscheduler.service.process.ProcessService:[1101] - end submit task to db successfully:672 flightradar_top_routes state:SUBMITTED_SUCCESS complete, instance id:211 state: RUNNING_EXECUTION  
[INFO] 2022-09-16 07:50:01.159 org.apache.dolphinscheduler.server.master.runner.task.CommonTaskProcessor:[164] - master submit success, task id:672, task name:flightradar_top_routes, process id:211
[INFO] 2022-09-16 07:50:01.161 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1374] - remove task from stand by list, id: 672 name:flightradar_top_routes
[INFO] 2022-09-16 07:50:01.168 org.apache.dolphinscheduler.server.master.processor.TaskAckProcessor:[71] - taskAckCommand : TaskExecuteAckCommand{taskInstanceId=672, startTime=Fri Sep 16 07:50:01 CST 2022, host='dolphin-worker-1.dolphin-worker-headless:1234', status=1, logPath='/opt/apache-dolphinscheduler-2.0.6-bin/logs/6629036562401_3/211/672.log', executePath='/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672', processInstanceId='0'}
[INFO] 2022-09-16 07:50:01.168 org.apache.dolphinscheduler.server.master.processor.TaskAckProcessor:[71] - taskAckCommand : TaskExecuteAckCommand{taskInstanceId=672, startTime=Fri Sep 16 07:50:01 CST 2022, host='dolphin-worker-1.dolphin-worker-headless:1234', status=1, logPath='/opt/apache-dolphinscheduler-2.0.6-bin/logs/6629036562401_3/211/672.log', executePath='/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672', processInstanceId='211'}
[INFO] 2022-09-16 07:50:02.022 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[223] - already exists handler process size:0
[INFO] 2022-09-16 07:50:02.025 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[230] - persist events 211 succeeded.
[INFO] 2022-09-16 07:50:41.424 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[68] - failover execute started
[INFO] 2022-09-16 07:50:41.426 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[74] - need failover hosts:[dolphin-master-1.dolphin-master-headless:5678]
[INFO] 2022-09-16 07:50:41.429 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[424] - start master[dolphin-master-1.dolphin-master-headless:5678] failover, process list size:2
[INFO] 2022-09-16 07:50:41.431 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[442] - failover task instance id: 672, process instance id: 211
[INFO] 2022-09-16 07:50:41.424 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[68] - failover execute started
[INFO] 2022-09-16 07:50:41.426 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[74] - need failover hosts:[dolphin-master-1.dolphin-master-headless:5678]
[INFO] 2022-09-16 07:50:41.429 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[424] - start master[dolphin-master-1.dolphin-master-headless:5678] failover, process list size:2
**[INFO] 2022-09-16 07:50:41.431** org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[442] - failover task instance id: 672, process instance id: 211
[INFO] 2022-09-16 07:50:42.432 org.apache.dolphinscheduler.service.log.LogClientService:[117] - view log path /opt/apache-dolphinscheduler-2.0.6-bin/logs/6629036562401_3/211/672.log
[INFO] 2022-09-16 07:50:42.441 org.apache.dolphinscheduler.remote.NettyRemotingClient:[390] - netty client closed
[INFO] 2022-09-16 07:50:42.445 org.apache.dolphinscheduler.service.log.LogClientService:[74] - logger client closed
[ERROR] 2022-09-16 07:50:42.445 org.apache.dolphinscheduler.server.utils.ProcessUtils:[211] - kill yarn job failure
java.lang.NullPointerException: null
    at org.apache.dolphinscheduler.server.utils.ProcessUtils.killYarnJob(ProcessUtils.java:197)
    at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverTaskInstance(MasterRegistryClient.java:490)
    at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverMaster(MasterRegistryClient.java:443)
    at org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread.run(FailoverExecuteThread.java:80)
[INFO] 2022-09-16 07:50:42.447 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[456] - master[dolphin-master-1.dolphin-master-headless:5678] failover end, useTime:1019ms
[INFO] 2022-09-16 07:50:43.147 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[127] - handle process instance : 211 , events count:1
[INFO] 2022-09-16 07:50:43.148 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[130] - already exists handler process size:0
[INFO] 2022-09-16 07:50:43.148 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[308] - process event: State Event :key: null type: TASK_STATE_CHANGE executeStatus: NEED_FAULT_TOLERANCE task instance id: 672 process instance id: 211 context: null
[INFO] 2022-09-16 07:50:43.149 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[422] - work flow 211 task 672 state:NEED_FAULT_TOLERANCE 
[INFO] 2022-09-16 07:50:43.150 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1360] - add task to stand by list, task name:flightradar_top_routes, task id:672, task code:6629036562400
**[INFO] 2022-09-16 07:53:58.445** org.apache.dolphinscheduler.server.master.processor.TaskResponseProcessor:[72] - received command : TaskExecuteResponseCommand{taskInstanceId=672, processInstanceId=211, status=7, startTime=null, endTime=Fri Sep 16 07:53:58 CST 2022, host=null, logPath=null, executePath=null, processId=24435, appIds='', varPool=[]}
**[INFO] 2022-09-16 07:53:59.041** org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[223] - already exists handler process size:0
[INFO] 2022-09-16 07:53:59.044 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[230] - persist events 211 succeeded.
[INFO] 2022-09-16 07:53:59.168 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[127] - handle process instance : 211 , events count:1
[INFO] 2022-09-16 07:53:59.168 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[130] - already exists handler process size:0
**[INFO] 2022-09-16 07:53:59.168** org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[308] - process event: State Event :key: null type: TASK_STATE_CHANGE executeStatus: SUCCESS task instance id: 672 process instance id: 211 context: null
[INFO] 2022-09-16 07:53:59.169 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[422] - work flow 211 task 672 state:SUCCESS 
[INFO] 2022-09-16 07:53:59.172 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1033] - taskCode: 6629036562400 completeDependTaskList: [6629036562272, 6629036562400]
[INFO] 2022-09-16 07:53:59.173 org.apache.dolphinscheduler.service.process.ProcessService:[1088] - start submit task : flightradar_top_routes, instance id:211, state: RUNNING_EXECUTION
[INFO] 2022-09-16 07:53:59.175 org.apache.dolphinscheduler.service.process.ProcessService:[1101] - end submit task to db successfully:672 flightradar_top_routes state:SUBMITTED_SUCCESS complete, instance id:211 state: RUNNING_EXECUTION

Note at [INFO] 2022-09-16 07:50:41.431 the master somehow starts to failover with running task 672 in worker, and in [INFO] 2022-09-16 07:53:59.168 the task status is SUCCESS but the master submits the task again in [INFO] 2022-09-16 07:53:59.173

SbloodyS commented 2 years ago

The root cause for master failover is encountered NPE. It can be seen that the kill yarn job failed from your log. Please check whether the configuration related to the yarn rest api is correct and whether the yarn service is running normally. @danielfree

danielfree commented 2 years ago

The root cause for master failover is encountered NPE. It can be seen that the kill yarn job failed from your log. Please check whether the configuration related to the yarn rest api is correct and whether the yarn service is running normally. @danielfree

thx for your reply, but this is running in k8s, there's no yarn env by default

SbloodyS commented 2 years ago

thx for your reply, but this is running in k8s, there's no yarn env by default

You can take a look at the NOTE part in the document. The unfriendly thing is that there is no switch to control the start and stop of this function. https://dolphinscheduler.apache.org/en-us/docs/latest/user_doc/guide/task/shell.html

danielfree commented 2 years ago

thx for your reply, but this is running in k8s, there's no yarn env by default

You can take a look at the NOTE part in the document. The unfriendly thing is that there is no switch to control the start and stop of this function. https://dolphinscheduler.apache.org/en-us/docs/latest/user_doc/guide/task/shell.html

interesting, I don't have any logs like application_xxx_xxx, the error traces back to https://github.com/apache/dolphinscheduler/blob/2.0.6-release/dolphinscheduler-server/src/main/java/org/apache/dolphinscheduler/server/utils/ProcessUtils.java#L197

brave-lee commented 2 years ago

I think this has nothing to do with the master failover. Can you provide the logs of the instance id 153 of the worker

Here's another case with my running installation:

instance id: 211 task instance id: 672 logs from worker:

[INFO] 2022-09-16 07:50:01.165 org.apache.dolphinscheduler.server.worker.processor.TaskExecuteProcessor:[110] - received command : TaskExecuteRequestCommand{taskExecutionContext='{"taskInstanceId":672,"taskName":"flightradar_top_routes","firstSubmitTime":"2022-09-16 07:50:01","startTime":null,"endTime":null,"taskType":"SHELL","host":null,"executePath":null,"logPath":null,"taskJson":null,"processId":0,"processDefineCode":6629036562401,"processDefineVersion":3,"appIds":null,"processInstanceId":211,"scheduleTime":"2022-09-16 07:50:00","globalParams":null,"executorId":2,"cmdTypeIfComplement":6,"tenantCode":"aspex","queue":"default","projectCode":6565237764192,"taskParams":"***","envFile":null,"environmentConfig":null,"definedParams":null,"taskAppId":null,"taskTimeoutStrategy":null,"taskTimeout":2147483647,"workerGroup":"default","delayTime":0,"currentExecutionStatus":null,"sqlTaskExecutionContext":{"warningGroupId":0,"connectionParams":null,"udfFuncTenantCodeMap":null,"defaultFS":null},"dataxTaskExecutionContext":{"dataSourceId":0,"sourcetype":0,"sourceConnectionParams":null,"dataTargetId":0,"targetType":0,"targetConnectionParams":null},"dependenceTaskExecutionContext":null,"sqoopTaskExecutionContext":{"dataSourceId":0,"sourcetype":0,"sourceConnectionParams":null,"dataTargetId":0,"targetType":0,"targetConnectionParams":null},"varPool":null,"dryRun":0,"paramsMap":null,"procedureTaskExecutionContext":{"connectionParams":null}}'}
[INFO] 2022-09-16 07:50:01.167 org.apache.dolphinscheduler.common.utils.FileUtils:[121] - create dir success /tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672
[INFO] 2022-09-16 07:50:01.167 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[128] - task instance local execute path : /tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672
[INFO] 2022-09-16 07:50:01.167 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[146] - the task begins to execute. task instance id: 672
[INFO] 2022-09-16 07:50:01.265 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[60] - FINALIZE_SESSION
...
[INFO] 2022-09-16 07:50:01.287 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[138] - task execute path : /tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672
[INFO] 2022-09-16 07:50:01.291 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[178] - process start, process id is: 24435
07:50:42.439 [pool-2-thread-4] INFO org.apache.dolphinscheduler.server.log.LoggerRequestProcessor - received command : Command [type=VIEW_WHOLE_LOG_REQUEST, opaque=695701, genCommandTimeMillis=0, bodyLen=82]
[INFO] 2022-09-16 07:53:58.443 TaskLogLogger-class org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[202] - process has exited, execute path:/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672, processId:24435 ,exitStatusCode:0 ,processWaitForStatus:true ,processExitValue:0
[INFO] 2022-09-16 07:53:58.444 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[206] - task instance id : 672,task final status : SUCCESS
[INFO] 2022-09-16 07:53:58.444 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[264] - develop mode is: false
[INFO] 2022-09-16 07:53:58.445 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[282] - exec local path: /tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672 cleared.
[INFO] 2022-09-16 07:53:59.196 org.apache.dolphinscheduler.common.utils.FileUtils:[121] - create dir success /tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672
[INFO] 2022-09-16 07:53:59.196 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[128] - task instance local execute path : /tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672
[INFO] 2022-09-16 07:53:59.196 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[146] - the task begins to execute. task instance id: 672

Note that at [INFO] 2022-09-16 07:53:59.196 the task instance 672 is executed again after a SUCCESS run

master logs:

[INFO] 2022-09-16 07:50:00.012 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[255] - find command 212, slot:0 :
[INFO] 2022-09-16 07:50:00.013 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[208] - find one command: id: 212, type: SCHEDULER
[INFO] 2022-09-16 07:50:00.016 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[221] - handle command end, command 212 process 211 start...
[INFO] 2022-09-16 07:50:00.025 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1360] - add task to stand by list, task name:ssh-check, task id:0, task code:6629036562272
[INFO] 2022-09-16 07:50:00.027 org.apache.dolphinscheduler.service.process.ProcessService:[1088] - start submit task : ssh-check, instance id:211, state: RUNNING_EXECUTION
[INFO] 2022-09-16 07:50:00.031 org.apache.dolphinscheduler.service.process.ProcessService:[1101] - end submit task to db successfully:671 ssh-check state:SUBMITTED_SUCCESS complete, instance id:211 state: RUNNING_EXECUTION  
[INFO] 2022-09-16 07:50:00.037 org.apache.dolphinscheduler.server.master.runner.task.CommonTaskProcessor:[164] - master submit success, task id:671, task name:ssh-check, process id:211
[INFO] 2022-09-16 07:50:00.039 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1374] - remove task from stand by list, id: 671 name:ssh-check
[INFO] 2022-09-16 07:50:00.042 org.apache.dolphinscheduler.server.master.processor.TaskAckProcessor:[71] - taskAckCommand : TaskExecuteAckCommand{taskInstanceId=671, startTime=Fri Sep 16 07:50:00 CST 2022, host='dolphin-worker-1.dolphin-worker-headless:1234', status=1, logPath='/opt/apache-dolphinscheduler-2.0.6-bin/logs/6629036562401_3/211/671.log', executePath='/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/671', processInstanceId='0'}
[INFO] 2022-09-16 07:50:00.042 org.apache.dolphinscheduler.server.master.processor.TaskAckProcessor:[71] - taskAckCommand : TaskExecuteAckCommand{taskInstanceId=671, startTime=Fri Sep 16 07:50:00 CST 2022, host='dolphin-worker-1.dolphin-worker-headless:1234', status=1, logPath='/opt/apache-dolphinscheduler-2.0.6-bin/logs/6629036562401_3/211/671.log', executePath='/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/671', processInstanceId='211'}
[INFO] 2022-09-16 07:50:00.964 org.apache.dolphinscheduler.server.master.processor.TaskResponseProcessor:[72] - received command : TaskExecuteResponseCommand{taskInstanceId=671, processInstanceId=211, status=7, startTime=null, endTime=Fri Sep 16 07:50:00 CST 2022, host=null, logPath=null, executePath=null, processId=24426, appIds='', varPool=[]}
[INFO] 2022-09-16 07:50:01.022 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[223] - already exists handler process size:0
[INFO] 2022-09-16 07:50:01.026 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[230] - persist events 211 succeeded.
[INFO] 2022-09-16 07:50:01.144 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[127] - handle process instance : 211 , events count:1
[INFO] 2022-09-16 07:50:01.145 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[130] - already exists handler process size:0
[INFO] 2022-09-16 07:50:01.145 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[308] - process event: State Event :key: null type: TASK_STATE_CHANGE executeStatus: SUCCESS task instance id: 671 process instance id: 211 context: null
[INFO] 2022-09-16 07:50:01.146 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[422] - work flow 211 task 671 state:SUCCESS 
[INFO] 2022-09-16 07:50:01.150 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1360] - add task to stand by list, task name:flightradar_top_routes, task id:0, task code:6629036562400
[INFO] 2022-09-16 07:50:01.150 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1033] - taskCode: 6629036562400 completeDependTaskList: [6629036562272]
[INFO] 2022-09-16 07:50:01.151 org.apache.dolphinscheduler.service.process.ProcessService:[1088] - start submit task : flightradar_top_routes, instance id:211, state: RUNNING_EXECUTION
[INFO] 2022-09-16 07:50:01.153 org.apache.dolphinscheduler.service.process.ProcessService:[1101] - end submit task to db successfully:672 flightradar_top_routes state:SUBMITTED_SUCCESS complete, instance id:211 state: RUNNING_EXECUTION  
[INFO] 2022-09-16 07:50:01.159 org.apache.dolphinscheduler.server.master.runner.task.CommonTaskProcessor:[164] - master submit success, task id:672, task name:flightradar_top_routes, process id:211
[INFO] 2022-09-16 07:50:01.161 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1374] - remove task from stand by list, id: 672 name:flightradar_top_routes
[INFO] 2022-09-16 07:50:01.168 org.apache.dolphinscheduler.server.master.processor.TaskAckProcessor:[71] - taskAckCommand : TaskExecuteAckCommand{taskInstanceId=672, startTime=Fri Sep 16 07:50:01 CST 2022, host='dolphin-worker-1.dolphin-worker-headless:1234', status=1, logPath='/opt/apache-dolphinscheduler-2.0.6-bin/logs/6629036562401_3/211/672.log', executePath='/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672', processInstanceId='0'}
[INFO] 2022-09-16 07:50:01.168 org.apache.dolphinscheduler.server.master.processor.TaskAckProcessor:[71] - taskAckCommand : TaskExecuteAckCommand{taskInstanceId=672, startTime=Fri Sep 16 07:50:01 CST 2022, host='dolphin-worker-1.dolphin-worker-headless:1234', status=1, logPath='/opt/apache-dolphinscheduler-2.0.6-bin/logs/6629036562401_3/211/672.log', executePath='/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672', processInstanceId='211'}
[INFO] 2022-09-16 07:50:02.022 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[223] - already exists handler process size:0
[INFO] 2022-09-16 07:50:02.025 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[230] - persist events 211 succeeded.
[INFO] 2022-09-16 07:50:41.424 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[68] - failover execute started
[INFO] 2022-09-16 07:50:41.426 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[74] - need failover hosts:[dolphin-master-1.dolphin-master-headless:5678]
[INFO] 2022-09-16 07:50:41.429 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[424] - start master[dolphin-master-1.dolphin-master-headless:5678] failover, process list size:2
[INFO] 2022-09-16 07:50:41.431 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[442] - failover task instance id: 672, process instance id: 211
[INFO] 2022-09-16 07:50:41.424 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[68] - failover execute started
[INFO] 2022-09-16 07:50:41.426 org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[74] - need failover hosts:[dolphin-master-1.dolphin-master-headless:5678]
[INFO] 2022-09-16 07:50:41.429 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[424] - start master[dolphin-master-1.dolphin-master-headless:5678] failover, process list size:2
**[INFO] 2022-09-16 07:50:41.431** org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[442] - failover task instance id: 672, process instance id: 211
[INFO] 2022-09-16 07:50:42.432 org.apache.dolphinscheduler.service.log.LogClientService:[117] - view log path /opt/apache-dolphinscheduler-2.0.6-bin/logs/6629036562401_3/211/672.log
[INFO] 2022-09-16 07:50:42.441 org.apache.dolphinscheduler.remote.NettyRemotingClient:[390] - netty client closed
[INFO] 2022-09-16 07:50:42.445 org.apache.dolphinscheduler.service.log.LogClientService:[74] - logger client closed
[ERROR] 2022-09-16 07:50:42.445 org.apache.dolphinscheduler.server.utils.ProcessUtils:[211] - kill yarn job failure
java.lang.NullPointerException: null
  at org.apache.dolphinscheduler.server.utils.ProcessUtils.killYarnJob(ProcessUtils.java:197)
  at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverTaskInstance(MasterRegistryClient.java:490)
  at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverMaster(MasterRegistryClient.java:443)
  at org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread.run(FailoverExecuteThread.java:80)
[INFO] 2022-09-16 07:50:42.447 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[456] - master[dolphin-master-1.dolphin-master-headless:5678] failover end, useTime:1019ms
[INFO] 2022-09-16 07:50:43.147 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[127] - handle process instance : 211 , events count:1
[INFO] 2022-09-16 07:50:43.148 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[130] - already exists handler process size:0
[INFO] 2022-09-16 07:50:43.148 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[308] - process event: State Event :key: null type: TASK_STATE_CHANGE executeStatus: NEED_FAULT_TOLERANCE task instance id: 672 process instance id: 211 context: null
[INFO] 2022-09-16 07:50:43.149 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[422] - work flow 211 task 672 state:NEED_FAULT_TOLERANCE 
[INFO] 2022-09-16 07:50:43.150 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1360] - add task to stand by list, task name:flightradar_top_routes, task id:672, task code:6629036562400
**[INFO] 2022-09-16 07:53:58.445** org.apache.dolphinscheduler.server.master.processor.TaskResponseProcessor:[72] - received command : TaskExecuteResponseCommand{taskInstanceId=672, processInstanceId=211, status=7, startTime=null, endTime=Fri Sep 16 07:53:58 CST 2022, host=null, logPath=null, executePath=null, processId=24435, appIds='', varPool=[]}
**[INFO] 2022-09-16 07:53:59.041** org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[223] - already exists handler process size:0
[INFO] 2022-09-16 07:53:59.044 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[230] - persist events 211 succeeded.
[INFO] 2022-09-16 07:53:59.168 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[127] - handle process instance : 211 , events count:1
[INFO] 2022-09-16 07:53:59.168 org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[130] - already exists handler process size:0
**[INFO] 2022-09-16 07:53:59.168** org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[308] - process event: State Event :key: null type: TASK_STATE_CHANGE executeStatus: SUCCESS task instance id: 672 process instance id: 211 context: null
[INFO] 2022-09-16 07:53:59.169 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[422] - work flow 211 task 672 state:SUCCESS 
[INFO] 2022-09-16 07:53:59.172 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1033] - taskCode: 6629036562400 completeDependTaskList: [6629036562272, 6629036562400]
[INFO] 2022-09-16 07:53:59.173 org.apache.dolphinscheduler.service.process.ProcessService:[1088] - start submit task : flightradar_top_routes, instance id:211, state: RUNNING_EXECUTION
[INFO] 2022-09-16 07:53:59.175 org.apache.dolphinscheduler.service.process.ProcessService:[1101] - end submit task to db successfully:672 flightradar_top_routes state:SUBMITTED_SUCCESS complete, instance id:211 state: RUNNING_EXECUTION

Note at [INFO] 2022-09-16 07:50:41.431 the master somehow starts to failover with running task 672 in worker, and in [INFO] 2022-09-16 07:53:59.168 the task status is SUCCESS but the master submits the task again in [INFO] 2022-09-16 07:53:59.173

The log taskInstance id 672 does not end

brave-lee commented 2 years ago

Judging from the log, it is found that the readyToSubmitTaskQueue will not be removed after the task succeeds, which is impossible in the 206 code. Please redeploy the code to try.

brave-lee commented 1 year ago

It cannot reproduce in the 207 code, I close it. If someone can reproduce, please submit new issue to discuss.