Netflix / conductor

Conductor is a microservices orchestration engine.
Apache License 2.0

Subworkflow got COMPLETED but main workflow is still in RUNNING state #1955

Closed: AnnsPhilip closed 3 years ago

AnnsPhilip commented 3 years ago

We are using the v2.29.0 release with Redis and Elasticsearch, and we are facing an issue after executing a workflow: the subworkflow is COMPLETED, but the main workflow is still in the RUNNING state. Is there any configuration that needs to be added, or am I missing something? Attachments: subworkflow, MainWorkflow

james-deee commented 3 years ago

Are you manually marking the "task" as completed in any way? I think that if you just mark the Subworkflow task as COMPLETED, that doesn't mark the subworkflow itself as completed.

AnnsPhilip commented 3 years ago

Thank you for the reply, @james-deee. We are not updating the task manually. Each subworkflow task is linked to a worker, and when the worker completes, the task status automatically changes to COMPLETED.

james-deee commented 3 years ago

Hmmm. Alright, well we haven't experienced this yet on the latest version, sorry I don't have a better answer. You could try debugging the server.



AnnsPhilip commented 3 years ago

Yes, @james-deee. We are trying to debug the server.

AnnsPhilip commented 3 years ago

@james-deee Is there any Redis queue size configuration? I am getting the warning below in the logs:

18995820 [pool-21-thread-1] WARN com.netflix.dyno.queues.redis.RedisDynoQueue - cannot add d728a994-4f9c-41e3-902a-12dd6ff36827 to the unack shard conductor_queues.test.UNACK._deciderQueue.c

AnnsPhilip commented 3 years ago

Below is the config.properties configuration used by our Netflix Conductor server:

conductor.jetty.server.enabled=true

conductor.grpc.server.enabled=false

db=redis

workflow.dynomite.cluster.hosts=redis:6379:us-east-1c

workflow.namespace.prefix=conductor

workflow.namespace.queue.prefix=conductor_queues

queues.dynomite.threads=10

workflow.dynomite.connection.maxConnsPerHost=31

queues.dynomite.nonQuorum.port=6379

workflow.elasticsearch.instanceType=external

workflow.elasticsearch.async.dao.worker.queue.size=500

workflow.elasticsearch.async.dao.max.pool.size=500

async.indexing.enabled=true

workflow.elasticsearch.url=elasticsearch:9300

workflow.elasticsearch.index.name=conductor

workflow.owner.email.mandatory=false

kishorebanala commented 3 years ago

@AnnsPhilip Can you please provide the subworkflow and parent workflow state, along with the logs from the time the subworkflow completed?

Upon subworkflow completion, the subworkflow task should be marked completed. And, if the subworkflow task is the last task in parent workflow, it should complete as well.

If the subworkflow task is not being updated on Subworkflow completion, it'd help to debug if there are any errors in updating the subworkflow task.
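The expected propagation described above can be sketched as a toy model. This is not Conductor code: the `Task`, `Workflow`, `decide`, and `on_sub_workflow_completed` names are hypothetical and only illustrate the rule that a finished subworkflow marks its task in the parent as completed, and that the parent completes once all of its tasks have completed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    name: str
    status: str = "IN_PROGRESS"


@dataclass
class Workflow:
    name: str
    status: str = "RUNNING"
    tasks: List[Task] = field(default_factory=list)


def decide(workflow: Workflow) -> None:
    """Mark the workflow COMPLETED once every one of its tasks is COMPLETED."""
    if workflow.tasks and all(t.status == "COMPLETED" for t in workflow.tasks):
        workflow.status = "COMPLETED"


def on_sub_workflow_completed(parent: Workflow, sub_task: Task) -> None:
    """Hypothetical hook: the subworkflow finished, so update the parent's task."""
    sub_task.status = "COMPLETED"
    decide(parent)  # re-evaluate the parent after the task update


# The subworkflow task is the only (and therefore last) task in the parent.
parent = Workflow("main", tasks=[Task("sub_workflow_task")])
on_sub_workflow_completed(parent, parent.tasks[0])
print(parent.status)
```

In this model, if the update of the subworkflow task never happens (the failure suspected in this thread), `decide` is never reached and the parent stays in RUNNING.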

AnnsPhilip commented 3 years ago

@kishorebanala Attaching main workflow and subworkflow screenshots. Order_608655315960955689 is the main workflow, which is in the RUNNING state, and CREATE_SUBSCRIBER_INS is the task (it is a subworkflow), which is in the IN-PROGRESS state. We had added a trigger from the subworkflow class (on IN-PROGRESS and on COMPLETED/FAILED of the subworkflow) to our application to update the subworkflow status in our table. In the logs we can see the subworkflow start trigger, but the COMPLETED logs are not there (we are not receiving any trigger from the subworkflow class). Out of 100 workflows, 10 to 20 are getting stuck. Attachments: NetflixLogs.txt, subworkflow2, MainWorkflow2

AnnsPhilip commented 3 years ago

@kishorebanala Any update based on my last comment?

kishorebanala commented 3 years ago

@AnnsPhilip From what you've mentioned, it looks like the Subworkflow's complete workflow logic is stopping here, and thus the subsequent workflow status listener is not invoked.

I couldn't trace why the updateParentWorkflowTask call is failing from the logs you've provided. I'm also not sure which workflow/task this log corresponds to:

4789847 [system-task-worker-6] INFO com.netflix.conductor.core.execution.WorkflowExecutor - Task SUB_WORKFLOW/c5c2f688-0732-44d5-9d8d-422f4504c2a9 was already completed.

Can you help trace the reason for updateParentWorkflowTask failure in L#863 above by checking why the update task is failing, and if possible by providing specific logs for this error, correlating it to the relevant sub workflow and task executions please. I'll try and reproduce this meanwhile.

AnnsPhilip commented 3 years ago

@kishorebanala Thank you for the reply. We had added an IN-PROGRESS trigger to our application at the end of the start method and a COMPLETED trigger at the end of the execute method of this class. As per our observation, the execute method ran before the IN-PROGRESS trigger had completed, while workflowId was still null, and hence the execute method returned false. Attaching the logs: SubworkflowLogs.txt

I suspect issue is due to the addition of IN-PROGRESS trigger code.
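One plausible reading of this race can be sketched as a toy model. It is not Conductor code, and it rests on a labeled assumption: that the task (including its subworkflow ID) is only persisted after the start method returns, so a blocking trigger at the end of start delays that write. All names here (`ToyStore`, `start`, `execute`, `run`, `simulate`) are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Optional


@dataclass
class Task:
    task_id: str
    status: str = "SCHEDULED"
    sub_workflow_id: Optional[str] = None


class ToyStore:
    """Stands in for the persistence layer; keeps the last saved copy of each task."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Task] = {}

    def save(self, task: Task) -> None:
        self._tasks[task.task_id] = replace(task)

    def load(self, task_id: str) -> Task:
        return replace(self._tasks[task_id])


def start(task: Task, on_in_progress: Optional[Callable[[Task], None]]) -> None:
    task.sub_workflow_id = "sub-" + task.task_id
    task.status = "IN_PROGRESS"
    if on_in_progress is not None:
        on_in_progress(task)  # custom trigger, blocks until the application responds


def execute(task: Task) -> bool:
    if task.sub_workflow_id is None:
        return False  # nothing to check yet, mirrors the "returned false" observation
    task.status = "COMPLETED"
    return True


def run(store: ToyStore, task_id: str,
        on_in_progress: Optional[Callable[[Task], None]]) -> None:
    task = store.load(task_id)
    start(task, on_in_progress)
    store.save(task)  # ASSUMPTION: the ID becomes visible only after start returns


def simulate(with_trigger: bool) -> bool:
    store = ToyStore()
    store.save(Task("t1"))
    results = []

    def slow_trigger(_task: Task) -> None:
        # While the trigger blocks, another worker runs execute on the persisted task.
        results.append(execute(store.load("t1")))

    run(store, "t1", slow_trigger if with_trigger else None)
    if not with_trigger:
        results.append(execute(store.load("t1")))
    return results[0]


print(simulate(with_trigger=True), simulate(with_trigger=False))
```

With the trigger in place, `execute` sees a task without a subworkflow ID and returns false. Without the trigger, the ID is saved before `execute` runs.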

kishorebanala commented 3 years ago

> I suspect issue is due to the addition of IN-PROGRESS trigger code.

Hey @AnnsPhilip, I'm not sure I follow. Is this custom code or a callback added to the Sub-workflow's start method that gets in the way of execution?

techyragu commented 3 years ago

@AnnsPhilip Is this issue resolved?

AnnsPhilip commented 3 years ago

@techyragu @kishorebanala The issue was solved after removing the IN-PROGRESS trigger code. @kishorebanala @james-deee Thank you for the support. Closing the issue.

Akash47 commented 10 months ago

@AnnsPhilip Can you share details on where this IN-PROGRESS trigger code is present? Is it custom code (implemented via a worker) or Conductor code?