Adobe-Consulting-Services / acs-aem-commons

http://adobe-consulting-services.github.io/acs-aem-commons/
Apache License 2.0

Bulk Workflow Manager | ACS AEM Commons Version : 2.1.2 #682

Closed purnendra closed 7 years ago

purnendra commented 8 years ago

ACS AEM Commons Version: 2.1.2. Bulk Workflow Manager runs go into the "Stopped" state very frequently after the recent installation of SP1 on AEM 6.1. It looks like it tries to terminate a workflow on timeout but finds that the workflow is already finished.

Error : `26.03.2016 22:24:38.844 *ERROR* [pool-7-thread-3] com.adobe.acs.commons.workflow.bulk.impl.BulkWorkflowEngineImpl Error processing periodic execution: {}
java.lang.IllegalStateException: Workflow is already finished.
    at com.adobe.granite.workflow.core.WorkflowSessionImpl.terminateWorkflow(WorkflowSessionImpl.java:450)
    at com.day.cq.workflow.impl.CQWorkflowSessionWrapper.terminateWorkflow(CQWorkflowSessionWrapper.java:312)
    at com.adobe.acs.commons.workflow.bulk.impl.BulkWorkflowEngineImpl.terminateActiveWorkflows(BulkWorkflowEngineImpl.java:662)
    at com.adobe.acs.commons.workflow.bulk.impl.BulkWorkflowEngineImpl.access$500(BulkWorkflowEngineImpl.java:71)
    at com.adobe.acs.commons.workflow.bulk.impl.BulkWorkflowEngineImpl$1.run(BulkWorkflowEngineImpl.java:272)
    at org.apache.sling.commons.scheduler.impl.QuartzJobExecutor.execute(QuartzJobExecutor.java:115)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
26.03.2016 22:24:38.844 *INFO* [pool-7-thread-3] com.adobe.acs.commons.workflow.bulk.impl.BulkWorkflowEngineImpl Bulk Workflow Manager stopped for [ /etc/acs-commons/bulk-workflow-manager/di-dam-update-week-13-2014/jcr:content ]`

NielsInc commented 8 years ago

Hi @davidjgonzalez I can also reproduce this issue: every two batches or so, I get the same exception.

23.05.2016 09:41:09.996 *ERROR* [pool-7-thread-3] com.adobe.acs.commons.workflow.bulk.impl.BulkWorkflowEngineImpl Error processing periodic execution: {}
java.lang.IllegalStateException: Workflow is already finished.
        at com.adobe.granite.workflow.core.WorkflowSessionImpl.terminateWorkflow(WorkflowSessionImpl.java:450)
        at com.day.cq.workflow.impl.CQWorkflowSessionWrapper.terminateWorkflow(CQWorkflowSessionWrapper.java:312)
        at com.adobe.acs.commons.workflow.bulk.impl.BulkWorkflowEngineImpl.terminateActiveWorkflows(BulkWorkflowEngineImpl.java:662)
        at com.adobe.acs.commons.workflow.bulk.impl.BulkWorkflowEngineImpl.access$500(BulkWorkflowEngineImpl.java:71)
        at com.adobe.acs.commons.workflow.bulk.impl.BulkWorkflowEngineImpl$1.run(BulkWorkflowEngineImpl.java:272)
        at org.apache.sling.commons.scheduler.impl.QuartzJobExecutor.execute(QuartzJobExecutor.java:115)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
23.05.2016 09:41:09.996 *INFO* [pool-7-thread-3] com.adobe.acs.commons.workflow.bulk.impl.BulkWorkflowEngineImpl Bulk Workflow Manager stopped for [ /etc/acs-commons/bulk-workflow-manager/migration-asset-renditions/jcr:content ]

Also, when resuming the workflow, I get the error message ERROR RESUMING BULK WORKFLOW PROCESS, but the process resumes without issues until the next "Workflow is already finished" exception.

Any idea where I might start looking to find the issue?

davidjgonzalez commented 8 years ago

@NielsInc what version of Oak are you running? IIRC this started happening to @purnendra after he upgraded his Oak version.

It seems like the problem is a lag in how the WF API gets the WF state (might be async query based?). I worked with him on a patched release that did a direct check of the WF instance resource for the state, but IIRC that had some funky issues as well. @purnendra do you recall what the outcome of that approach was?

@NielsInc are you running an OOTB WF, or something custom?
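
For illustration, here is a minimal sketch of one way a terminate step could tolerate a stale state read: each terminate call is guarded individually, so a workflow that finished in the meantime is skipped instead of aborting the whole periodic run. This is not the ACS AEM Commons code or the Granite API; `TolerantTerminateSketch`, `WorkflowTerminator` and `terminateAll` are hypothetical names, and the only behaviour assumed from the logs above is that terminating a finished workflow throws `IllegalStateException`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; none of these names come from ACS AEM Commons or Granite.
class TolerantTerminateSketch {

    /** Stand-in for a workflow session; only the behaviour seen in the stack trace is modelled. */
    interface WorkflowTerminator {
        // Assumed to throw IllegalStateException when the workflow has already finished.
        void terminate(String workflowId);
    }

    /**
     * Tries to terminate every workflow in the list. A workflow that finished
     * between the (possibly stale) status read and this call is skipped
     * instead of aborting the whole run.
     *
     * @return the ids that were actually terminated
     */
    static List<String> terminateAll(WorkflowTerminator session, List<String> workflowIds) {
        List<String> terminated = new ArrayList<>();
        for (String id : workflowIds) {
            try {
                session.terminate(id);
                terminated.add(id);
            } catch (IllegalStateException e) {
                // Already finished: nothing left to terminate, keep going.
                System.out.println("Skipping " + id + ": " + e.getMessage());
            }
        }
        return terminated;
    }

    public static void main(String[] args) {
        // Fake session: "wf-2" completed after its status was read.
        WorkflowTerminator fake = id -> {
            if (id.equals("wf-2")) {
                throw new IllegalStateException("Workflow is already finished.");
            }
        };
        List<String> result = terminateAll(fake, Arrays.asList("wf-1", "wf-2", "wf-3"));
        System.out.println("Terminated: " + result);
    }
}
```

Catching the exception per workflow keeps the remaining workflows in the batch eligible for termination; whether that is acceptable in a real engine depends on what else `IllegalStateException` can signal there.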

purnendra commented 8 years ago

@davidjgonzalez That patch did not help, so I had to patch it myself to remove the terminating mechanism from the code. We took a risk: we just processed the next batch without waiting for or checking the status of the old workflow instances. @NielsInc I saw that behaviour only after we upgraded to Oak 1.2.11. Which version of Oak are you on?

NielsInc commented 8 years ago

@davidjgonzalez @purnendra We are running on Oak 1.2.7 (Service Pack 1). Running the OOTB Dam Update WF. I'll take a look at the terminating mechanism as purnendra suggested.

NielsInc commented 8 years ago

@purnendra So basically you commented out the following section? Instead of trying to terminate all workflows that are still running, you keep them active and carry on with the next batch?

public void run() {
    //...

    if (batchTimeoutCount >= batchTimeout) {
        terminateActiveWorkflows(adminResourceResolver,
                contentResource,
                activeWorkflows);
        // Next batch will be pulled on next iteration
    }
    //...
}
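
To make the question concrete, the workaround as described (give up on a timed-out batch without terminating it) might look roughly like the sketch below. It is a simplified stand-alone model, not the `BulkWorkflowEngineImpl` code: `BatchTimeoutSketch`, `tick` and `getAbandoned` are hypothetical, and the point at which the counter is incremented is an assumption, since the snippet above does not show it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the periodic check; not the BulkWorkflowEngineImpl code.
class BatchTimeoutSketch {
    private final int batchTimeout;     // polls to wait before giving up on a batch
    private int batchTimeoutCount = 0;  // polls spent on the current batch
    private final List<String> abandoned = new ArrayList<>();

    BatchTimeoutSketch(int batchTimeout) {
        this.batchTimeout = batchTimeout;
    }

    /** One scheduler tick. Returns true when the caller should pull the next batch. */
    boolean tick(List<String> activeWorkflows) {
        if (activeWorkflows.isEmpty()) {
            batchTimeoutCount = 0;
            return true; // batch finished normally
        }
        batchTimeoutCount++; // assumption: counted once per tick
        if (batchTimeoutCount >= batchTimeout) {
            // Workaround: record the stragglers and move on, with no terminate call.
            abandoned.addAll(activeWorkflows);
            batchTimeoutCount = 0;
            return true;
        }
        return false; // keep waiting for the current batch
    }

    List<String> getAbandoned() {
        return abandoned;
    }

    public static void main(String[] args) {
        BatchTimeoutSketch sketch = new BatchTimeoutSketch(2);
        List<String> active = Arrays.asList("wf-1");
        System.out.println("Next batch after tick 1? " + sketch.tick(active));
        System.out.println("Next batch after tick 2? " + sketch.tick(active));
        System.out.println("Left running: " + sketch.getAbandoned());
    }
}
```

The trade-off is the risk already mentioned in this thread: abandoned workflows keep running alongside the next batch, so the number of concurrent workflows is no longer bounded by the batch size.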

NielsInc commented 8 years ago

@davidjgonzalez Could this issue be related to the fact that our DAM assets are stored on S3? Maybe the communication with S3 is done differently than when the assets are stored on the same server.

@purnendra Are you also using S3 storage for assets?

davidjgonzalez commented 8 years ago

@NielsInc have you tried an older version of ACS Commons? Has some version consistently worked?

IIRC in @purnendra's case, it was working fine until they upgraded Oak (no idea if this is just correlation or causation).

I've been working on a rewrite of BWM to let it support synthetic WF (which IMO is the right way to handle non-business workflows). I can't recall where I left it (I want to say I was just testing and hadn't found any big bugs), but I could see about cutting a snapshot release for you to test if you have lower environments (I wouldn't recommend running it on stage/prod since I haven't fully tested it).

badvision commented 7 years ago

@purnendra @davidjgonzalez is this issue addressed in the BWM 2 rewrite?

davidjgonzalez commented 7 years ago

Not sure; I haven't heard of it happening since, so I assume it's fixed? Though maybe no one is using the AEM WF engine anymore either?

davidjgonzalez commented 7 years ago

This has been fixed.