flowable / flowable-engine

A compact and highly efficient workflow and Business Process Management (BPM) platform for developers, system admins and business users.
https://www.flowable.org
Apache License 2.0
8.01k stars 2.63k forks source link

Async jobs do not resume following application restart #3848

Closed Sbruk closed 9 months ago

Sbruk commented 9 months ago

Hello Flowable team!

Describe the bug It looks like async service tasks stored in act_ru_job do not properly resume when the application is restarted. Their rev_ keeps increasing indefinitely but the task is never executed and the logs are filled with the following:

2024-02-15 08:34:09,579 DEBUG [org.flo.job.ser.imp.asy.AcquireAsyncJobsDueRunnable] (flowable-bpmn-acquire-async-jobs) async job acquisition for engine bpmn, thread woke up
2024-02-15 08:34:09,579 DEBUG [org.flo.job.ser.imp.per.ent.Job.selectJobsToExecute] (flowable-bpmn-acquire-async-jobs) ==>  Preparing: SELECT RES.* from ACT_RU_JOB RES WHERE SCOPE_TYPE_ is null and LOCK_EXP_TIME_ is null order by RES.ID_ asc LIMIT ? OFFSET ?
2024-02-15 08:34:09,579 DEBUG [org.flo.job.ser.imp.per.ent.Job.selectJobsToExecute] (flowable-bpmn-acquire-async-jobs) ==> Parameters: 512(Integer), 0(Integer)
2024-02-15 08:34:09,583 DEBUG [org.flo.job.ser.imp.per.ent.Job.selectJobsToExecute] (flowable-bpmn-acquire-async-jobs) <==      Total: 1
2024-02-15 08:34:09,584 DEBUG [org.flo.job.ser.imp.per.ent.Job.updateJob] (flowable-bpmn-acquire-async-jobs) ==>  Preparing: update ACT_RU_JOB SET REV_ = ?, LOCK_EXP_TIME_ = ?, LOCK_OWNER_ = ? where ID_= ? and REV_ = ?
2024-02-15 08:34:09,584 DEBUG [org.flo.job.ser.imp.per.ent.Job.updateJob] (flowable-bpmn-acquire-async-jobs) ==> Parameters: 304(Integer), 2024-02-15 09:34:09.583(Timestamp), 9e7f18d3-fb56-41d5-8aa3-5393987694c2(String), 97234(String), 303(Integer)
2024-02-15 08:34:09,585 DEBUG [org.flo.job.ser.imp.per.ent.Job.updateJob] (flowable-bpmn-acquire-async-jobs) <==    Updates: 1
2024-02-15 08:34:09,588 DEBUG [org.flo.job.ser.imp.asy.AcquireAsyncJobsDueRunnable] (flowable-bpmn-acquire-async-jobs) Jobs acquired: 1, rejected: 0, for engine bpmn
2024-02-15 08:34:09,588 DEBUG [org.flo.job.ser.imp.asy.AcquireAsyncJobsDueRunnable] (flowable-bpmn-acquire-async-jobs) acquired and queued new jobs for engine bpmn; sleeping for 10000 ms
2024-02-15 08:34:09,588 DEBUG [org.flo.job.ser.imp.asy.AcquireAsyncJobsDueRunnable] (flowable-bpmn-acquire-async-jobs) async job acquisition for engine bpmn, thread sleeping for 10000 millis
2024-02-15 08:34:09,588 DEBUG [org.flo.job.ser.imp.cmd.LockExclusiveJobCmd] (flowable-async-job-executor-thread-55) Executing lock exclusive job 97234 97230
2024-02-15 08:34:09,592 DEBUG [org.flo.job.ser.imp.asy.ExecuteAsyncRunnable] (flowable-async-job-executor-thread-55) Could not lock exclusive job. Unlocking job so it can be acquired again. Caught exception: Could not lock process instance

The issue seems to be related to the entry in act_ru_job being locked twice by flowable, the second attempt therefore throwing a FlowableOptimisticLockingException. The first one is done in the following method of AcquireAsyncJobsDueRunnable.java:

protected long acquireAndExecuteJobs(CommandExecutor commandExecutor, int remainingCapacity)

The second one in the following method of ExecuteAsyncRunnable.java, which is called by AcquireAsyncJobsDueRunnable.offerJobs():

protected boolean lockJob()

Expected behavior I would expect existing async tasks to be resumed and executed properly when the application is restarted.

Code N/A

Additional context I'm working with the following stack:

filiphr commented 9 months ago

@Sbruk from what I can see in the lock it seems like the job cannot get executed because the process instance is exclusively locked:

2024-02-15 08:34:09,588 DEBUG [org.flo.job.ser.imp.cmd.LockExclusiveJobCmd] (flowable-async-job-executor-thread-55) Executing lock exclusive job 97234 97230
2024-02-15 08:34:09,592 DEBUG [org.flo.job.ser.imp.asy.ExecuteAsyncRunnable] (flowable-async-job-executor-thread-55) Could not lock exclusive job. Unlocking job so it can be acquired again. Caught exception: Could not lock process instance

This means that either there are multiple async exclusive jobs for a single process instance or the process instance is locked, but it has never been released.

Locking a job from act_ru_job is not what is happening.

protected long acquireAndExecuteJobs(CommandExecutor commandExecutor, int remainingCapacity)

this is indeed locking a job for executing it. Then in ExecuteAsyncRunnable#lockJob the lock is done to do an exclusive lock on the process instance if the job is an exclusive lock.

How many jobs are there for the process instance of the execution with id 97230?

Sbruk commented 9 months ago

Hello @filiphr, Thanks for your prompt answer!

Following the restart, for a given process_instance_id_ there is exactly one entry in act_ru_job. Please find below some additional information:

Thanks again for your help!

filiphr commented 9 months ago

I see that you are using some dummy model. Is this easily reproducible? Can you perhaps create a minimal example that we can run and test?

My serviceTask does not define explicitly the flag exclusive and is using a JavaDelegate class for its execution.

The default is that it is exclusive, so it you do not set it then it is true. It is an opt-out value. So you need to put exclusive to be false for it to be disable.

In any case, can you perhaps also share the act_ru_execution table. I assume that it would have the lock_time_ and lock_owner_ columns set.

Since you are not using Spring Boot. Are you properly closing the engine when the application shuts down? There is ProcessEngine#close that would do

if (asyncExecutor != null) {
    commandExecutor.execute(new ClearProcessInstanceLockTimesCmd(asyncExecutor.getLockOwner()));
}

that would clear the lock time on the process instances. If you wait for 1 hour those jobs would eventually get executed, 1 hour is the default lock time for the jobs, which is the same time the process instance would get locked.

Sbruk commented 9 months ago

Indeed, I've missed the lock on the act_ru_execution table during my test / manual operations. I've done some extra tests this morning following your feedback and it looks all good now. Regarding the shutdown, I'm more interested in the case of an unexpected shutdown so I cannot guarantee the execution of the close method anyway... Thanks a lot for your help @filiphr !

filiphr commented 9 months ago

Glad that it got resolved for you @Sbruk.

Indeed if the application shuts down without going through all the close methods it can be in a state that you are seeing. It will eventually be cleared up, after 1 hour passes. I mentioned the close, because the jobs should have been locked as well (if there was an unexpected shutdown).