flowable / flowable-engine

A compact and highly efficient workflow and Business Process Management (BPM) platform for developers, system admins and business users.
https://www.flowable.org
Apache License 2.0

Endless job processing loop when using JMS/ActiveMQ #2442

Open shawnrusaw-wf opened 4 years ago

shawnrusaw-wf commented 4 years ago

Describe the bug

When using JMS/ActiveMQ as per https://flowable.com/open-source/docs/bpmn/ch18-Advanced/#message-queue-based-async-executor, jobs can get into an endless lock-failure/retry loop.

This initially surfaced as a lock failure, described in this forum thread: https://forum.flowable.org/t/update-on-act-ru-execution/4949. After deeper investigation, the stack trace included

 at org.flowable.common.engine.impl.interceptor.LogInterceptor.execute(LogInterceptor.java:30)
 at org.flowable.common.engine.impl.cfg.CommandExecutorImpl.execute(CommandExecutorImpl.java:56)
 at org.flowable.common.engine.impl.cfg.CommandExecutorImpl.execute(CommandExecutorImpl.java:51)
 at org.flowable.job.service.impl.asyncexecutor.ExecuteAsyncRunnable.lockJob(ExecuteAsyncRunnable.java:190)
 at org.flowable.job.service.impl.asyncexecutor.ExecuteAsyncRunnable.run(ExecuteAsyncRunnable.java:113)
 at org.flowable.spring.executor.jms.JobMessageListener.onMessage(JobMessageListener.java:40)

With a little debug logging we were able to track this down to a single job ID that retries hundreds of times per second.

Expected behavior

The job should (eventually) fail and be put into the ACT_RU_DEADLETTER_JOB table.

Code

We use the example https://flowable.com/open-source/docs/bpmn/ch18-Advanced/#message-queue-based-async-executor to configure our Async JMS integration.
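For reference, below is a condensed sketch of that configuration, written from memory of the linked docs page (plain Spring, not Spring Boot, with ActiveMQ). The broker URL, queue name and consumer count are placeholders, and the exact setter names should be double-checked against the 6.5 sources; the dataSource and transactionManager beans are assumed to be defined elsewhere.

```java
import javax.jms.ConnectionFactory;
import javax.sql.DataSource;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.command.ActiveMQQueue;
import org.flowable.spring.SpringProcessEngineConfiguration;
import org.flowable.spring.executor.jms.JobMessageListener;
import org.flowable.spring.executor.jms.MessageBasedJobManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jms.connection.CachingConnectionFactory;
import org.springframework.jms.core.JmsTemplate;
import org.springframework.jms.listener.DefaultMessageListenerContainer;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class AsyncExecutorJmsConfig {

    @Bean
    public SpringProcessEngineConfiguration processEngineConfiguration(
            DataSource dataSource, PlatformTransactionManager transactionManager) {
        SpringProcessEngineConfiguration configuration = new SpringProcessEngineConfiguration();
        configuration.setDataSource(dataSource);
        configuration.setTransactionManager(transactionManager);
        // Jobs are handed to the message broker instead of the thread-pool based async executor.
        configuration.setAsyncExecutorMessageQueueMode(true);
        configuration.setAsyncExecutorActivate(true);
        configuration.setJobManager(jobManager());
        return configuration;
    }

    @Bean
    public MessageBasedJobManager jobManager() {
        MessageBasedJobManager jobManager = new MessageBasedJobManager();
        jobManager.setJmsTemplate(jmsTemplate());
        return jobManager;
    }

    @Bean
    public ConnectionFactory connectionFactory() {
        ActiveMQConnectionFactory activeMQConnectionFactory = new ActiveMQConnectionFactory();
        activeMQConnectionFactory.setBrokerURL("tcp://localhost:61616"); // placeholder broker URL
        return new CachingConnectionFactory(activeMQConnectionFactory);
    }

    @Bean
    public JmsTemplate jmsTemplate() {
        JmsTemplate jmsTemplate = new JmsTemplate();
        jmsTemplate.setDefaultDestination(new ActiveMQQueue("flowable-jobs")); // placeholder queue name
        jmsTemplate.setConnectionFactory(connectionFactory());
        return jmsTemplate;
    }

    @Bean
    public JobMessageListener jobMessageListener(SpringProcessEngineConfiguration processEngineConfiguration) {
        JobMessageListener jobMessageListener = new JobMessageListener();
        jobMessageListener.setProcessEngineConfiguration(processEngineConfiguration);
        return jobMessageListener;
    }

    @Bean
    public DefaultMessageListenerContainer messageListenerContainer(JobMessageListener jobMessageListener) {
        DefaultMessageListenerContainer container = new DefaultMessageListenerContainer();
        container.setConnectionFactory(connectionFactory());
        container.setDestinationName("flowable-jobs");
        container.setMessageListener(jobMessageListener);
        container.setConcurrentConsumers(2); // placeholder consumer count
        return container;
    }
}
```

The parts relevant to this issue are the message-queue mode and the JobMessageListener, which is the entry point visible in the stack traces.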

Additional context

Flowable version: 6.5.0. Flowable runs in a Spring container, but NOT Spring Boot. The message broker is ActiveMQ.

shawnrusaw-wf commented 4 years ago

The execution the job points to is not committed to the database.

shawnrusaw-wf commented 4 years ago

We have not isolated a root cause yet. It does not happen all the time; once the bad ACT_RU_JOB row is deleted, things clear up. It has happened about 3 or 4 times in total over the last 2 weeks. At current load we have roughly 10 async jobs per second, which is not much. In certain respects it looks like an application-level deadlock in Flowable's custom locking code: the REV_ column was slowly increasing, but the number of retries per REV_ increment was in the tens of thousands.
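As an aside, rather than deleting the bad ACT_RU_JOB row by hand, something along these lines can be used to inspect the stuck job and push it to the dead letter table. This is a hypothetical helper built on the standard ManagementService API, not code we actually ran:

```java
import org.flowable.engine.ManagementService;
import org.flowable.job.api.Job;

// Hypothetical helper: look up a stuck async job by ID and move it to ACT_RU_DEADLETTER_JOB
// instead of deleting the row directly in the database.
public class StuckJobCleaner {

    private final ManagementService managementService;

    public StuckJobCleaner(ManagementService managementService) {
        this.managementService = managementService;
    }

    public void moveToDeadLetter(String jobId) {
        Job job = managementService.createJobQuery().jobId(jobId).singleResult();
        if (job != null) {
            // Keeps the job around for later analysis, which is the behaviour
            // we expected from the retry mechanism in the first place.
            managementService.moveJobToDeadLetterJob(jobId);
        }
    }
}
```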

shawnrusaw-wf commented 4 years ago

With a bit more debug logging we were able to find a pattern: for the misbehaving job IDs, MessageBasedJobManager.sendMessage was called approximately the same number of times as JobMessageListener.onMessage (tens of thousands of times). This may (or may not) be how Flowable normally handles lock retries.
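For reference, the kind of debug counting meant above could look roughly like the sketch below. It assumes MessageBasedJobManager exposes an overridable sendMessage(JobInfo) hook in 6.5 (worth verifying against your version); the class name is made up:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

import org.flowable.job.api.JobInfo;
import org.flowable.spring.executor.jms.MessageBasedJobManager;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical subclass used only to count how often a job message is sent per job ID.
// Assumes sendMessage(JobInfo) is the overridable hook in Flowable 6.5 -- verify before use.
public class CountingMessageBasedJobManager extends MessageBasedJobManager {

    private static final Logger LOGGER = LoggerFactory.getLogger(CountingMessageBasedJobManager.class);

    private final ConcurrentHashMap<String, AtomicLong> sendCounts = new ConcurrentHashMap<>();

    @Override
    protected void sendMessage(JobInfo job) {
        long count = sendCounts.computeIfAbsent(job.getId(), id -> new AtomicLong()).incrementAndGet();
        if (count % 1000 == 0) {
            LOGGER.warn("sendMessage called {} times for job {}", count, job.getId());
        }
        super.sendMessage(job);
    }
}
```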

shawnrusaw-wf commented 4 years ago

"Do you happen to know if the access to the failing process instance is limited to the async jobs only? Or are there other interactions (e.g. task completions) happening concurrently on those instances?" - I'm pretty sure we only use async jobs at this point. I will get confirmation from the owner of that part of the system.

EDIT: confirmed, we only use async jobs

shawnrusaw-wf commented 4 years ago

This is the relevant stack trace. We are now trying the MySQL JDBC driver to see if it is more stable.

org.apache.ibatis.exceptions.PersistenceException: 
### Error updating database.  Cause: java.sql.SQLException: (conn=7513859) Communications link failure with primary host redacted.rds.amazonaws.com:3306. Connection timed out
### The error may exist in org/flowable/db/mapping/entity/Execution.xml
### The error may involve org.flowable.engine.impl.persistence.entity.ExecutionEntityImpl.updateProcessInstanceLockTime-Inline
### The error occurred while setting parameters
### SQL: update ACT_RU_EXECUTION     set       LOCK_TIME_ = ?     where ID_ = ?       and (LOCK_TIME_ is null OR LOCK_TIME_ < ?)
### Cause: java.sql.SQLException: (conn=7513859) Communications link failure with primary host redacted.rds.amazonaws.com:3306. Connection timed out
 at org.apache.ibatis.exceptions.ExceptionFactory.wrapException(ExceptionFactory.java:30)
 at org.apache.ibatis.session.defaults.DefaultSqlSession.update(DefaultSqlSession.java:199)
 at org.flowable.common.engine.impl.db.DbSqlSession.update(DbSqlSession.java:107)
 at org.flowable.engine.impl.persistence.entity.data.impl.MybatisExecutionDataManager.updateProcessInstanceLockTime(MybatisExecutionDataManager.java:326)
 at org.flowable.engine.impl.persistence.entity.ExecutionEntityManagerImpl.updateProcessInstanceLockTime(ExecutionEntityManagerImpl.java:980)
 at org.flowable.engine.impl.cfg.DefaultInternalJobManager.lockJobScope(DefaultInternalJobManager.java:148)
 at org.flowable.job.service.impl.cmd.LockExclusiveJobCmd.execute(LockExclusiveJobCmd.java:56)
 at org.flowable.engine.impl.interceptor.CommandInvoker$1.run(CommandInvoker.java:51)
 at org.flowable.engine.impl.interceptor.CommandInvoker.executeOperation(CommandInvoker.java:93)
 at org.flowable.engine.impl.interceptor.CommandInvoker.executeOperations(CommandInvoker.java:72)
 at org.flowable.engine.impl.interceptor.CommandInvoker.execute(CommandInvoker.java:56)
 at org.flowable.engine.impl.interceptor.BpmnOverrideContextInterceptor.execute(BpmnOverrideContextInterceptor.java:25)
 at org.flowable.common.engine.impl.interceptor.TransactionContextInterceptor.execute(TransactionContextInterceptor.java:53)
 at org.flowable.common.engine.impl.interceptor.CommandContextInterceptor.execute(CommandContextInterceptor.java:72)
 at org.flowable.common.spring.SpringTransactionInterceptor.lambda$execute$0(SpringTransactionInterceptor.java:56)
 at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
 at org.flowable.common.spring.SpringTransactionInterceptor.execute(SpringTransactionInterceptor.java:56)
 at org.flowable.common.engine.impl.interceptor.LogInterceptor.execute(LogInterceptor.java:30)
 at org.flowable.common.engine.impl.cfg.CommandExecutorImpl.execute(CommandExecutorImpl.java:56)
 at org.flowable.common.engine.impl.cfg.CommandExecutorImpl.execute(CommandExecutorImpl.java:51)
 at org.flowable.job.service.impl.asyncexecutor.ExecuteAsyncRunnable.lockJob(ExecuteAsyncRunnable.java:190)
 at org.flowable.job.service.impl.asyncexecutor.ExecuteAsyncRunnable.run(ExecuteAsyncRunnable.java:113)
 at org.flowable.spring.executor.jms.JobMessageListener.onMessage(JobMessageListener.java:40)

I think the endless loop is caused by a timeout that occurs before the update count of 0 can be read at https://github.com/flowable/flowable-engine/blob/master/modules/flowable-engine/src/main/java/org/flowable/engine/impl/persistence/entity/data/impl/MybatisExecutionDataManager.java#L327.
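For context, here is a rough paraphrase, not the verbatim Flowable source, of what the locking code around that line does. The method names come from the stack trace above; executeLockUpdate is a placeholder for the MyBatis update call:

```java
import java.util.Date;

import org.flowable.common.engine.api.FlowableOptimisticLockingException;

// Rough paraphrase of the process instance locking done via
// MybatisExecutionDataManager.updateProcessInstanceLockTime (not verbatim source).
protected void updateProcessInstanceLockTime(String processInstanceId, Date lockTime, Date expirationTime) {
    // update ACT_RU_EXECUTION set LOCK_TIME_ = ? where ID_ = ? and (LOCK_TIME_ is null OR LOCK_TIME_ < ?)
    int updatedRows = executeLockUpdate(processInstanceId, lockTime, expirationTime); // placeholder

    if (updatedRows == 0) {
        // No row matched, so the engine assumes another owner still holds the lock.
        // In the JMS setup this surfaces as the lock failure in JobMessageListener.onMessage,
        // and, as observed above, the job message ends up being sent again.
        throw new FlowableOptimisticLockingException("Could not lock process instance");
    }
}
```

If the UPDATE actually reaches the database but the connection times out before the result is read, the row could be left holding a LOCK_TIME_ that the engine never recorded as a successful lock, which would match the retry pattern described above.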

shawnrusaw-wf commented 4 years ago

Switching to the MySQL connector (https://mvnrepository.com/artifact/mysql/mysql-connector-java) has mitigated this issue. There is still an issue with how async jobs behave when they "time out" while trying to acquire that lock, but this seems like a good workaround.

PinoEire commented 1 year ago

Hi there. I'm looking at Flowable's reliability and support to evaluate adopting it in a critical system.

I'm worried about finding bugs like this one that have no answer and are still open after a long time. Is there a reason for this?