copper-engine / copper-engine

COPPER - a high performance Java workflow engine
http://www.copper-engine.org/
Apache License 2.0

Offer new application restart options after crash? #67

Open theodiefenthal opened 7 years ago

theodiefenthal commented 7 years ago

I just tried out a small setup:

I create some workflows which send an asynchronous request to a remote JMS service and go to sleep for, let's say, 15 seconds. After 5 seconds, I let the application crash. After another 5 seconds, the server sent its responses.

As for JMS, the server can send responses to the message queue and doesn't care that the application crashes. When the application is relaunched, it can consume all the messages it should have got in the past.

If I restart the application after 20 seconds, COPPER will initialize and restart all old workflows. As a result, all workflows run into their timeout and (in my setup) fail. Once the engine has started up, I, as a developer, can start passing the "old messages" to COPPER (by calling engine.notify). By this time, the workflows have already run into their timeout, and the old responses are internally treated as early responses.

The problem is that I can't put the responses into COPPER before COPPER resumes with the old workflows.

So the question is whether and how we should offer a crash-restart routine on startup to the application developer?

We could provide an initialization callback in which the application can add responses before COPPER continues its own initialization by restarting workflows. In doing so, we should also offer a way to set the response timestamp, since COPPER internally always sets the response timestamp to the time when engine.notify() is called. (In the case of JMS, we could use the message timestamp.)

It is also questionable what should happen when the response arrived after 20 seconds but the workflow wanted to wait only 15 seconds. My feeling is that the answer didn't come in time, so the workflow should run as if no answer had arrived. However, COPPER's timeouts are always somewhat lazy and not fixed. A timeout of 15 seconds means that the workflow is not rerun before 15 seconds have passed, but it can also be rerun hours later if the application was too busy with other work. What do you think about this? Maybe another optional parameter for the wait method?
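To make the timestamp idea concrete, here is a minimal sketch of the decision I have in mind. It is plain Java and does not use any COPPER API: the class `LateResponseCheck`, the method `classify` and the sample times are all made up for illustration. It only assumes that the application knows when the workflow started waiting and can read the original message timestamp (for JMS, the message timestamp).

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of COPPER.
public final class LateResponseCheck {

    /** Result of comparing a response's original timestamp with the wait deadline. */
    public enum Outcome { IN_TIME, TIMED_OUT }

    /**
     * Classifies a response by the time it was originally produced,
     * not by the time it is handed to the engine after a restart.
     */
    public static Outcome classify(Instant waitStarted, Duration timeout, Instant responseTimestamp) {
        Instant deadline = waitStarted.plus(timeout);
        return responseTimestamp.isAfter(deadline) ? Outcome.TIMED_OUT : Outcome.IN_TIME;
    }

    public static void main(String[] args) {
        Instant waitStarted = Instant.parse("2017-01-01T12:00:00Z"); // made-up sample time
        Duration timeout = Duration.ofSeconds(15);

        // Response produced 10 s after the wait began, replayed only after the restart.
        System.out.println(classify(waitStarted, timeout, waitStarted.plusSeconds(10)));
        // Response produced 20 s after the wait began.
        System.out.println(classify(waitStarted, timeout, waitStarted.plusSeconds(20)));
    }
}
```

With such a check, a response that was produced within the 15 seconds would still count as in time even if it is delivered to the engine much later, while one produced after 20 seconds would be treated as a timeout.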

Another point in question: we want to keep COPPER as small as possible and implement only features that are required in production use. So do you think this is a production use case that will hit us in the future or not?

dmoebius commented 7 years ago

Hi Theo,

> I just tried out a small setup:
>
> I create some workflows which send an asynchronous request to a remote JMS service and go to sleep for, let's say, 15 seconds. After 5 seconds, I let the application crash.

Which application? The application containing the COPPER workflows, or the JMS service?

> After another 5 seconds, the server sent its responses.

I assume you mean the JMS server here.

> As for JMS, the server can send responses to the message queue and doesn't care that the application crashes. When the application is relaunched, it can consume all the messages it should have got in the past.
>
> If I restart the application after 20 seconds, COPPER will initialize and restart all old workflows. By this, the engine will let all workflows run into a timeout and let them (in my setup) fail. After the engine is started up, I, as a developer, can start putting the "old messages" to COPPER (Calling engine.notify).

If you notify() the "old messages", they are "new messages".

> By this time, the workflows already ran into their timeout and the old responses are internally treated as early responses.

Now I'm confused. What do you mean by "old responses"? Where did they come from? And which workflows run into a timeout, and why? You said you restarted COPPER after a crash. COPPER will just resume the workflows from their latest checkpoint. COPPER doesn't set any workflows into a "timeout" state; that's entirely your decision and depends on the implementation of your workflow logic.

> The problem is that I can't put the responses into COPPER before COPPER resumes with the old workflows.

I understand it like this: your JMS message queue still holds some responses from the JMS server which haven't yet been processed. When COPPER starts, you have no control over which one runs first: either the resumed existing COPPER workflows, or the new COPPER workflows which are initiated by the JMS queue processor. Is that correct?

It's true, you cannot influence the start order of workflows. COPPER is not designed to do this. Enforcing a certain order would mean serialisation, which would mean synchronisation, and COPPER doesn't have synchronisation. If you really want this, you must add some custom serialisation to your own workflow logic.

> So the question is whether and how we should offer a crash-restart routine on startup to the application developer?

As a last resort, if you look into the implementations of AbstractSqlDialect and its subclasses, you see that workflows are dequeued from COP_QUEUE using ORDER BY priority, last_mod_ts. So you can fiddle with those two fields directly in the database just before you start the COPPER engine. For example, you can give all queued workflow instances a higher priority:

UPDATE COP_QUEUE SET priority = priority + 1;

If you do this right before the engine starts, all existing workflow instances will run first, before any other newly created workflow instance gets selected.
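To see what ORDER BY priority, last_mod_ts implies for the dequeue order, here is a small in-memory model. It is only a sketch: `QueueEntry`, `DequeueOrderSketch` and the sample values are made up for illustration and are not COPPER classes, and it assumes plain ascending order on both columns (the SQL default when no direction is given), so rows with a smaller priority value are selected first and ties are broken by the older last_mod_ts.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of the dequeue order, not COPPER code.
public final class DequeueOrderSketch {

    /** Stand-in for one COP_QUEUE row. */
    public static final class QueueEntry {
        final String id;
        final int priority;
        final long lastModTs;

        public QueueEntry(String id, int priority, long lastModTs) {
            this.id = id;
            this.priority = priority;
            this.lastModTs = lastModTs;
        }
    }

    /** Mirrors "ORDER BY priority, last_mod_ts", assuming ascending order on both columns. */
    static final Comparator<QueueEntry> DEQUEUE_ORDER =
            Comparator.comparingInt((QueueEntry e) -> e.priority)
                      .thenComparingLong(e -> e.lastModTs);

    /** Returns the ids in the order in which the rows would be selected. */
    public static List<String> dequeueOrder(List<QueueEntry> rows) {
        List<QueueEntry> sorted = new ArrayList<>(rows);
        sorted.sort(DEQUEUE_ORDER);
        List<String> ids = new ArrayList<>();
        for (QueueEntry e : sorted) {
            ids.add(e.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<QueueEntry> rows = new ArrayList<>();
        rows.add(new QueueEntry("wf-c", 5, 300L));
        rows.add(new QueueEntry("wf-a", 4, 900L));
        rows.add(new QueueEntry("wf-b", 5, 100L));
        System.out.println(dequeueOrder(rows)); // prints [wf-a, wf-b, wf-c]
    }
}
```

Whether an update has to raise or lower the number to move rows to the front depends on the sort direction the dialect really uses, so it is worth checking the actual query of your SQL dialect before touching the table.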

> We could provide an initialization callback where the application can add responses before COPPER continues its own initialization with restarting workflows. By doing so, we should also offer the opportunity to set the response timestamp as COPPER internally always sets the response timestamp to the time when engine.notify() is called. (In the case of JMS we could use the message timestamp.)

No need for an initialization callback. As a developer, you always have control over when COPPER starts and what to do before that.

> It is also questionable what should happen when the response arrived after 20 seconds but the workflow wanted to wait only for 15 seconds. From my feelings, the answer didn't come in time so the workflow should run as if the answer didn't arrive in time.

This is clearly a timeout. The response arrives too late, so it gets discarded.

> However, COPPER's timeouts are always kind of lazy and not fixed. A timeout of 15 seconds means that the workflow is rerun not before 15 seconds have passed but can also be rerun after hours if the application was too busy with other stuff. What do you think about this?

I doubt this can happen. The processor pools wake up every 50 ms (default) to check whether a queued workflow exists (see PersistentPriorityProcessorPool). Did you really observe this?

> Maybe another optional parameter for the wait method?
>
> Another point in question is: We want to keep COPPER as small as possible and implement only features which are required by production usage. So do you think this is a production use case which will hit us in future or not?

Well, such a situation has never occurred in the last 5 years in our applications at SCOOP, so I doubt this.

Best regards,
Dirk