catalyst / moodle-tool_dataflows

A generic workflow and processing engine which can be configured to do a large variety of tasks.

Resolve how the execution engine handles aborts and exceptions. #88

Open · jaypha opened this issue 2 years ago

jaypha commented 2 years ago

Currently, the engine is designed to catch any exceptions, and then instruct all steps to abort, allowing them all to gracefully shut down.

The question then is what to do next. Should the engine re-throw the exception, or should it return control to the code that called it and leave the decision up to the caller?

Perhaps, more precisely, should this line exist or not? https://github.com/catalyst/moodle-tool_dataflows/blob/492a04a5309a80e955a2143df8d5fcc5d91e04fa/classes/execution/engine.php#L221
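
To make the two options concrete, here is a minimal sketch of the engine's top-level loop. The names are hypothetical (this is not the actual engine.php code); go() and abort() stand in for whatever the real step interface exposes:

```php
<?php
class engine {
    /** @var array Hypothetical list of step objects. */
    protected $steps = [];

    public function execute() {
        try {
            foreach ($this->steps as $step) {
                $step->go();
            }
        } catch (\Throwable $e) {
            // Instruct every step to abort so each can shut down gracefully.
            foreach ($this->steps as $step) {
                $step->abort();
            }
            // The open question: should this re-throw exist, or should the
            // engine simply return and let the caller inspect its status?
            throw $e;
        }
    }
}
```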

jaypha commented 2 years ago

For example, say a scheduled task is running a dataflow that reads from a database and sends the data to an external URL, but for some reason the external site is down and an exception is thrown.

The engine step catches this. It then sets the status to ABORTED and returns this to the engine. The engine calls abort on every step, allowing them to shut down gracefully.

Then the engine can do one of two things: it can re-throw the exception, or it can return control back to the task object. In the latter case, the task object would be required to check the engine's abort status and then decide for itself what to do.
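
A sketch of that second option, where the engine returns normally and the caller inspects its state. get_status(), STATUS_ABORTED and get_exception() are assumed names for illustration, not confirmed plugin API; mtrace() is Moodle's CLI/task output helper:

```php
<?php
function run_dataflow(engine $engine): void {
    // The engine catches internally and returns normally, even after an abort.
    $engine->execute();

    if ($engine->get_status() === engine::STATUS_ABORTED) {
        // The caller decides: log and carry on, re-queue, or surface the failure.
        $e = $engine->get_exception();
        mtrace('Dataflow aborted: ' . $e->getMessage());
    }
}
```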

keevan commented 2 years ago

Before the step aborts, I expect there to be some level of 'retry' configured on the step, but managed by the engine, which handles the re-run attempts (sketched below).

But after the step aborts, I would then expect the engine to check the dataflow's conditions to see whether the whole dataflow should stop based on that step aborting (not yet implemented, but I would expect some level of "try, but if you fail then continue anyway").

Once those options have been exhausted and the dataflow should be aborted, all operations are halted (e.g. graceful cleanup of steps). Then I would expect it to throw the exception all the way up to the caller, unless there is a good reason for the engine to handle the error itself.

The engine can be executed through a scheduled task, an ad hoc task, or directly (manually triggered). All of these callers should already have a way of gracefully displaying an exception if one is thrown.
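
As a sketch of the retry idea above, assuming a hypothetical per-step max-retries setting managed by the engine (none of this is existing API):

```php
<?php
function execute_step_with_retries(step $step, int $maxretries): void {
    $attempt = 0;
    while (true) {
        try {
            $step->go();
            return; // Success: no retry needed.
        } catch (\Throwable $e) {
            $attempt++;
            if ($attempt > $maxretries) {
                // Retries exhausted: hand over to the abort/exception path.
                throw $e;
            }
        }
    }
}
```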

--

Regarding whether or not that line should exist, I think it should, if only so the user or 'runner' can see that processing that dataflow resulted in an error. If a task processes several dataflows (e.g. a scheduled task), I would expect the dataflows that were okay to still run, instead of everything blowing up then and there, but that can be handled in the task itself, and the engine can still pass the exception along (see the sketch below).
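
Something like this, assuming the engine re-throws and a hypothetical task owns a list of dataflow objects:

```php
<?php
function run_all_dataflows(array $dataflows): void {
    foreach ($dataflows as $dataflow) {
        try {
            $engine = new engine($dataflow);
            $engine->execute(); // May re-throw on abort.
        } catch (\Throwable $e) {
            // One failed dataflow is logged; the remaining ones still run.
            mtrace("Dataflow '{$dataflow->name}' failed: " . $e->getMessage());
        }
    }
}
```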

jaypha commented 2 years ago

The ability to retry would be dependent on the step, and thus be part of the step design.

According to spec, once a step aborts, the whole dataflow must abort. If other parts of the dataflow are able to continue, then they set their state to CANCELLED.
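
A sketch of that cascade, with illustrative status constants (the real plugin's values may differ); on_abort here is the graceful-cleanup hook:

```php
<?php
class step_status {
    const ABORTED = 'aborted';     // The step that actually failed.
    const CANCELLED = 'cancelled'; // Steps halted because another step aborted.
}

function abort_dataflow(array $steps, $failedstep): void {
    foreach ($steps as $step) {
        $step->set_status($step === $failedstep
            ? step_status::ABORTED
            : step_status::CANCELLED);
        $step->on_abort(); // Graceful-cleanup hook.
    }
}
```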

jaypha commented 2 years ago

Pasting in Brendan's comment from the chat: "... my gut feel would be that if a step hits some unrecoverable state then it handles it, and then the last thing it does is tell the manager to abort. I don't think it needs to do much more. The other question is what happens when a step just explodes and doesn't call abort: do we want the manager to gracefully handle it, in a similar way to how the task manager handles this and does the faildelay logic? On the other hand, if the latter is implemented, maybe the former is redundant."
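
For reference, a sketch of the faildelay idea, loosely modelled on how Moodle's task manager backs off a repeatedly failing task. The doubling and cap mirror the task manager's behaviour, but nothing here is existing dataflows API:

```php
<?php
function next_fail_delay(int $currentdelay): int {
    // Start at 60 seconds, double on each consecutive failure, and cap at
    // 24 hours, mirroring the task manager's back-off behaviour.
    return $currentdelay === 0 ? 60 : min($currentdelay * 2, 86400);
}
```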

keevan commented 2 years ago

> the other question is what happens when a step just explodes and doesn't call abort, do we want the manager to gracefully handle it

If we can't trust the step to call abort, perhaps the exception should bubble up to the engine, and the engine should call abort itself? The step then only needs to throw exceptions, and handle anything it should in its on_abort callback.

And yes, if that's the definition of aborting, then an exception causing it should also ensure the step cleans up after itself. There's probably a need to differentiate between an expected and handled abort (e.g. if things don't line up, just stop the dataflow, as part of the step logic) versus something unexpected going wrong that wasn't caught by the step, in which case abort the dataflow.
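
A sketch of that distinction, where the engine guarantees the abort path runs whether or not the step requested it; step_aborted_exception is a hypothetical marker class:

```php
<?php
class step_aborted_exception extends \Exception {}

function run_step(engine $engine, step $step): void {
    try {
        $step->go();
    } catch (step_aborted_exception $e) {
        // Expected: the step decided to stop the dataflow as part of its logic.
        $engine->abort($step, $e);
    } catch (\Throwable $e) {
        // Unexpected: the step exploded without calling abort. The engine
        // still triggers the abort path so every step's on_abort cleanup runs.
        $engine->abort($step, $e);
        throw $e;
    }
}
```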