botify-labs / simpleflow

Python library for dataflow programming.
https://botify-labs.github.com/simpleflow/
MIT License

Heartbeat timeout on long running tasks #359

Open nstott opened 5 years ago

nstott commented 5 years ago

Hi all, I'm relatively new to simpleflow and am having some trouble understanding what the best practice is for long-running jobs.

My workflow consists of a few tasks, one of which runs an external process to crunch some data and can take anywhere between 1 and 2 hours.

When this long task is running, the worker doesn't seem to send heartbeats, so I've set the heartbeat timeout to an unreasonably large value so that the SWF task doesn't fail due to a timeout.

The problem I'm having is that my worker processes periodically crash (OOM, or other general Kubernetes malfeasance), and because of the long heartbeat timeout, the workflow doesn't retry the failed task until that timeout finally expires.

I'm looking for a way to keep sending heartbeats while the worker is occupied, or some other way to retry quickly when a worker fails. I'm not sure what the right pattern is for this.
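
To make the first option concrete, here is a rough sketch of what I have in mind: a background thread that keeps calling a heartbeat function while the external process runs. This is plain Python, not simpleflow's API. `run_with_heartbeat` and `send_heartbeat` are names I made up for illustration, and I don't know how (or whether) a simpleflow task can get at its SWF task token.

```python
import subprocess
import sys
import threading


def run_with_heartbeat(cmd, send_heartbeat, interval=30.0):
    """Run cmd as a subprocess and call send_heartbeat() every
    `interval` seconds until the subprocess exits.

    `send_heartbeat` is a hypothetical zero-argument callable. With
    boto3 it could wrap
    swf_client.record_activity_task_heartbeat(taskToken=...),
    assuming the task token is available to the task.
    """
    stop = threading.Event()

    def beat():
        # Event.wait() returns False on timeout and True once stop is set,
        # so this loop fires once per interval and exits promptly on stop.
        while not stop.wait(interval):
            send_heartbeat()

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        # check=True raises CalledProcessError on a non-zero exit code.
        return subprocess.run(cmd, check=True)
    finally:
        stop.set()
        thread.join()


if __name__ == "__main__":
    # Stand-in for the real 1-2 hour job: a child process that sleeps briefly.
    run_with_heartbeat(
        [sys.executable, "-c", "import time; time.sleep(1)"],
        send_heartbeat=lambda: print("heartbeat"),
        interval=0.3,
    )
```

Since the heartbeat thread lives in the worker process, it stops as soon as the worker is killed, so the heartbeat timeout could be set to a few multiples of `interval` rather than to the full task duration.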

I'm not sure whether this is related to #239.