conveyal / analysis-backend

Server component of Conveyal Analysis
http://conveyal.com/analysis
MIT License

Just a few tasks left unfinished in regional jobs #144

Closed: abyrd closed this issue 5 years ago

abyrd commented 6 years ago

Recently we've been seeing regional jobs which stall at almost 100% with just a few tasks left unfinished.

This is likely caused by workers shutting down while there are still tasks to be completed. It is probably a timing issue, an interaction between the definition of worker idle time and the length of time it takes to process a task.

This problem likely appeared after the workers' internal queue was lengthened.

First, we need to ensure that workers do not shut down while they still have tasks in their queue or are actively processing a task. We can add an explicit check of the queue alongside the idle-time check. We don't want to rely on either check alone, in case workers are stuck idle with work still in the queue, or haven't finished a task in a while but still have a supply of work.
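A minimal sketch of that combined shutdown condition, assuming a hypothetical worker state (the queue, active-task counter, and field names here are illustrative stand-ins for the worker's real internals; `REGIONAL_KEEPALIVE_MINUTES` matches the constant mentioned later in this thread):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class WorkerShutdownCheck {
    static final int REGIONAL_KEEPALIVE_MINUTES = 1;

    final BlockingQueue<Object> taskQueue = new LinkedBlockingQueue<>();
    final AtomicInteger activeTasks = new AtomicInteger(0);
    volatile long lastTaskCompletedAt = System.currentTimeMillis();

    /**
     * Shut down only when the worker has been idle past the keepalive
     * threshold AND has no queued or actively processing work.
     */
    boolean shouldShutDown(long nowMillis) {
        boolean idleTooLong =
            nowMillis - lastTaskCompletedAt > REGIONAL_KEEPALIVE_MINUTES * 60_000L;
        return idleTooLong && taskQueue.isEmpty() && activeTasks.get() == 0;
    }
}
```

Requiring all three conditions means neither a stalled queue consumer nor a long-running task can trigger a premature shutdown on its own.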

Second, we need to add a retry mechanism so that when workers are killed by something beyond our control (which can easily happen with EC2 spot instances), their tasks are redelivered. There are already mechanisms that enable redelivery.
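One common shape for such a retry mechanism is broker-side task leases: each delivered task gets a lease, and if no result arrives before the lease expires (for example because a spot instance was reclaimed), the task becomes eligible for redelivery. A hedged sketch, with all class and method names hypothetical rather than taken from the actual broker code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TaskLeaseTracker {
    private final long leaseMillis;
    // taskId -> time the task was handed to a worker
    private final Map<Integer, Long> deliveredAt = new HashMap<>();

    public TaskLeaseTracker(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /** Record that a task was handed out to a worker. */
    public void markDelivered(int taskId, long nowMillis) {
        deliveredAt.put(taskId, nowMillis);
    }

    /** A result came back; the task is done and needs no redelivery. */
    public void markCompleted(int taskId) {
        deliveredAt.remove(taskId);
    }

    /** Tasks whose lease has expired and should be handed out again. */
    public List<Integer> expiredTasks(long nowMillis) {
        List<Integer> expired = new ArrayList<>();
        for (Map.Entry<Integer, Long> e : deliveredAt.entrySet()) {
            if (nowMillis - e.getValue() > leaseMillis) {
                expired.add(e.getKey());
            }
        }
        return expired;
    }
}
```

The lease length would need to comfortably exceed the worst-case per-task processing time so that slow-but-alive workers are not double-served.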

abyrd commented 6 years ago

The above diagnosis is only theoretical. In practice, I'm seeing that per-task times are not much more than 10 seconds, so they are probably not stalling the whole worker past the shutdown threshold of REGIONAL_KEEPALIVE_MINUTES = 1. In addition, we have just observed jobs stalling while over 300 workers were seen to be still alive and idle.

So the tasks must be lost at some other point; the main candidate is where their results are returned to the broker. Digging through the code, we see that by default the Spark framework creates a pool of 200 threads for the backend HTTP handler. It's conceivable that 300 workers polling a 200-thread pool in a dense part of the region could lead to timeouts or errors, causing some results to be lost.
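If the thread pool is the bottleneck, one mitigation would be enlarging it via the Spark framework's `threadPool` configuration call, which must be invoked before any routes are registered. A sketch only; the port and pool sizes below are illustrative assumptions, not tested recommendations, and the goal is simply a pool comfortably larger than the expected number of concurrently polling workers:

```java
import static spark.Spark.port;
import static spark.Spark.threadPool;

public class BackendHttpServer {
    public static void main(String[] args) {
        port(7070);  // illustrative port, not necessarily the backend's real one

        // maxThreads, minThreads, idle timeout in milliseconds.
        // 400 max threads would cover 300+ simultaneously polling workers.
        threadPool(400, 8, 60_000);

        // ... register broker HTTP handlers here ...
    }
}
```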