Broker starts extra workers when existing workers are under heavy load

ansoncfit commented 6 years ago

In an example we're seeing now, testing 1000 simulated schedules, worker throughput is about 5-8 tasks per minute. At that rate, a full batch of MAX_TASKS_PER_WORKER (16) tasks takes roughly 2 to 3 minutes. If a worker does not complete those 16 tasks within WORKER_RECORD_DURATION_MSEC (2 minutes), the broker purges it from the worker catalog. If no workers for a given worker category are in the worker catalog when the broker tries to distribute tasks, it starts a new on-demand instance, even though the existing worker is still alive and busy.

If the broker repeatedly starts extra on-demand instances, it could easily hit the AWS limit (20 on-demand instances), preventing other users from starting workers.

To mitigate this, purgeDeadWorkers() could check workerObservation.workerStatus.loadAverage and exempt busy workers from purging. A more thorough solution would be to change the main AnalystWorker polling behavior in R5, making workers check in with the broker regularly even if they are not ready for additional tasks.
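
A rough sketch of the first mitigation, to make the idea concrete. This is not the actual broker code: the class WorkerCatalogSketch, the flattened WorkerObservation, the BUSY_LOAD_AVERAGE threshold and the HARD_LIMIT_MSEC cap are all assumptions made for illustration. The hard cap is there because a worker that dies while busy would otherwise never be purged.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the broker's worker catalog.
class WorkerCatalogSketch {
    static final long WORKER_RECORD_DURATION_MSEC = 2 * 60 * 1000;
    // Assumed threshold: at or above this load average a worker counts as busy.
    static final double BUSY_LOAD_AVERAGE = 1.0;
    // Assumed upper bound: even a busy worker is dropped after this much silence.
    static final long HARD_LIMIT_MSEC = 3 * WORKER_RECORD_DURATION_MSEC;

    // Simplified observation: last contact time plus the reported load average.
    static class WorkerObservation {
        final long lastSeenMsec;
        final double loadAverage;

        WorkerObservation(long lastSeenMsec, double loadAverage) {
            this.lastSeenMsec = lastSeenMsec;
            this.loadAverage = loadAverage;
        }
    }

    private final Map<String, WorkerObservation> observationsByWorkerId = new HashMap<>();

    // Called whenever a worker contacts the broker.
    void record(String workerId, long nowMsec, double loadAverage) {
        observationsByWorkerId.put(workerId, new WorkerObservation(nowMsec, loadAverage));
    }

    boolean isRegistered(String workerId) {
        return observationsByWorkerId.containsKey(workerId);
    }

    // Removes silent workers, but spares busy ones until the hard limit is reached.
    // Returns the IDs of the workers that were purged.
    List<String> purgeDeadWorkers(long nowMsec) {
        List<String> purged = new ArrayList<>();
        Iterator<Map.Entry<String, WorkerObservation>> it =
                observationsByWorkerId.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, WorkerObservation> entry = it.next();
            long silentMsec = nowMsec - entry.getValue().lastSeenMsec;
            boolean busy = entry.getValue().loadAverage >= BUSY_LOAD_AVERAGE;
            boolean expired = silentMsec > WORKER_RECORD_DURATION_MSEC && !busy;
            boolean presumedDead = silentMsec > HARD_LIMIT_MSEC;
            if (expired || presumedDead) {
                purged.add(entry.getKey());
                it.remove();
            }
        }
        return purged;
    }
}
```

One caveat with this approach: the load average is only as fresh as the worker's last contact, so the broker is still guessing about a worker it has not heard from.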

ansoncfit commented 6 years ago

Related discussion in https://github.com/conveyal/r5/issues/417

abyrd commented 6 years ago

As discussed in a meeting yesterday, a simpler and perhaps more robust way to address this might be to require all workers to report in once every N seconds even when they don't need any tasks. This is covered by #167.
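
For illustration, a minimal sketch of what a periodic check-in on the worker side could look like. It is not the R5 AnalystWorker code: WorkerHeartbeatSketch, freeTaskSlots and the reportToBroker callback are hypothetical names, and the callback stands in for whatever request the worker would actually send. The point is only that the report goes out on a fixed schedule, asking for zero tasks while the worker is saturated.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

// Hypothetical worker-side heartbeat, independent of task completion.
class WorkerHeartbeatSketch {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Number of tasks the worker could accept right now; 0 while it is saturated.
    final AtomicInteger freeTaskSlots = new AtomicInteger(0);

    // Reports to the broker every periodMsec, even when no tasks are wanted.
    // Note: scheduleAtFixedRate stops rescheduling if the callback throws,
    // so a real implementation would catch and log errors inside the task.
    void start(long periodMsec, IntConsumer reportToBroker) {
        scheduler.scheduleAtFixedRate(
                () -> reportToBroker.accept(freeTaskSlots.get()),
                0, periodMsec, TimeUnit.MILLISECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```

With something like this in place, the broker's purge logic could stay as simple as it is now, since a live worker would always have been seen within the last N seconds.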

conveyal / analysis-backend

Broker starts extra workers when existing workers are under heavy load #165