Closed ansoncfit closed 5 years ago
Related discussion in https://github.com/conveyal/r5/issues/417
As discussed in a meeting yesterday, a simpler and perhaps more robust way to address this might be to require all workers to report in once every N seconds even when they don't need any tasks. This is covered by #167.
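To make the "report in once every N seconds" idea concrete, here is a minimal sketch of the decision a worker could make on each pass of its polling loop. It is not the actual R5 `AnalystWorker` code: the class `CheckInPolicySketch`, the `Action` enum, and the interval parameter are all hypothetical names introduced for illustration.

```java
// Hypothetical sketch: decides what a worker should send to the broker on each loop pass.
// None of these names exist in R5; they only illustrate the proposed check-in behavior.
final class CheckInPolicySketch {

    enum Action { REQUEST_TASKS, REPORT_STATUS_ONLY, WAIT }

    private final long checkInIntervalMsec; // the "N seconds" from the discussion, in msec
    private long lastContactMsec;           // last time this worker contacted the broker

    CheckInPolicySketch(long checkInIntervalMsec, long startMsec) {
        this.checkInIntervalMsec = checkInIntervalMsec;
        this.lastContactMsec = startMsec;
    }

    Action nextAction(long nowMsec, boolean hasSpareCapacity) {
        if (hasSpareCapacity) {
            // Normal polling: asking for tasks also tells the broker the worker is alive.
            lastContactMsec = nowMsec;
            return Action.REQUEST_TASKS;
        }
        if (nowMsec - lastContactMsec >= checkInIntervalMsec) {
            // Busy, but silent for too long: report status without asking for tasks.
            lastContactMsec = nowMsec;
            return Action.REPORT_STATUS_ONLY;
        }
        return Action.WAIT;
    }
}
```

With an interval shorter than the broker's purge window, a busy worker would still contact the broker often enough to stay in the catalog.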
In an example we're seeing now, testing 1000 simulated schedules, worker throughput is about 5-8 tasks per minute. If a worker does not complete `MAX_TASKS_PER_WORKER` (16) tasks in `WORKER_RECORD_DURATION_MSEC` (2 minutes), the broker purges it from the worker catalog. At 5-8 tasks per minute, 16 tasks take 2 minutes or more, so a healthy but busy worker can be purged. If no workers for a given worker category are in the worker catalog when the broker tries to distribute tasks, it starts a new on-demand instance. If the broker repeatedly starts extra on-demand instances, it could easily hit the AWS limit (20 on-demand instances), preventing other users from starting workers.
To mitigate this, `purgeDeadWorkers()` could check `workerObservation.workerStatus.loadAverage` and exempt busy workers from purging. A more thorough solution would be to change the main AnalystWorker polling behavior in R5, making workers check in with the broker regularly even if they are not ready for additional tasks.
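A rough sketch of the first mitigation follows. It is not the real broker code: `WorkerCatalogSketch` is a hypothetical stand-in for the worker catalog, the nested `workerStatus.loadAverage` field is flattened into a single `loadAverage` field, and both the busy threshold and the hard cap on silence are assumptions added so that a crashed worker whose last report showed high load is still removed eventually.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the broker's worker catalog; not the actual R5 class.
class WorkerCatalogSketch {

    static final long WORKER_RECORD_DURATION_MSEC = 2 * 60 * 1000; // 2 minutes, per the issue
    static final double BUSY_LOAD_AVERAGE = 1.0;                   // assumed threshold
    static final long MAX_SILENCE_MSEC = 10 * 60 * 1000;           // assumed hard cap

    // Simplified observation: the last contact time and the last reported load average.
    static final class WorkerObservation {
        final String workerId;
        final long lastSeenMsec;
        final double loadAverage;

        WorkerObservation(String workerId, long lastSeenMsec, double loadAverage) {
            this.workerId = workerId;
            this.lastSeenMsec = lastSeenMsec;
            this.loadAverage = loadAverage;
        }
    }

    private final Map<String, WorkerObservation> observations = new HashMap<>();

    void record(WorkerObservation observation) {
        observations.put(observation.workerId, observation);
    }

    boolean contains(String workerId) {
        return observations.containsKey(workerId);
    }

    // Removes stale workers, but keeps ones that last reported a high load average
    // until they have been silent longer than the hard cap. Returns the purged IDs.
    List<String> purgeDeadWorkers(long nowMsec) {
        List<String> purged = new ArrayList<>();
        Iterator<WorkerObservation> it = observations.values().iterator();
        while (it.hasNext()) {
            WorkerObservation o = it.next();
            long silenceMsec = nowMsec - o.lastSeenMsec;
            boolean stale = silenceMsec > WORKER_RECORD_DURATION_MSEC;
            boolean busy = o.loadAverage >= BUSY_LOAD_AVERAGE;
            boolean pastHardCap = silenceMsec > MAX_SILENCE_MSEC;
            if (stale && (!busy || pastHardCap)) {
                it.remove();
                purged.add(o.workerId);
            }
        }
        return purged;
    }
}
```

Note that the load average here is only as fresh as the worker's last report, which is one reason the regular check-in approach is likely the more robust fix.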