max_clusters_in_pool, pool_timeout_minutes, and bypassing of pool_wait_minutes

Cluster pooling tries to ensure jobs re-use existing clusters rather than creating their own. However, this can go wrong when multiple jobs start simultaneously (the "thundering herd" problem).

This patch adds the max_clusters_in_pool option (fixes #2192), which checks whether the pool is already "full" of clusters before creating one. If it is, the job waits until a cluster is available to join or until one or more clusters in the pool terminate. To handle the "thundering herd" problem, once we've determined that we're allowed to create a new cluster, we wait a random number of seconds and double-check. This is controlled by the pool_jitter_seconds option (fixes #2200).
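The check-wait-recheck idea can be sketched roughly as follows. This is an illustrative sketch, not mrjob's actual code: `may_create_cluster` and the `count_clusters` callback are hypothetical names, and treating `None` as "no limit" and drawing the delay uniformly from 0 to pool_jitter_seconds are assumptions.

```python
import random
import time


def may_create_cluster(count_clusters, max_clusters_in_pool,
                       pool_jitter_seconds, sleep=time.sleep):
    """Return True if the pool still has room after a random delay.

    count_clusters is a hypothetical callback that returns the number
    of clusters currently in the pool.
    """
    def has_room():
        # Assumption: None means the pool size is unlimited.
        return (max_clusters_in_pool is None or
                count_clusters() < max_clusters_in_pool)

    if not has_room():
        return False

    # Sleep a random amount so jobs that started together spread out.
    sleep(random.uniform(0, pool_jitter_seconds))

    # Double-check: another job may have filled the pool while we slept.
    return has_room()
```

If the second check fails, the job would go back to waiting for a cluster to join rather than creating one.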

The pool_wait_minutes option also handles the "thundering herd" problem badly, since every job will wait even when there isn't any cluster to wait for. mrjob will now bypass pool_wait_minutes if there are no active clusters that match our job's pool name and hash (fixes #2198). As with max_clusters_in_pool, we wait a random number of seconds and double-check before launching our own cluster.
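As a rough illustration of the bypass condition, the sketch below checks whether any active cluster matches the job's pool name and hash. The dict keys (`pool_name`, `pool_hash`, `active`) and the function name are hypothetical stand-ins, not mrjob's real data model.

```python
def has_cluster_to_wait_for(clusters, pool_name, pool_hash):
    """Return True if any active cluster matches our pool name and hash.

    clusters is a list of plain dicts with hypothetical keys; a real
    implementation would inspect cluster descriptions from the cloud API.
    """
    return any(
        c.get('active') and
        c.get('pool_name') == pool_name and
        c.get('pool_hash') == pool_hash
        for c in clusters
    )
```

When this returns False, there is nothing worth waiting for, so the job can skip pool_wait_minutes.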

Previously, the pool_wait_minutes option was interpreted to mean we should refuse to wait any longer than this many minutes. Now we wait at least this many minutes before launching a new cluster (unless there is no cluster we could plausibly wait for).
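The new meaning of pool_wait_minutes can be expressed as a small predicate. This is only a sketch of the rule described above; the function and parameter names are hypothetical.

```python
def should_launch_new_cluster(minutes_waited, pool_wait_minutes,
                              cluster_to_wait_for):
    """Decide whether a job may stop waiting and launch its own cluster.

    cluster_to_wait_for is True when some existing cluster could
    plausibly become available for this job to join.
    """
    if not cluster_to_wait_for:
        # Nothing to wait for, so bypass pool_wait_minutes entirely.
        return True

    # Otherwise, wait at least pool_wait_minutes before launching.
    return minutes_waited >= pool_wait_minutes
```

For example, with pool_wait_minutes set to 10, a job that has waited 3 minutes keeps waiting if a matching cluster exists, but launches right away (subject to the jitter double-check) if none does.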

Finally, this change adds the pool_timeout_minutes option, which causes mrjob to raise an exception and bail out if a pooled job has been unable to join or create a cluster (fixes #2199).
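A timeout of this kind might look like the loop below. This is a sketch under assumptions: `PoolTimeoutError`, `try_join_or_create`, and the 30-second polling interval are hypothetical, and the exception mrjob actually raises may differ.

```python
import time


class PoolTimeoutError(Exception):
    """Hypothetical exception for a job that could not get a cluster."""


def get_cluster_with_timeout(try_join_or_create, pool_timeout_minutes,
                             poll_seconds=30, clock=time.monotonic,
                             sleep=time.sleep):
    """Retry until a cluster is obtained or the timeout expires.

    try_join_or_create is a hypothetical callback that returns a
    cluster ID, or None if the job could neither join nor create one.
    """
    deadline = clock() + pool_timeout_minutes * 60

    while True:
        cluster_id = try_join_or_create()
        if cluster_id is not None:
            return cluster_id

        if clock() >= deadline:
            raise PoolTimeoutError(
                'no cluster after %d minutes' % pool_timeout_minutes)

        sleep(poll_seconds)
```

Passing `clock` and `sleep` as parameters keeps the loop testable without real delays.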

Yelp / mrjob #2202