Yelp / mrjob

Run MapReduce jobs on Hadoop or Amazon Web Services
http://packages.python.org/mrjob/

minimize API calls when joining pooled clusters #2174

Closed coyotemarin closed 4 years ago

coyotemarin commented 4 years ago

This pull request is based on #2169, so please look at that first.

This change probably makes pooling as efficient in terms of API calls as possible.

The main thing we do is put as much pooling-relevant information as we can in the cluster name (mrjob version, pool name, and pool hash), fixing #2160. This means we can screen out most non-matching clusters from the results of ListClusters alone, without making any per-cluster API calls.
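As a rough illustration of the name-based screening, here is a minimal sketch. The name layout, the regex, and the helper names (`make_cluster_name`, `parse_cluster_name`, `name_matches`) are all hypothetical and are not taken from mrjob's actual code; the sketch also assumes pool names contain no whitespace.

```python
import re

# Hypothetical name layout, for illustration only (not mrjob's real format):
#   mrjob-<version> pool=<pool name> hash=<pool hash>
_NAME_RE = re.compile(
    r'^mrjob-(?P<version>\S+) pool=(?P<pool>\S+) hash=(?P<hash>[0-9a-f]+)$')


def make_cluster_name(version, pool, pool_hash):
    """Encode the pooling-relevant fields into a cluster name."""
    return 'mrjob-%s pool=%s hash=%s' % (version, pool, pool_hash)


def parse_cluster_name(name):
    """Return the encoded fields as a dict, or None for unrelated names."""
    m = _NAME_RE.match(name or '')
    return m.groupdict() if m else None


def name_matches(name, version, pool, pool_hash):
    """True only if all three fields in the name match exactly."""
    wanted = {'version': version, 'pool': pool, 'hash': pool_hash}
    return parse_cluster_name(name) == wanted
```

With something like this, every cluster summary returned by ListClusters can be accepted or rejected by looking at its name alone.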

We then sort clusters by NormalizedInstanceHours divided by the whole number of hours since the cluster became ready, which gives a rough estimate of which cluster has the greatest CPU capacity.
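The sort key could be sketched roughly as follows. The dicts are assumed to mimic the shape of a ListClusters summary (`NormalizedInstanceHours`, `Status.Timeline.ReadyDateTime`). Rounding the elapsed time up, clamping it to at least one hour, and sorting in descending order are assumptions made for this sketch; the helper names are hypothetical.

```python
import math


def capacity_score(cluster, now):
    """NormalizedInstanceHours per whole hour since the cluster was ready."""
    ready = cluster['Status']['Timeline']['ReadyDateTime']
    elapsed_hours = (now - ready).total_seconds() / 3600
    # Assumption: round up and clamp to 1 so we never divide by zero
    whole_hours = max(1, math.ceil(elapsed_hours))
    return cluster['NormalizedInstanceHours'] / whole_hours


def sort_by_capacity(clusters, now):
    """Return clusters with the highest estimated capacity first."""
    return sorted(clusters, key=lambda c: capacity_score(c, now), reverse=True)
```

For example, a cluster with 64 normalized instance hours that became ready 90 minutes ago scores 64 / 2 = 32, so it sorts after a cluster with 48 normalized instance hours that became ready 30 minutes ago (48 / 1 = 48).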

Finally, we go through the sorted list of clusters and do a final check of each cluster's instance groups/fleets (ListInstanceGroups/ListInstanceFleets) and cluster info (DescribeCluster) that can't be exactly matched (e.g. subnet, EBS group volume size). Once we find a matching cluster, we yield it, and we don't look at the remaining clusters unless there's some issue joining it, such as failure to lock the cluster. Locking itself uses DescribeCluster, so we share the final DescribeCluster call with the locking logic.
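The lazy "yield one match, only continue on failure" pattern can be sketched with a generator. `fully_matches` and `try_lock` are hypothetical callables standing in for the per-cluster API checks and the locking logic; this is not mrjob's actual implementation, and it does not model sharing the DescribeCluster result.

```python
def matching_clusters(candidates, fully_matches):
    """Lazily yield candidates that pass the expensive per-cluster check.

    Because this is a generator, fully_matches() is only called for as
    many candidates as the caller actually consumes.
    """
    for cluster in candidates:
        if fully_matches(cluster):
            yield cluster


def join_pooled_cluster(candidates, fully_matches, try_lock):
    """Return the first matching cluster we manage to lock, else None."""
    for cluster in matching_clusters(candidates, fully_matches):
        if try_lock(cluster):
            return cluster
        # Locking failed, so fall through and examine the next candidate
    return None
```

If the first matching cluster locks successfully, later candidates are never inspected, so no API calls are spent on them.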

A lot more information is in the pool hash than there used to be, including whether the cluster uses instance fleets or instance groups. However, the mrjob version is not in the pool hash because it's a separate field in the cluster name. From now on, it's not possible to join a pooled cluster launched by a different version of mrjob (even if it doesn't bootstrap mrjob).
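One way to picture a pool hash is as a digest over a canonical serialization of the pooling-relevant settings. The setting names, the use of JSON, and the choice of MD5 below are assumptions for illustration only, not a description of what mrjob actually hashes.

```python
import hashlib
import json


def pool_hash(settings):
    """Hash pooling-relevant settings into a short, stable hex string.

    sort_keys=True makes the serialization independent of dict ordering,
    so equal settings always produce the same hash.
    """
    canonical = json.dumps(settings, sort_keys=True)
    return hashlib.md5(canonical.encode('utf-8')).hexdigest()


# Hypothetical settings; note the mrjob version is deliberately absent,
# since it lives in its own field of the cluster name
example_settings = {
    'instance_collection_type': 'groups',  # vs. 'fleets'
    'image_version': '5.30.0',
}
```

Any setting folded into the hash must match exactly for two clusters to share a pool, which is why settings that only need to be "at least as good" still require the per-cluster checks.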