Yelp / mrjob

Run MapReduce jobs on Hadoop or Amazon Web Services
http://packages.python.org/mrjob/

lock clusters with EMR tags, not S3 #2161

Closed: coyotemarin closed this issue 4 years ago

coyotemarin commented 4 years ago

Currently, the code that "locks" pooled clusters, so that two jobs won't get submitted to the same cluster simultaneously, uses S3, and it makes the implicit assumption that everyone using the same pool uses the same cloud_tmp_dir.

Instead, we should "lock" clusters with EMR tags, with a format something like __mrjob_pool_lock_<job key>=<timestamp>.
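A minimal sketch of what setting such a per-job lock tag could look like with boto3's EMR client; the helper name and the choice of a Unix timestamp for the value are illustrative, not mrjob's actual implementation:

```python
import time

import boto3


def add_pool_lock_tag(cluster_id, job_key):
    """Tag *cluster_id* with a per-job pool lock (hypothetical helper)."""
    emr = boto3.client('emr')

    emr.add_tags(
        ResourceId=cluster_id,
        Tags=[{
            'Key': '__mrjob_pool_lock_%s' % job_key,
            'Value': str(time.time()),  # when the lock was taken
        }],
    )
```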

coyotemarin commented 4 years ago

Limitations on EMR tag names:

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-tags-restrictions.html

coyotemarin commented 4 years ago

Seems like __mrjob_pool_lock_<job_key> would be valid.
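For illustration, a small check against the limits described on that page; treat the exact length limit and character set below as assumptions drawn from the linked docs:

```python
import re

# Assumed limits from the EMR tag restrictions page linked above:
# keys up to 128 characters, made of letters, digits, whitespace and
# + - = . _ : /, and keys may not start with "aws:".
_EMR_TAG_KEY_RE = re.compile(r'^[\w\s+\-=.:/]{1,128}$', re.UNICODE)


def is_valid_emr_tag_key(key):
    return bool(_EMR_TAG_KEY_RE.match(key)) and not key.lower().startswith('aws:')


# '__mrjob_pool_lock_<job_key>' uses only letters, digits, and
# underscores, so it passes (assuming job keys do too)
```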

coyotemarin commented 4 years ago

It might be just as efficient in terms of API calls to have a single tag, __mrjob_pool_lock, which we set to our unique job key plus a timestamp. To do this, we'd have to DescribeCluster first and only AddTags if there's no unexpired lock already on the cluster.

This is as opposed to optimistic locking, where we'd just add our own __mrjob_pool_lock_<job_key> tag and then DescribeCluster to see who won.

However, we don't have to DescribeCluster when we release our lock; we just RemoveTags. We also don't have to worry about a cluster accumulating expired tags, since there's just one tag used for locking.
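A rough sketch of the single-tag approach, assuming the tag's value is `<job key> <timestamp>` and a lock is honored for 60 seconds; the names and value format are placeholders:

```python
import time

import boto3

_LOCK_TAG = '__mrjob_pool_lock'
_LOCK_EXPIRY = 60.0  # seconds a lock is honored for (assumed)


def _current_lock(emr, cluster_id):
    """Return the value of the lock tag, or None if it isn't set."""
    cluster = emr.describe_cluster(ClusterId=cluster_id)['Cluster']
    for tag in cluster.get('Tags', []):
        if tag['Key'] == _LOCK_TAG:
            return tag['Value']
    return None


def try_acquire_lock(emr, cluster_id, job_key):
    """Set the lock tag if no other job holds an unexpired lock."""
    value = _current_lock(emr, cluster_id)
    if value:
        locked_by, timestamp = value.rsplit(' ', 1)
        if locked_by != job_key and time.time() < float(timestamp) + _LOCK_EXPIRY:
            return False  # someone else's lock hasn't expired yet

    emr.add_tags(
        ResourceId=cluster_id,
        Tags=[{'Key': _LOCK_TAG, 'Value': '%s %f' % (job_key, time.time())}],
    )
    return True


def release_lock(emr, cluster_id):
    """Releasing is just RemoveTags; no DescribeCluster needed."""
    emr.remove_tags(ResourceId=cluster_id, TagKeys=[_LOCK_TAG])
```

Note that AddTags overwrites the value of an existing key, so the check-then-set above isn't atomic on its own; that's what the check-again-after-a-pause step discussed below is for.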

coyotemarin commented 4 years ago

I was thinking this might introduce a clock skew issue, but it turns out the S3 code depends on the local clock as well.

coyotemarin commented 4 years ago

It's also not clear to me that we ever need to release the lock, since it only lasts a minute, and most real jobs will take at least that long. This is mostly an issue for testing pooling, which we could handle by patching the lock expiration time to be negative in the tests.
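For illustration, one way that patching could look; the constant name and module path are hypothetical:

```python
from unittest import TestCase
from unittest.mock import patch


class PoolLockTestCase(TestCase):

    # patching the (hypothetical) expiration constant to a negative value
    # makes every lock look expired, so a second job can take it over
    @patch('mrjob.pool._POOL_LOCK_EXPIRY', -1.0)
    def test_expired_lock_can_be_taken_over(self):
        ...  # acquire a lock as job A, then verify job B can acquire it
```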

coyotemarin commented 4 years ago

API calls can be delayed. Thinking about timing, we want to check and tag the cluster as quickly as possible, and then pause a bit and check the cluster again to make sure that another job didn't tag the cluster after us. We probably want something like:

1. DescribeCluster to make sure no other job holds an unexpired lock
2. AddTags to set our own lock
3. wait a few seconds, then DescribeCluster again to make sure our lock is still the one on the cluster
4. AddJobFlowSteps to submit our steps

This isn't totally foolproof; for example, submitting steps could get throttled and our lock could expire somewhere in the middle of it, in which case we'd probably be better off risking sharing the cluster with another pooled job than cancelling our steps. But the first three steps can be time-constrained.
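A sketch of that flow, reusing the single __mrjob_pool_lock tag from above; the delay and expiry values are placeholders, not mrjob's actual settings:

```python
import time

import boto3

_LOCK_TAG = '__mrjob_pool_lock'
_LOCK_EXPIRY = 60.0   # seconds a lock is honored for (assumed)
_CHECK_DELAY = 5.0    # pause before re-checking the lock (assumed)


def _current_lock(emr, cluster_id):
    """Return the value of the lock tag, or None if it isn't set."""
    cluster = emr.describe_cluster(ClusterId=cluster_id)['Cluster']
    return {t['Key']: t['Value'] for t in cluster.get('Tags', [])}.get(_LOCK_TAG)


def lock_and_submit(emr, cluster_id, job_key, steps):
    """Try to lock *cluster_id* and submit *steps*; return None on failure."""
    start = time.time()
    our_value = '%s %f' % (job_key, time.time())

    # 1. make sure no other job holds an unexpired lock
    value = _current_lock(emr, cluster_id)
    if value and time.time() < float(value.rsplit(' ', 1)[1]) + _LOCK_EXPIRY:
        return None

    # 2. set our own lock
    emr.add_tags(ResourceId=cluster_id,
                 Tags=[{'Key': _LOCK_TAG, 'Value': our_value}])

    # 3. pause, re-check that our lock is still the one on the cluster,
    #    and bail out if steps 1-3 took too long (e.g. throttled API calls)
    time.sleep(_CHECK_DELAY)
    if (_current_lock(emr, cluster_id) != our_value or
            time.time() > start + _LOCK_EXPIRY):
        return None

    # 4. submit our steps; this part isn't time-constrained
    return emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
```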

coyotemarin commented 4 years ago

We might be better off using the "raw" boto3 client rather than our wrapped connection for the locking step. boto3 itself includes retries, but you can turn them off; see the boto3 config documentation for more details (you pass this as the config keyword arg when creating a client).
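For example, retries could be disabled for just the locking client by passing a botocore Config with a single attempt in the "standard" retry mode; the variable names here are illustrative:

```python
import boto3
from botocore.config import Config

# one attempt total, i.e. no automatic retries
_NO_RETRY_CONFIG = Config(retries={'max_attempts': 1, 'mode': 'standard'})

# a "raw" EMR client used only for the locking calls, so throttled
# DescribeCluster/AddTags calls fail fast instead of being retried
raw_emr = boto3.client('emr', config=_NO_RETRY_CONFIG)
```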

coyotemarin commented 4 years ago

Fixed by #2167.