coyotemarin closed this issue 4 years ago
Limitations on EMR tag names:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-tags-restrictions.html
Seems like __mrjob_pool_lock_<job_key> would be valid.
It might be just as efficient in terms of API calls to have a single tag, __mrjob_pool_lock, which we set to our unique job key + a timestamp. To do this we need to:
1. DescribeCluster to make sure it's not already locked
2. AddTags to acquire the lock
3. DescribeCluster again to make sure we were the ones who acquired the lock

As opposed to optimistic locking, where we just add our own __mrjob_pool_lock_<job_key> tag and then DescribeCluster to see who won.
However, we don't have to DescribeCluster when we release our lock; we just RemoveTags. We also don't have to worry about a cluster accumulating expired tags, since there's just the one tag used for locking.
I was thinking this might introduce a clock skew issue, but it turns out the S3 code depends on the local clock as well.
It's also not clear to me that we ever need to release the lock, since it only lasts a minute, and most real jobs will take at least that long. This is mostly an issue for testing pooling, but we could handle this by patching lock expiration time to be negative in the tests.
API calls can be delayed. Thinking about timing, we want to check and tag the cluster as quickly as possible, and then pause a bit and check the cluster again to make sure that another job didn't tag the cluster after us. We probably want something like:
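A minimal sketch of that timing guard, with invented function names and timeout values:

```python
import time


def lock_with_timing(check_and_tag, verify_lock,
                     max_acquire_time=10.0, wait_before_recheck=5.0):
    """Acquire the pool lock, but give up if the initial check-and-tag
    took suspiciously long (e.g. due to API throttling).

    check_and_tag() does the DescribeCluster + AddTags calls and returns
    True if the cluster looked unlocked; verify_lock() re-checks that our
    tag is the one holding the lock. Both are hypothetical callables.
    """
    start = time.time()
    if not check_and_tag():
        return False

    # if DescribeCluster/AddTags were delayed, another job may have
    # tagged the cluster in the meantime; don't trust this lock
    if time.time() - start > max_acquire_time:
        return False

    # pause a bit so any job that tagged the cluster just after us
    # shows up in the next DescribeCluster
    time.sleep(wait_before_recheck)
    return verify_lock()
```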
This isn't totally foolproof; for example submitting steps could get throttled and our lock could expire somewhere in the middle of it, in which case we'd probably be better off possibly sharing the cluster with another pooled job than cancelling our steps. But the first three steps can be time-constrained.
We might be better off using the "raw" boto3 client rather than our wrapped connection for the locking step. boto3 itself includes retries, but you can turn them off; see the boto3 config documentation for more details (you pass this as the config keyword arg when creating a client).
Fixed by #2167.
Currently, the code to "lock" pooled clusters so that two jobs won't get submitted to the same cluster simultaneously uses S3, and makes the implicit assumption that everyone using the same pool will use the same cloud_tmp_dir. Instead, we should "lock" clusters with EMR tags, with a format something like __mrjob_pool_lock_<job key>=<timestamp>.
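A rough sketch of what those per-job lock tags could look like, with a hypothetical helper for picking the winner by earliest timestamp. The one-minute expiration comes from the discussion above; the helper names and tie-breaking rule are assumptions:

```python
import time

LOCK_TAG_PREFIX = '__mrjob_pool_lock_'


def make_lock_tag(job_key):
    """Build a per-job lock tag: the key names the job, the value is
    the time we attempted to acquire the lock."""
    return dict(Key=LOCK_TAG_PREFIX + job_key, Value='%.6f' % time.time())


def winning_job_key(tags, expire_after=60.0):
    """Given a cluster's tags, return the job key of the earliest
    unexpired lock tag, or None if the cluster is unlocked.

    Passing a negative expire_after makes every lock look expired,
    which is the trick suggested above for testing pooling.
    """
    now = time.time()
    candidates = []
    for tag in tags:
        if tag['Key'].startswith(LOCK_TAG_PREFIX):
            timestamp = float(tag['Value'])
            if now - timestamp < expire_after:
                job_key = tag['Key'][len(LOCK_TAG_PREFIX):]
                candidates.append((timestamp, job_key))
    if candidates:
        return min(candidates)[1]
    return None
```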