Yelp / mrjob

Run MapReduce jobs on Hadoop or Amazon Web Services
http://packages.python.org/mrjob/

cluster locks are never released #2162

Closed coyotemarin closed 4 years ago

coyotemarin commented 4 years ago

Currently, our pooling code will "lock" a cluster to make sure that two jobs won't get simultaneously added to the same pooled cluster.

However, there isn't really a mechanism for releasing a lock. Until now, a lock has effectively expired once a job adds its step(s), because the lock is keyed on both the cluster ID and the number of steps the cluster has. Now that we're avoiding looking at pooled clusters' steps (see #2159), we need to release the lock explicitly. Basically, as soon as the cluster is no longer WAITING, it's safe to release the lock.
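To make the difference concrete, here is a minimal sketch, not mrjob's actual pooling code. The in-memory `locks` set and the names `step_based_key`, `try_lock`, and `release_if_not_waiting` are all hypothetical stand-ins for wherever the real lock records live. It shows why a key built from the cluster ID plus step count expires on its own, and why a key built from the cluster ID alone needs an explicit release.

```python
# Hypothetical in-memory stand-in for the real lock storage.
locks = set()


def step_based_key(cluster_id, num_steps):
    """Old-style key: changes whenever a step is added to the cluster."""
    return f"{cluster_id}/{num_steps}"


def try_lock(key):
    """Return True if we acquired the lock, False if someone else holds it."""
    if key in locks:
        return False
    locks.add(key)
    return True


def release_if_not_waiting(key, cluster_state):
    """Explicitly release the lock once the cluster has left WAITING."""
    if cluster_state != "WAITING":
        locks.discard(key)
        return True
    return False


# Old scheme: adding a step changes the key, so the old lock stops mattering.
try_lock(step_based_key("j-EXAMPLE", 0))          # first job locks the idle cluster
print(try_lock(step_based_key("j-EXAMPLE", 0)))   # False: a second job is blocked
print(try_lock(step_based_key("j-EXAMPLE", 1)))   # True: after a step is added, the key is new

# New scheme: keyed on cluster ID only, so it stays locked until released.
try_lock("j-OTHER")
print(try_lock("j-OTHER"))                        # False: still locked
release_if_not_waiting("j-OTHER", "RUNNING")
print(try_lock("j-OTHER"))                        # True: released once not WAITING
```

The cluster IDs here are made up, and a real implementation would need storage shared between runners rather than a process-local set.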

It also makes sense to expire cluster locks rather quickly. A lock only needs to last long enough for the user to submit steps and for the cluster to go into the RUNNING state (which might happen immediately; need to check this).

coyotemarin commented 4 years ago

Looks like it takes maybe 10 seconds for the cluster to notice the new steps and switch into the RUNNING state.

Probably best to have the lock automatically expire after a minute. We can also unlock it earlier: when we see the cluster go into a non-WAITING state, on runner cleanup, or in _relaunch().
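A rough sketch of that policy, assuming a single process and an in-memory table. The class `ExpiringClusterLocks` and its methods are hypothetical, not anything in mrjob. The clock is injectable so the one-minute expiry can be exercised without actually waiting.

```python
import time

LOCK_TTL_SECONDS = 60.0  # "a minute", per the comment above


class ExpiringClusterLocks:
    """Hypothetical per-cluster locks that expire on their own after a TTL."""

    def __init__(self, ttl=LOCK_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._expires_at = {}  # cluster_id -> time at which the lock lapses

    def acquire(self, cluster_id):
        """Take the lock unless an unexpired lock already exists."""
        now = self._clock()
        expires = self._expires_at.get(cluster_id)
        if expires is not None and now < expires:
            return False
        self._expires_at[cluster_id] = now + self._ttl
        return True

    def release(self, cluster_id):
        """Unlock early, e.g. on runner cleanup. Safe to call if not locked."""
        self._expires_at.pop(cluster_id, None)

    def release_if_not_waiting(self, cluster_id, state):
        """Unlock early once the cluster is seen in any non-WAITING state."""
        if state != "WAITING":
            self.release(cluster_id)


# Usage with a fake clock (cluster ID is made up):
now = [0.0]
cluster_locks = ExpiringClusterLocks(clock=lambda: now[0])
print(cluster_locks.acquire("j-EXAMPLE"))   # True
print(cluster_locks.acquire("j-EXAMPLE"))   # False: locked by the first caller
now[0] = 61.0
print(cluster_locks.acquire("j-EXAMPLE"))   # True: the old lock expired
```

With locks shared between machines, the expiry would have to be based on a timestamp all runners agree on rather than a local monotonic clock, and the ~10 second delay before RUNNING is what makes a one-minute TTL comfortably long.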