concourse / pool-resource

atomically manages the state of the world (e.g. external environments)
Apache License 2.0
54 stars 36 forks source link

Implement an atomic lock check action. #37

Closed josecv closed 5 years ago

josecv commented 6 years ago

Analogous to a hardware lock check, this will atomically try to acquire a specified lock (and wait until it can), then immediately release the lock.

Co-authored-by: Rafay rafay@autonomic.ai

pivotal-issuemaster commented 6 years ago

@josecv Please sign the Contributor License Agreement!

Click here to manually synchronize the status of this Pull Request.

See the FAQ for frequently asked questions.

pivotal-issuemaster commented 6 years ago

@josecv Thank you for signing the Contributor License Agreement!

josecv commented 6 years ago

docs have been added

vito commented 6 years ago

I'm interested in hearing the use case for this before merging, in the interest of minimizing knobs. What does this solve over having two puts? What part of your workflow requires just waiting on a lock without holding it before continuing? (Is that really safe?)

josecv commented 6 years ago

we use locks to implement environment freezes. in short, it's safe to have multiple deployments running at once, but at certain points we might choose to prevent all deployments. when a deployment job runs, it attempts to fetch a lock: if it can, that means the environment isn't frozen and deployments are permitted; the lock is immediately released so that a simultaneous deployment is allowed through.

when we need to actually freeze an env, we manually place the lock in claimed state and don't unclaim it until we want the environment to thaw.

at first we did implement this using two puts; unfortunately this turns out quite problematic. if folks interrupt their deploy (or indeed if the deploy crashes) at just the right time, it's possible for the deploy to exit while holding on to the lock, thus deadlocking an environment until the lock is manually released. we considered doing an on_abort or ensure step to release the lock but this turns out to be quite tricky. if the claim dies after it's managed to put the lock but before it's managed to get the lock, then the release will fail whether or not it's inside an ensure (since the lock metadata is not available to it). if we put a get before the release to make sure that the release will succeed, then we run the chance of an aborted deploy unlocking an intentionally locked environment.

an atomic check operation solves this problem, since it happens all at once.

vito commented 6 years ago

Interesting. This will indeed help, but it still won't completely save you from aborts. Someone could click 'abort' and cause the put operation itself to be interrupted in between the claim and unclaim.

Going back to your use case, it actually sounds a bit like a read-write lock or exclusive/shared acquire semantics. This is discussed in some form as https://github.com/concourse/pool-resource/issues/35 and https://github.com/concourse/concourse/issues/842. Some form of that feels like a better approach to me - the utility of the check verb here was not immediately apparent. I may be missing something here but it also seems like there's nothing preventing the 'freezer' or 'cleanup' job (i.e. the exclusive acquirer) from running while existing deploys are in flight, because they just immediately release their lock.

I'd be happy to work with y'all towards a proposed API/approach if you're interested.

vito commented 5 years ago

Sorry but I'm gonna close this out; I'm still fishy on this approach to the problem. Happy to hash out the other stuff I mentioned at some point though.

alnavart commented 2 years ago

Hi! I'm experiencing the same issue than @josecv commented on https://github.com/concourse/pool-resource/pull/37#issuecomment-391384885 Are there any update on this approach about atomic operation? or maybe disallowing to abort on some specific steps..