concourse / pool-resource

atomically manages the state of the world (e.g. external environments)
Apache License 2.0

Cannot cancel the build while waiting for a lock #70

Open jfharden opened 2 months ago

jfharden commented 2 months ago

Describe the bug

When trying to claim a lock that is already claimed, and therefore waiting for the lock to be released, the task does not respond to cancel requests: a job timeout, a cancel in the Concourse UI, and a cancel issued via the CLI are all ignored.

The only way I've found to terminate the job is to hijack the container and kill the running ssh-agent process inside it.

Reproduction steps

  1. Using concourse 7.11.2 with containerd runtime
  2. Run 2 jobs trying to claim the same lock using private key auth
  3. Once a lock has been acquired by 1 job, try and cancel the second job in the concourse UI
  4. Observe that the claim task job never gets terminated and the job doesn't fail

I think the following is very likely related:

  1. Using concourse 7.11.2 with containerd runtime
  2. Configure 2 jobs which take 20 minutes to complete, and have a claim lock step with a 10 minute timeout. The locks must be configured to use private key auth
  3. Run the 2 jobs
  4. Once a lock has been acquired and held for over 10 minutes notice the timeout does not apply
  5. Wait for the running job with the lock to complete
  6. Notice the job still waiting for the lock now says claimed, but hangs forever
  7. Hijack the hanging job. Notice there are no processes running in the container other than your hijack session and an ssh-agent. If you kill the ssh-agent, the job immediately enters "Timeout reached" status and the lock is now deadlocked.
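For reference, a minimal pipeline fragment for this repro might look like the sketch below. All resource/job names, the lock repo URI, and the task image are placeholders; `acquire`/`release` are real pool-resource params and `timeout` is the standard Concourse step modifier:

```yaml
# Hypothetical repro pipeline; names and URIs are made up.
resources:
- name: env-pool
  type: pool
  source:
    uri: git@github.com:example/locks.git   # assumed lock repo
    branch: main
    pool: envs
    private_key: ((lock-private-key))       # private key auth, as in the repro

jobs:
- name: long-job-a   # duplicate as long-job-b to create lock contention
  plan:
  - put: env-pool
    params: {acquire: true}
    timeout: 10m     # the timeout that is not honoured while waiting
  - task: twenty-minute-task
    config:
      platform: linux
      image_resource:
        type: registry-image
        source: {repository: busybox}
      run: {path: sleep, args: ["1200"]}
  - put: env-pool
    params: {release: env-pool}
```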

Expected behavior

  1. Clicking to cancel the job in the concourse UI actually cancels it
  2. Timeouts are respected

Additional context

No response

jfharden commented 2 months ago

I've been doing some debugging, here's what I've found so far:

When the container first launches and the lock is already claimed, so the task is in the waiting state, these processes are visible (in all these examples I'm deliberately leaving out the bash and ps commands from my hijacked session):

```
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 11:29 ?        00:00:00 /tmp/gdn-init
root           8       0  0 11:29 ?        00:00:00 /bin/sh /opt/resource/out /tmp/build/put
root          20       1  0 11:29 ?        00:00:00 ssh-agent
root          33       8  0 11:29 ?        00:00:00 /opt/go/out /tmp/build/put
```

If you attempt to abort the task via the Concourse UI, only `/bin/sh /opt/resource/out /tmp/build/put` (PID 8 above) is killed:

```
root@fb6bdc04-1dbc-4d75-7aec-24e2dc70895e:/tmp/build/put# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 11:29 ?        00:00:00 /tmp/gdn-init
root          20       1  0 11:29 ?        00:00:00 ssh-agent
root          33       1  0 11:29 ?        00:00:00 /opt/go/out /tmp/build/put
```

If you just allow this to run, it will eventually claim the lock, but the job in Concourse hangs forever, printing no further output: [Screenshot 2024-05-15 at 12 32 13]

However the /opt/go/out process will have terminated:

```
root@fb6bdc04-1dbc-4d75-7aec-24e2dc70895e:/tmp/build/put# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 11:29 ?        00:00:00 /tmp/gdn-init
root          20       1  0 11:29 ?        00:00:00 ssh-agent
```

When I kill the ssh-agent (`kill 20`), the task actually finishes and I'm left with only gdn-init:

```
root@fb6bdc04-1dbc-4d75-7aec-24e2dc70895e:/tmp/build/put# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 11:29 ?        00:00:00 /tmp/gdn-init
```

[Screenshot 2024-05-15 at 12 36 13]

However, if the task managed to claim the lock, the lock is now deadlocked until you intervene manually.

jfharden commented 2 months ago

It feels like this trap isn't working https://github.com/concourse/pool-resource/blob/master/assets/common.sh#L12

I don't see why it wouldn't and I can't replicate it locally.
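For what it's worth, one way that trap could be bypassed entirely is if the out script is killed with an unhandleable signal: an EXIT/TERM trap never runs on SIGKILL, so any child (the ssh-agent) survives and keeps the container's process tree alive. A self-contained sketch with stand-in processes (this is not the real resource code, and whether Concourse actually sends SIGKILL here is an assumption):

```shell
#!/bin/sh
# Demo: an EXIT/TERM trap set in a parent shell never fires if the
# shell receives SIGKILL, so its background child lives on.
script=$(mktemp)
pidfile=$(mktemp)
cat > "$script" <<EOF
trap 'kill "\$agent" 2>/dev/null' EXIT TERM
sleep 300 &                 # stand-in for ssh-agent
agent=\$!
echo "\$agent" > "$pidfile"
sleep 300                   # stand-in for waiting on the lock
EOF
sh "$script" &
parent=$!
sleep 1                       # let the script start its children
kill -KILL "$parent"          # SIGKILL: no trap can run
sleep 1
agent=$(cat "$pidfile")
if kill -0 "$agent" 2>/dev/null; then
  echo "agent $agent survived: the EXIT trap never ran"
  kill "$agent"               # manual cleanup, like killing ssh-agent by hand
fi
rm -f "$script" "$pidfile"
```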

jfharden commented 2 months ago

> It feels like this trap isn't working https://github.com/concourse/pool-resource/blob/master/assets/common.sh#L12
>
> I don't see why it wouldn't and I can't replicate it locally.

I've set up a pipeline using a lock pool with username and password (a GitHub personal access token) and the behaviour changes slightly.

While the job is waiting for the lock you still can't abort it; it keeps waiting. However, once it claims the lock (assuming it eventually can), the job does then terminate as interrupted: [Screenshot 2024-05-15 at 16 01 31]

The expected processes are running:

```
root@13059de6-1631-4b23-5a67-93d57bac4850:/tmp/build/put# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 15:03 ?        00:00:00 /tmp/gdn-init
root           7       0  0 15:03 ?        00:00:00 /bin/sh /opt/resource/out /tmp/build/put
root          25       7  0 15:03 ?        00:00:00 /opt/go/out /tmp/build/put
```

If I try to cancel the job through the UI, again the shell script exits (PID 7 above) but the go command keeps running:

```
root@13059de6-1631-4b23-5a67-93d57bac4850:/tmp/build/put# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 15:03 ?        00:00:00 /tmp/gdn-init
root          25       1  0 15:03 ?        00:00:00 /opt/go/out /tmp/build/put
```

If I kill the go process in the hijacked session (`kill 25` in the example above), the task correctly exits: [Screenshot 2024-05-15 at 16 07 19]

jfharden commented 1 month ago

Sorry, this shouldn't have closed; I merged a PR into a fork.