jfharden opened 2 months ago
I've been doing some debugging, here's what I've found so far:
When the container initially launches and the lock is already claimed, so the resource sits waiting for it, these processes are visible (in all these examples I'm deliberately leaving my bash and ps commands from the hijacked session out of the listings):
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 11:29 ? 00:00:00 /tmp/gdn-init
root 8 0 0 11:29 ? 00:00:00 /bin/sh /opt/resource/out /tmp/build/put
root 20 1 0 11:29 ? 00:00:00 ssh-agent
root 33 8 0 11:29 ? 00:00:00 /opt/go/out /tmp/build/put
If you attempt to abort the task via the Concourse UI, only /bin/sh /opt/resource/out /tmp/build/put (PID 8 above) is killed:
root@fb6bdc04-1dbc-4d75-7aec-24e2dc70895e:/tmp/build/put# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 11:29 ? 00:00:00 /tmp/gdn-init
root 20 1 0 11:29 ? 00:00:00 ssh-agent
root 33 1 0 11:29 ? 00:00:00 /opt/go/out /tmp/build/put
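That PPID change on /opt/go/out (from 8 to 1) looks like ordinary orphan reparenting. A minimal sketch of my own (illustration only, not the resource's code), assuming a standard Linux container where PID 1 adopts orphans:

```shell
#!/bin/sh
# Illustration: background a child from a short-lived parent shell, then
# inspect the orphan's new parent. In a container this is normally the
# init process, PID 1 (here /tmp/gdn-init).
pid=$(sh -c 'sleep 30 >/dev/null 2>&1 & echo $!')
sleep 1                                   # give the parent time to exit
ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
echo "orphan $pid reparented to PPID $ppid"
kill "$pid" 2>/dev/null                   # clean up
```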
If you just allow this to run, it will eventually claim the lock, but the job in Concourse will hang forever, printing no more output, even though the /opt/go/out process will have terminated:
root@fb6bdc04-1dbc-4d75-7aec-24e2dc70895e:/tmp/build/put# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 11:29 ? 00:00:00 /tmp/gdn-init
root 20 1 0 11:29 ? 00:00:00 ssh-agent
When I kill the ssh-agent (kill 20) the task actually finishes and I'm left with only gdn-init:
root@fb6bdc04-1dbc-4d75-7aec-24e2dc70895e:/tmp/build/put# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 11:29 ? 00:00:00 /tmp/gdn-init
However, if the task had managed to claim the lock, the lock now stays claimed (effectively deadlocked) until you intervene manually.
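My working theory (an assumption on my part, not something I've confirmed in the Concourse source) is that ssh-agent inherits the script's stdout/stderr pipes, and that Concourse waits for EOF on those pipes, which only arrives once every process holding a write end has exited. A minimal sketch of that pipe behaviour, with a sleep standing in for ssh-agent:

```shell
#!/bin/sh
# The reader of a pipe only sees EOF once *every* writer has closed it.
# A backgrounded child that inherits the pipe keeps it open long after
# the script itself has finished.
start=$(date +%s)
sh -c 'sleep 2 & echo script-done' | cat   # cat blocks until the sleep exits
elapsed=$(( $(date +%s) - start ))
echo "reader unblocked after ${elapsed}s"
```

If that theory holds, starting the agent with its output detached (e.g. redirected to /dev/null) would release the pipe, but I haven't verified that against the resource.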
It feels like the trap at https://github.com/concourse/pool-resource/blob/master/assets/common.sh#L12 isn't working. I don't see why it wouldn't, and I can't replicate it locally.
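One generic shell gotcha that would at least match the process tables above (a hypothetical sketch, not the actual common.sh code): a trap in the outer script only runs in that script's own process, and does nothing to children it has already spawned unless it kills them explicitly:

```shell
#!/bin/sh
# Hypothetical sketch: the parent's EXIT trap fires, yet the child it
# spawned keeps running -- mirroring /opt/go/out surviving PID 8's death.
childpid=$(sh -c 'trap "echo parent trap ran >&2" EXIT
                  sleep 30 >/dev/null 2>&1 &
                  echo $!')
sleep 1
if kill -0 "$childpid" 2>/dev/null; then
  echo "child $childpid outlived its parent and its trap"
fi
kill "$childpid" 2>/dev/null   # clean up
```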
I've set up a pipeline using a lock pool with username & password (a GitHub Personal Access Token) and the behaviour changes slightly. While the job is waiting for the lock you still can't abort it; it continues waiting. However, once it claims the lock (assuming it eventually can), the job does then terminate as interrupted.
The expected processes are running:
root@13059de6-1631-4b23-5a67-93d57bac4850:/tmp/build/put# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 15:03 ? 00:00:00 /tmp/gdn-init
root 7 0 0 15:03 ? 00:00:00 /bin/sh /opt/resource/out /tmp/build/put
root 25 7 0 15:03 ? 00:00:00 /opt/go/out /tmp/build/put
If I try to cancel the job through the UI, the shell script (PID 7 above) again exits, but the go command is still running:
root@13059de6-1631-4b23-5a67-93d57bac4850:/tmp/build/put# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 15:03 ? 00:00:00 /tmp/gdn-init
root 25 1 0 15:03 ? 00:00:00 /opt/go/out /tmp/build/put
If I kill the go process in the hijacked session (kill 25 in the example above), the task correctly exits.
Sorry, this shouldn't have closed; I merged a PR into a fork.
Describe the bug
When trying to claim a lock which is already claimed, and therefore waiting for the lock to be released, the task does not respond correctly to cancel requests such as a job timeout, a cancel from the Concourse UI, or a cancel issued via the CLI.
The only way I've found to terminate the job is to hijack the container and kill the running ssh-agent process inside it.
Reproduction steps
I think very likely related
Expected behavior
Additional context
No response