containers/podman
Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Error forwarding signal 18 (sometimes 15) to container ... #8086

Closed: asottile closed this issue 2 years ago

asottile commented 3 years ago

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

This is distilled down from a larger example; it reproduces with only coreutils xargs / timeout + sleep.

This is a pretty reliable example:

$ yes | head -10 | xargs --replace -P5 timeout 2 podman run --rm --init ubuntu:focal sleep 5
ERRO[0000] container not running                        
ERRO[0002] Error forwarding signal 18 to container d9ae2ff8a421a835771930f9743210c994e2a162df6d4099eded6f5609048c6c: error sending signal to container d9ae2ff8a421a835771930f9743210c994e2a162df6d4099eded6f5609048c6c: `/usr/lib/cri-o-runc/sbin/runc kill d9ae2ff8a421a835771930f9743210c994e2a162df6d4099eded6f5609048c6c 18` failed: exit status 1 
ERRO[0000] container not running                        
ERRO[0002] Error forwarding signal 18 to container 69b4770e164abce77ecb0c0105414ed59a274296d13703eeaebc00b5f7873283: error sending signal to container 69b4770e164abce77ecb0c0105414ed59a274296d13703eeaebc00b5f7873283: `/usr/lib/cri-o-runc/sbin/runc kill 69b4770e164abce77ecb0c0105414ed59a274296d13703eeaebc00b5f7873283 18` failed: exit status 1 

occasionally I'll get outputs like this:

+ timeout 2 podman run --rm runner-image:latest python3 -c 'import time; time.sleep(5)'
ERRO[0002] Error forwarding signal 18 to container 473036e98db1884e7a6baac667a135db7e5176b9c34db0c885b6bc10efd6fd11: can only kill running containers. 473036e98db1884e7a6baac667a135db7e5176b9c34db0c885b6bc10efd6fd11 is in state stopped: container state improper 

(runner-image is this one)

My actual use case, however, ends like this:

time="2020-10-19T15:07:21-07:00" level=error msg="container not running"
time="2020-10-19T15:07:21-07:00" level=error msg="Error forwarding signal 15 to container 31e767a031b7cd086dbda154b9df5adaff694fd111e7feedee748d91c7eedcdb: error sending signal to container 31e767a031b7cd086dbda154b9df5adaff694fd111e7feedee748d91c7eedcdb: `/usr/lib/cri-o-runc/sbin/runc kill 31e767a031b7cd086dbda154b9df5adaff694fd111e7feedee748d91c7eedcdb 15` failed: exit status 1"

With -i, it occasionally deadlocks, breaking most podman commands (such as podman ps for that user). The only way I've been able to recover from the deadlock is to kill -9 that podman process:

+ timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'

(hung forever)

strace of podman ps shows it stuck in a futex wait loop (FUTEX_WAIT with a NULL timeout never times out), presumably waiting on a lock:

$ strace podman ps

...

futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL

the signals in question are:

15: SIGTERM
18: SIGCONT
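
the SIGCONT presumably comes from timeout itself: coreutils timeout follows the termination signal with a SIGCONT in case the child has been stopped. A quick, untested way to check which signals timeout actually sends:

$ strace -f -e trace=kill timeout 2 sleep 5
# after ~2s this should show kill(<child>, SIGTERM) followed by kill(<child>, SIGCONT)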

Steps to reproduce the issue:

  1. See commands above

Describe the results you received:

See output above

Describe the results you expected:

I expect things to exit gracefully, and not deadlock

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

$ podman version
Version:      2.1.1
API Version:  2.0.0
Go Version:   go1.15.2
Built:        Wed Dec 31 16:00:00 1969
OS/Arch:      linux/amd64

Output of podman info --debug:

$ podman info --debug
host:
  arch: amd64
  buildahVersion: 1.16.1
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: 'conmon: /usr/libexec/podman/conmon'
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.20, commit: '
  cpus: 5
  distribution:
    distribution: ubuntu
    version: "20.04"
  eventLogger: journald
  hostname: babibox
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.4.0-51-generic
  linkmode: dynamic
  memFree: 5532704768
  memTotal: 8348299264
  ociRuntime:
    name: runc
    package: 'cri-o-runc: /usr/lib/cri-o-runc/sbin/runc'
    path: /usr/lib/cri-o-runc/sbin/runc
    version: 'runc version spec: 1.0.2-dev'
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  rootless: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.1.4
      commit: unknown
      libslirp: 4.3.1-git
      SLIRP_CONFIG_VERSION_MAX: 3
  swapFree: 1964396544
  swapTotal: 1964396544
  uptime: 5h 13m 0.85s (Approximately 0.21 days)
registries:
  search:
  - docker.io
  - quay.io
store:
  configFile: /home/asottile/.config/containers/storage.conf
  containerStore:
    number: 2
    paused: 0
    running: 2
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: 'fuse-overlayfs: /usr/bin/fuse-overlayfs'
      Version: |-
        fusermount3 version: 3.9.0
        fuse-overlayfs: version 1.1.0
        FUSE library version 3.9.0
        using FUSE kernel interface version 7.31
  graphRoot: /home/asottile/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 23
  runRoot: /run/user/1000/containers
  volumePath: /home/asottile/.local/share/containers/storage/volumes
version:
  APIVersion: 2.0.0
  Built: 0
  BuiltTime: Wed Dec 31 16:00:00 1969
  GitCommit: ""
  GoVersion: go1.15.2
  OsArch: linux/amd64
  Version: 2.1.1

Package info (e.g. output of rpm -q podman or apt list podman):

$ apt list podman
Listing... Done
podman/unknown,now 2.1.1~2 amd64 [installed]
podman/unknown 2.1.1~2 arm64
podman/unknown 2.1.1~2 armhf
podman/unknown 2.1.1~2 s390x

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

VirtualBox + AWS

asottile commented 3 years ago

ah, reproduced the deadlock again.

here's the process tree when it's locked:

podman,11738 run --rm -i runner-image:latest python3 -c import time; time.sleep(5)
  ├─(slirp4netns,11752)
  ├─{podman},11739
  ├─{podman},11740
  ├─{podman},11741
  ├─{podman},11742
  ├─{podman},11743
  ├─{podman},11745
  ├─{podman},11746
  └─{podman},11749

stracing that:

$ strace -p 11738
strace: Process 11738 attached
--- stopped by SIGTTIN ---
mheon commented 3 years ago

SIGTTIN pauses the process it's sent to; if you hit Podman with that while we're holding a lock, then yes, I fully expect it will effectively deadlock until execution restarts. It's probably getting in before we get sig-proxy configured to proxy signals into the container (otherwise it'd be the container process getting the signal).

That snippet you provided seems to be very aggressively sending terminal flow-control signals to Podman. The warning messages come from narrow race conditions, signals arriving after the container has exited but before podman run has exited, and you're sending enough signals to hit them consistently. The deadlock is more concerning: something sent us a SIGTTIN to pause execution but never sent a SIGCONT to resume it. Podman can't do anything about this; we received a stop signal and respected it.
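
For context, SIGTTIN's default disposition stops the receiving process until a SIGCONT arrives. A minimal demonstration in any shell, unrelated to Podman:

$ sleep 60 &
$ kill -TTIN $!
$ ps -o pid,stat,cmd -p $!   # STAT shows 'T': the process is stopped
$ kill -CONT $!              # and this resumes it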

asottile commented 3 years ago

the xargs was just to make it more reproducible -- it reproduces on its own just very rarely

I have for instance a script which is just:

$ cat t.sh 
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'

I'm not sure what could be sending SIGTTIN -- timeout only seems to send SIGTERM (is it possible podman is accidentally signaling itself here?) -- I could only reproduce this deadlock with -i (maybe something related to setting tty flags?)

the actual case where I encountered the error was with a much larger timeout value (180s), which should be more than long enough for podman to be up and ready -- but I didn't have the patience to run that in a loop

mheon commented 3 years ago

Let me take a look and see if I can reproduce

mheon commented 3 years ago

The "Error forwarding signal" bit definitely reproduces.

The first step is probably to drop that down to an Info-level log when it's caused by the container being in a bad state; that will at least stop spamming the terminal with errors.

mheon commented 3 years ago

I can see a potential race where the Podman process can be killed but the container survives, if a signal arrives in the narrow window between signal proxying starting and the container fully starting (we start proxying before the container is fully started to ensure we don't miss sending anything). I don't think that precisely matches what's going on here, but I will see what can be done about fixing it.

mheon commented 3 years ago

Partial fix in https://github.com/containers/podman/pull/8191

rhatdan commented 3 years ago

@asottile Could you try again with the code in the master branch? @mheon got a partial fix in; see if it helps your issue.

asottile commented 3 years ago

ok adjusting the above script:

timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'

it looks like --rm is broken so I can't really run this in a loop:

$ ./bin/podman ps -a
CONTAINER ID  IMAGE   COMMAND  CREATED  STATUS  PORTS   NAMES
$ time timeout 2 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'

real    0m2.003s
user    0m0.533s
sys 0m1.194s
$ ./bin/podman ps -a
CONTAINER ID  IMAGE                               COMMAND               CREATED        STATUS   PORTS   NAMES
b88107396029  ghcr.io/pre-commit-ci/runner-image  python3 -c import...  4 seconds ago  Created          inspiring_merkle
mheon commented 3 years ago

Oh, you're seeing that --rm is broken as well? I saw it locally on one of my machines, but other team members didn't. Interesting.

mheon commented 3 years ago

Can you confirm if --rm is working as root?

asottile commented 3 years ago

I've never run podman as root, hope I'm doing this right!

$ sudo ./bin/podman run --rm -ti ubuntu:focal echo hi
[sudo] password for asottile: 
Trying to pull docker.io/library/ubuntu:focal...
Getting image source signatures
Copying blob 6a5697faee43 done  
Copying blob a254829d9e55 done  
Copying blob ba13d3bc422b done  
Copying config d70eaf7277 done  
Writing manifest to image destination
Storing signatures
hi
$ sudo ./bin/podman ps -a
CONTAINER ID  IMAGE   COMMAND  CREATED  STATUS  PORTS   NAMES
mheon commented 3 years ago

Yes - that looks right. Rootless only, then. Thanks!

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

rhatdan commented 3 years ago

Is this still an issue with the current main branch?

asottile commented 3 years ago

the --rm issue appears to be fixed; the deadlock is not:

$ bash -x t.sh
+ timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
$ pstree -halp 12047
timeout,12047 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c import time; time.sleep(5)
  └─podman,12048 run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c import time; time.sleep(5)
      ├─(slirp4netns,12060)
      ├─{podman},12049
      ├─{podman},12050
      ├─{podman},12051
      ├─{podman},12052
      ├─{podman},12053
      ├─{podman},12055
      ├─{podman},12056
      └─{podman},12058
$ strace -p 12048
strace: Process 12048 attached
--- stopped by SIGTTIN ---
^Cstrace: Process 12048 detached

I'm also still seeing the "Error forwarding signal 18" issue:

$ timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
ERRO[0000] container not running                        
container not running
ERRO[0004] Error forwarding signal 18 to container bc8b042991732be239cfb2a27e5c5de26debde570dcf2628da0c3485fea8acdc: error sending signal to container bc8b042991732be239cfb2a27e5c5de26debde570dcf2628da0c3485fea8acdc: `/usr/bin/runc kill bc8b042991732be239cfb2a27e5c5de26debde570dcf2628da0c3485fea8acdc 18` failed: exit status 1 
$ git rev-parse HEAD
9c5fe954cca8b4bcb8f552645e1f52a5d9824134
rhatdan commented 3 years ago

@mheon any more thoughts on this?

mheon commented 3 years ago

On the deadlock: we're being sent a SIGTTIN and pausing execution at a very inconvenient time (in the middle of a critical section). Podman is behaving as expected in my view; if we're never sent a signal to un-pause execution, the lock remains held and Podman will deadlock. We could look into doing something similar to the signal-inhibit logic we added to the create logic, to inhibit signals like SIGTTIN while in critical sections, I suppose, but I'm a bit leery of the performance implications of adding/removing a signal handler every time we take a lock.
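
A rough shell analogue of that inhibit idea (a hypothetical sketch, not what Podman actually does today):

trap '' TTIN TTOU TSTP   # ignore terminal stop signals for the duration
# ... critical section: acquire the lock, update state, release the lock ...
trap - TTIN TTOU TSTP    # restore the default dispositions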

It also sounds like we still have some races in --sig-proxy but I have doubts we will ever fully eliminate those. Forwarding signals at this rate was never the intended purpose of --sig-proxy (it was only intended to forward user-sent SIGTERM and similar into the container, not handle any serious volume of signals). Given that it's just a warning message I hesitate to say we should put that much effort in there - might be worth a "Good First Issue" tag?

asottile commented 3 years ago

@mheon it isn't really a rapid succession of signals -- just one SIGTERM is enough to reproduce this. The original xargs command is just to make it happen more often (but it reproduces serially)

I'm a little confused where the SIGTTIN is coming from at all -- if I'm reading correctly, the docs on this indicate that podman would have to attempt to read from stdin while backgrounded in order to receive that signal

mheon commented 3 years ago

When Podman is run attached with -i we are always reading from STDIN to forward it into the container. So podman run -i ... will always be reading from STDIN.
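
A process that reads from the terminal while it is not in the terminal's foreground process group is stopped with SIGTTIN by the kernel. A minimal demo, assuming an interactive bash with job control:

$ cat &
$ jobs -l
# once cat tries to read the tty, the job is reported as: Stopped (tty input)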

Do you have a reproducer with a single SIGTERM? The timing conditions here are extremely narrow, and you're the only person I've heard report them. Are you on a particularly slow VM or embedded system?

asottile commented 3 years ago

> When Podman is run attached with -i we are always reading from STDIN to forward it into the container. So podman run -i ... will always be reading from STDIN.

hmmm, but it's not backgrounded so I'm confused why it gets SIGTTIN
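
one guess (unverified): GNU timeout runs the command in its own process group unless --foreground is given, so relative to the controlling terminal podman effectively is backgrounded, and the kernel stops it with SIGTTIN as soon as it reads stdin. If that's the mechanism, this variant shouldn't deadlock:

timeout --foreground 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'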

I filled a shell script with this:

timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'
timeout 4 bin/podman run --rm -i ghcr.io/pre-commit-ci/runner-image python3 -c 'import time; time.sleep(5)'

and when it doesn't deadlock I get the SIGTERM issue about ~5% of the time (which I showed here: https://github.com/containers/podman/issues/8086#issuecomment-713191267)

I'm in a VM, but I wouldn't consider it slow

asottile commented 3 years ago

also the timing conditions aren't narrow -- I'm seeing this with a 180-second timeout command -- just the minimal reproducers I've been showing are much smaller because I don't think anyone has the patience to run a 3-minute timeout in a loop to debug (I wouldn't even call a 4-second timeout narrow either)

bitstrings commented 3 years ago

Podman v2.2.1

Hitting this bug also.

ERRO[0000] container not running                        
ERRO[0000] Error forwarding signal 15 to container b260052814cadf180bb9be699a37193842259fbe1ed4b75e85130012cc047b0a: error sending signal to container b260052814cadf180bb9be699a37193842259fbe1ed4b75e85130012cc047b0a: `/usr/bin/runc kill b260052814cadf180bb9be699a37193842259fbe1ed4b75e85130012cc047b0a 15` failed: exit status 1 
bitstrings commented 3 years ago

When I press CTRL+C and the entrypoint traps signals and the container is not started with -it, we sometimes get the error. Most of the time, however, the container terminates before it can shut down cleanly (i.e. before trap 'shutdown' SIGTERM SIGINT ... runs).

Using -it seems to make things stable again.

svdHero commented 3 years ago

Same problem here with Podman version 2.2.1.

I have an ASP.NET web service running inside a container. When I do a podman stop -a, the container and the web service inside it are shut down gracefully. However, if I do a CTRL+C, then I get the output:

ERRO[0000] container not running                        
ERRO[0003] Error forwarding signal 15 to container 7a9d095b3287e18c8eb131a0101f64dd7105a0181580e74a295d7570c26cd110: error sending signal to container 7a9d095b3287e18c8eb131a0101f64dd7105a0181580e74a295d7570c26cd110: `/usr/bin/runc kill 7a9d095b3287e18c8eb131a0101f64dd7105a0181580e74a295d7570c26cd110 15` failed: exit status 1

This happens every time.

realJayNay commented 3 years ago

I am also experiencing this issue on Podman version 2.2.1, when pressing Ctrl-C to shut down a container that was started via podman run. Consistently reproducible.

I can confirm that the issue does not occur when the container is started via podman run -it.

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

bitstrings commented 3 years ago

This is still an issue in podman 3.0.1.

rhatdan commented 3 years ago

@mheon Any update or new thoughts?

mheon commented 3 years ago

If people are seeing this with single signals and any sort of consistency, it's not the race I was thinking about. Looking at the error messages, I suppose we're looking at a situation where Podman isn't catching the container transitioning from Running to Stopped, but in that case a lot more would be broken.

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

asottile commented 3 years ago

still a problem in 3.0.1:

$ yes | head -10 | xargs --replace -P5 timeout 2 podman run --rm --init ubuntu:focal sleep 5
2021-04-12T00:20:00.000654469Z: open pidfd: No such process
ERRO[0001] Error forwarding signal 18 to container ab5560e0634a3a3466545770d7def57ea01fedfca530332f0a9d6833d5b7582a: error sending signal to container ab5560e0634a3a3466545770d7def57ea01fedfca530332f0a9d6833d5b7582a: `/usr/bin/crun kill ab5560e0634a3a3466545770d7def57ea01fedfca530332f0a9d6833d5b7582a 18` failed: exit status 1 
$ podman --version
podman version 3.0.1
bitstrings commented 3 years ago

I think it looks better in 3.1.0, but since I can't run rootless I haven't tested enough.

rhatdan commented 3 years ago

I believe this is fixed. Reopen if I am mistaken.

asottile commented 3 years ago

(can't reopen since it was closed by a maintainer)

this is still an issue in 3.1.0 using the reproduction in the original report:

$ yes | head -10 | xargs --replace -P5 timeout 2 podman run --rm --init ubuntu:focal sleep 5
ERRO[0002] Error forwarding signal 18 to container eee9d2ce56463512dc9f12ac70be4b47ac5f988f14d2deb18bf2686af2da81e6: container has already been removed 
$ podman --version
podman version 3.1.0
github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

psilvaggio commented 3 years ago

This is still an issue.

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

asottile commented 3 years ago

this is still an issue

rhatdan commented 3 years ago

@mheon Any progress on this?

bitstrings commented 3 years ago

I've been using 3.1.2 for a while and I don't get the issue. Maybe it's just my use case.

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

asottile commented 3 years ago

I'm still on 3.2.3 (apt hasn't updated yet) but it's still an issue there:

$ yes | head -10 | xargs --replace -P5 timeout 2 podman run --rm --init ubuntu:focal sleep 5
2021-09-01T00:36:14.000528871Z: open pidfd: No such process
ERRO[0002] Error forwarding signal 18 to container e9011c0f5fd9d602d6c0468be5d48efacd6860b8365e3b02dbcedb2fa9eaa6b4: error sending signal to container e9011c0f5fd9d602d6c0468be5d48efacd6860b8365e3b02dbcedb2fa9eaa6b4: `/usr/bin/crun kill e9011c0f5fd9d602d6c0468be5d48efacd6860b8365e3b02dbcedb2fa9eaa6b4 18` failed: exit status 1 
2021-09-01T00:36:14.000538067Z: open pidfd: No such process
ERRO[0002] Error forwarding signal 18 to container cc03ab0a3f5a8d49710b33c6949ffd6cf4046b492eeed9551d3128ebda9e0ce3: error sending signal to container cc03ab0a3f5a8d49710b33c6949ffd6cf4046b492eeed9551d3128ebda9e0ce3: `/usr/bin/crun kill cc03ab0a3f5a8d49710b33c6949ffd6cf4046b492eeed9551d3128ebda9e0ce3 18` failed: exit status 1 
$ podman version
Version:      3.2.3
API Version:  3.2.3
Go Version:   go1.15.2
Built:        Wed Dec 31 19:00:00 1969
OS/Arch:      linux/amd64
bitstrings commented 3 years ago

Ok, no, it still seems broken: if I run stress-ng -c24 in the background, then yes, I get the same errors.

> podman version
Version:      3.3.0
API Version:  3.3.0
Go Version:   go1.16.6
Built:        Fri Aug 20 15:36:14 2021
OS/Arch:      linux/amd64
asottile commented 3 years ago

@rhatdan this isn't fixed; @bitstrings edited their comment after posting

bitstrings commented 3 years ago

It's not fixed. It just doesn't happen as often on my desktop under low load. It is, however, still an issue.

vrothberg commented 2 years ago

I tried reproducing but did not manage to.

@asottile, can you share podman info --debug?

asottile commented 2 years ago

here it is, basically the same as in the original post but with updated versions

a note: this machine is virtualized, so it's significantly slower (others have noted that it otherwise requires a loaded system), which may help you reproduce

$ podman info --debug
host:
  arch: amd64
  buildahVersion: 1.22.3
  cgroupControllers: []
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: 'conmon: /usr/libexec/podman/conmon'
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.27, commit: '
  cpus: 5
  distribution:
    distribution: ubuntu
    version: "20.04"
  eventLogger: journald
  hostname: babibox
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.11.0-34-generic
  linkmode: dynamic
  memFree: 6699016192
  memTotal: 8345640960
  ociRuntime:
    name: crun
    package: 'crun: /usr/bin/crun'
    path: /usr/bin/crun
    version: |-
      crun version 0.20.1.5-925d-dirty
      commit: 0d42f1109fd73548f44b01b3e84d04a279e99d2e
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.1.8
      commit: unknown
      libslirp: 4.3.1-git
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.4.3
  swapFree: 1964396544
  swapTotal: 1964396544
  uptime: 1m 49.01s
registries:
  search:
  - docker.io
  - quay.io
store:
  configFile: /home/asottile/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: 'fuse-overlayfs: /usr/bin/fuse-overlayfs'
      Version: |-
        fusermount3 version: 3.9.0
        fuse-overlayfs: version 1.5
        FUSE library version 3.9.0
        using FUSE kernel interface version 7.31
  graphRoot: /home/asottile/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 7
  runRoot: /run/user/1000/containers
  volumePath: /home/asottile/.local/share/containers/storage/volumes
version:
  APIVersion: 3.3.1
  Built: 0
  BuiltTime: Wed Dec 31 19:00:00 1969
  GitCommit: ""
  GoVersion: go1.16.6
  OsArch: linux/amd64
  Version: 3.3.1
vrothberg commented 2 years ago

Thanks!

@giuseppe should we do some exponential backoff when forwarding the signal?

giuseppe commented 2 years ago

> @giuseppe should we do some exponential backoff when forwarding the signal?

Don't we risk sending the signal multiple times this way? Or would we repeat it only when we get an error from the OCI runtime?
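
To make the question concrete, here is a hypothetical standalone sketch (shell, not Podman code) that re-sends only after the OCI runtime reports a failure, with a bounded exponential backoff:

# hypothetical helper: $1 = container id, $2 = signal number
forward_signal() {
  delay=0.1
  for attempt in 1 2 3 4 5; do
    runc kill "$1" "$2" && return 0   # delivered exactly once; stop retrying
    sleep "$delay"
    delay=$(awk "BEGIN { print $delay * 2 }")
  done
  return 1   # give up; the container is most likely gone
}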