containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

User service-lingered Kube Quadlet fails to start: "no such pod" #24468

Open jack-avery opened 3 weeks ago

jack-avery commented 3 weeks ago

Issue Description

My Kube Quadlet acts funny on a reboot. The Quadlet fails to start, complaining that the pod it is meant to create does not exist.

Steps to reproduce the issue

  1. Create a kube.yaml with any spec
  2. Create ~/.config/containers/systemd/testkube.kube pointing to the .yaml
  3. Reboot
  4. Run systemctl --user status testkube and podman ps -a and see that nothing is running.
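
For anyone reproducing, a minimal pair of files matching the steps above might look like this (the names, image, and paths are illustrative, not taken from the report):

```yaml
# ~/kube.yaml -- any trivial pod spec works for reproducing (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: testkube
spec:
  containers:
    - name: sleeper
      image: docker.io/library/alpine:latest
      command: ["sleep", "infinity"]
```

```ini
# ~/.config/containers/systemd/testkube.kube (illustrative)
[Kube]
Yaml=/home/tf2server/kube.yaml

[Install]
WantedBy=default.target
```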

Describe the results you received

systemctl --user status ansible-tf2network_dev:

× ansible-tf2network_dev.service - ansible-tf2network: dev servers
     Loaded: loaded (/home/tf2server/.config/containers/systemd/ansible-tf2network_dev.kube; generated)
     Active: failed (Result: exit-code) since Wed 2024-11-06 21:45:39 UTC; 3min 23s ago
    Process: 1314 ExecStart=/usr/local/bin/podman kube play --replace --service-container=true --network net-dev /home/tf2server/dev.yaml (code=exited, status=125)
    Process: 1506 ExecStopPost=/usr/local/bin/podman kube down /home/tf2server/dev.yaml (code=exited, status=0/SUCCESS)
   Main PID: 1314 (code=exited, status=125)
        CPU: 696ms

Nov 06 21:45:39 srv609913 podman[1314]: 2024-11-06 21:45:39.142582762 +0000 UTC m=+0.872459413 image pull 62cc8ac7aac9384085ac142a3937dbfd3d570d6e14f508228666f881f3b0a730 srcds-dev-nae1-ri1:latest
Nov 06 21:45:39 srv609913 podman[1314]: 2024-11-06 21:45:39.355549958 +0000 UTC m=+1.085426609 container remove d08526a767d4b35196a39b5b73f1334fb55c45e5cdcec3149d8d7c2e4ed04db3 (image=localhost/podman-pause:5>
Nov 06 21:45:39 srv609913 ansible-tf2network_dev[1314]: Error: no pod with ID 08d27f845af2a2c14e89c5dedb92c166ef1e91d51ed465f90bbf4ba989861a06 found in database: no such pod
Nov 06 21:45:39 srv609913 systemd[975]: ansible-tf2network_dev.service: Main process exited, code=exited, status=125/n/a
Nov 06 21:45:39 srv609913 ansible-tf2network_dev[1506]: Pods stopped:
Nov 06 21:45:39 srv609913 ansible-tf2network_dev[1506]: Pods removed:
Nov 06 21:45:39 srv609913 ansible-tf2network_dev[1506]: Secrets removed:
Nov 06 21:45:39 srv609913 ansible-tf2network_dev[1506]: Volumes removed:
Nov 06 21:45:39 srv609913 systemd[975]: ansible-tf2network_dev.service: Failed with result 'exit-code'.
Nov 06 21:45:39 srv609913 systemd[975]: Failed to start ansible-tf2network_dev.service - ansible-tf2network: dev servers.

The service only shows as failed because I tried shoving RemainAfterExit=yes into the Quadlet file hoping it would fix this; it didn't. Running systemctl --user start ansible-tf2network_dev immediately afterward works, and the containers stay up as expected.
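
For completeness, the workaround attempt amounted to adding a [Service] section to the Quadlet file, roughly as below. This is a sketch reconstructed from the generated ExecStart shown in the status output above, not a copy of the actual file:

```ini
# ansible-tf2network_dev.kube -- sketch of the attempted workaround
# ([Service] keys in a Quadlet file are passed through to the generated unit)
[Kube]
Yaml=/home/tf2server/dev.yaml
Network=net-dev

[Service]
RemainAfterExit=yes
```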

journalctl --user -xu ansible-tf2network_dev:

-- Boot 065611db1fc94be2a24436ceb47ac82f --
Nov 06 21:45:38 srv609913 ansible-tf2network_dev[1314]: Pods stopped:
Nov 06 21:45:38 srv609913 ansible-tf2network_dev[1314]: Pods removed:
Nov 06 21:45:38 srv609913 ansible-tf2network_dev[1314]: Secrets removed:
Nov 06 21:45:38 srv609913 ansible-tf2network_dev[1314]: Volumes removed:
Nov 06 21:45:38 srv609913 systemd[975]: Starting ansible-tf2network_dev.service - ansible-tf2network: dev servers...
░░ Subject: A start job for unit UNIT has begun execution
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit UNIT has begun execution.
░░
░░ The job identifier is 19.
Nov 06 21:45:38 srv609913 podman[1314]: 2024-11-06 21:45:38.800549389 +0000 UTC m=+0.530426040 image build  2af2a5ac188f827ce3f6d1d8b88257d2066a9e7a59d1bede3a7c04a1de273441
Nov 06 21:45:38 srv609913 podman[1314]: 2024-11-06 21:45:38.854756699 +0000 UTC m=+0.584633350 container create d08526a767d4b35196a39b5b73f1334fb55c45e5cdcec3149d8d7c2e4ed04db3 (image=localhost/podman-pause:5>
Nov 06 21:45:39 srv609913 podman[1314]: 2024-11-06 21:45:39.113158743 +0000 UTC m=+0.843035394 container create b1d62983fd3ae0f6d06e6c9d9438be7fb5dc9422a874923fb1dab31eab6f5e19 (image=localhost/podman-pause:5>
Nov 06 21:45:39 srv609913 podman[1314]: 2024-11-06 21:45:39.122341977 +0000 UTC m=+0.852218618 pod create 08d27f845af2a2c14e89c5dedb92c166ef1e91d51ed465f90bbf4ba989861a06 (image=, name=ansible-tf2network_dev)
Nov 06 21:45:39 srv609913 podman[1314]: 2024-11-06 21:45:39.142582762 +0000 UTC m=+0.872459413 image pull 62cc8ac7aac9384085ac142a3937dbfd3d570d6e14f508228666f881f3b0a730 srcds-dev-nae1-ri1:latest
Nov 06 21:45:39 srv609913 podman[1314]: 2024-11-06 21:45:39.355549958 +0000 UTC m=+1.085426609 container remove d08526a767d4b35196a39b5b73f1334fb55c45e5cdcec3149d8d7c2e4ed04db3 (image=localhost/podman-pause:5>
Nov 06 21:45:39 srv609913 ansible-tf2network_dev[1314]: Error: no pod with ID 08d27f845af2a2c14e89c5dedb92c166ef1e91d51ed465f90bbf4ba989861a06 found in database: no such pod
Nov 06 21:45:39 srv609913 systemd[975]: ansible-tf2network_dev.service: Main process exited, code=exited, status=125/n/a
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit UNIT has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 125.

Describe the results you expected

The Quadlet starts correctly.

podman info output

host:
  arch: amd64
  buildahVersion: 1.38.0-dev
  cgroupControllers:
  - cpuset
  - cpu
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon_2.1.10+ds1-1build2_amd64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.10, commit: unknown'
  cpuUtilization:
    idlePercent: 92.33
    systemPercent: 2.72
    userPercent: 4.96
  cpus: 8
  databaseBackend: sqlite
  distribution:
    codename: noble
    distribution: ubuntu
    version: "24.04"
  eventLogger: journald
  freeLocks: 2045
  hostname: srv609913
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1004
      size: 1
    - container_id: 1
      host_id: 362144
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1004
      size: 1
    - container_id: 1
      host_id: 362144
      size: 65536
  kernel: 6.8.0-48-generic
  linkmode: dynamic
  logDriver: journald
  memFree: 27620450304
  memTotal: 33654013952
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns_1.4.0-5_amd64
      path: /usr/lib/podman/aardvark-dns
      version: aardvark-dns 1.4.0
    package: netavark_1.4.0-4_amd64
    path: /usr/lib/podman/netavark
    version: netavark 1.4.0
  ociRuntime:
    name: crun
    package: crun_1.14.1-1_amd64
    path: /usr/bin/crun
    version: |-
      crun version 1.14.1
      commit: de537a7965bfbe9992e2cfae0baeb56a08128171
      rundir: /run/user/1004/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt_0.0~git20240220.1e6f92b-1_amd64
    version: |
      pasta unknown version
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/user/1004/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns_1.2.1-1build2_amd64
    version: |-
      slirp4netns version 1.2.1
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.5
  swapFree: 0
  swapTotal: 0
  uptime: 0h 4m 20.00s
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries: {}
store:
  configFile: /home/tf2server/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 0
    stopped: 1
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/tf2server/.local/share/containers/storage
  graphRootAllocated: 414921494528
  graphRootUsed: 288760504320
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 19
  runRoot: /run/user/1004/containers
  transientStore: false
  volumePath: /home/tf2server/.local/share/containers/storage/volumes
version:
  APIVersion: 5.3.0-dev
  Built: 1730929440
  BuiltTime: Wed Nov  6 21:44:00 2024
  GitCommit: c0e24c6b60cfae21ab441b40c7ea3d622d09027d
  GoVersion: go1.22.6
  Os: linux
  OsArch: linux/amd64
  Version: 5.3.0-dev

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes (? built from c0e24c6b60cfae21ab441b40c7ea3d622d09027d)

Additional environment details

Ubuntu 24.04 LTS VPS

Additional information

The kube file and quadlet file are visible in this folder. They're Jinja2 templates, but they should be straightforward.

jack-avery commented 3 weeks ago

I have a feeling this may be a result of podman kube play --replace ... (the default ExecStart for .kube Quadlets, it seems) not properly removing the old pod before replacing it. The logs above show that no pod was removed, but one likely was meant to be, judging by the "Pods stopped:"/"Pods removed:" lines.
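
For reference, the service unit that Quadlet generates for this .kube file looks roughly like the following. This is pieced together from the ExecStart/ExecStopPost lines in the status output above, not dumped verbatim from systemctl cat, so treat the remaining keys as assumptions:

```ini
# Generated ansible-tf2network_dev.service (sketch)
[Service]
ExecStart=/usr/local/bin/podman kube play --replace --service-container=true --network net-dev /home/tf2server/dev.yaml
ExecStopPost=/usr/local/bin/podman kube down /home/tf2server/dev.yaml
```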

Luap99 commented 3 weeks ago

We only support the latest upstream version, so please test with podman 5.2.5 or one of the 5.3 RCs. It is possible that such problems have already been fixed.

jack-avery commented 3 weeks ago

I will take a look at trying it with upstream tomorrow.

jack-avery commented 3 weeks ago

Tried installing upstream. A very naive installation: Ubuntu doesn't package upstream, there are no installation instructions, and I don't have a spare computer to install Fedora onto purely to test this.

User socket is present but refuses to connect:

$ podman info 
[...]
Error: unable to connect to Podman socket: Get "http://d/v5.2.5/libpod/_ping": dial unix /run/user/1001/podman/podman.sock: connect: connection refused
jack@srv609913:/run/user/1001/podman$ ls -l
total 0
srw-rw---- 1 jack jack 0 Nov  5 23:47 podman.sock

Giving up on this for now, I'll just make a workaround. If it works I'll drop my solution here. Thanks.

Edit: Apparently podman-remote is a thing. I completely overlooked this. Built from source and can connect to socket now. Will try this again.

jack-avery commented 3 weeks ago

Tested with 5.3.0-dev (c0e24c6b60cfae21ab441b40c7ea3d622d09027d), issue persists with exact same behavior. Edited original issue report.

Luap99 commented 3 weeks ago

creating runtime spec for service container: loading seccomp profile () failed: seccomp not enabled

That sounds like a totally unrelated thing. How did you build podman? You must set the seccomp build tag, as any normal container will default to using a seccomp profile.

jack-avery commented 3 weeks ago

I was never made aware seccomp was not enabled by default. Time to go hunting documentation to figure out how to enable this and rebuild it. I will once again do this tomorrow after university.

Luap99 commented 3 weeks ago

make binaries should set all the proper build tags by default, depending on your environment, I think. Our docs are a bit out of date: https://podman.io/docs/installation#build-tags. As long as the proper libraries are installed, the Makefile should add the right tags by default, and most users should not need to set BUILDTAGS.

jack-avery commented 3 weeks ago

Ahh that's how. OK. I'll shove my keys on my laptop and try this during downtime at uni and report results (est. 12hrs from now). Fingers crossed

jack-avery commented 2 weeks ago

I've rebuilt with seccomp, and this has now produced a new error: instead of complaining about the pod existing, it now complains that the pod does not exist. Additionally, none of my containers are starting now. This could be an issue with how I built it: make clean && make BUILDTAGS="seccomp systemd". Has the Quadlet spec changed between the downstream version in Ubuntu 24.04 and dev?

jack-avery commented 2 weeks ago

Something is definitely very wrong here. Restarting again is giving entirely different errors now: journalctl --user -xu ansible-tf2network_dev

Nov 06 22:01:09 srv609913 conmon[1410]: conmon 4dba44e57436a915c9c7 <nwarn>: runtime stderr: unknown version specified
Nov 06 22:01:09 srv609913 conmon[1410]: conmon 4dba44e57436a915c9c7 <error>: Failed to create container: exit status 1
Nov 06 22:01:09 srv609913 ansible-tf2network_dev[1320]: Error: OCI runtime error: crun: unknown version specified
Nov 06 22:01:09 srv609913 systemd[1049]: ansible-tf2network_dev.service: Main process exited, code=exited, status=125/n/a

Below is a standard .container file created using this Ansible play: journalctl --user -xu dev_relay

Nov 06 22:01:09 srv609913 podman[1276]: 2024-11-06 22:01:09.171031146 +0000 UTC m=+0.149235182 image pull f0697f3b155a506f272260a666edc10fe7f8ab88d08603c91a27d29edcf79142 docker.io/library/rust:1.74-slim-book>
Nov 06 22:01:09 srv609913 podman[1276]: 2024-11-06 22:01:09.235759735 +0000 UTC m=+0.213963751 container create 1a07a444a63af583af1e3a379c7b4cf19ef8e8d7d85cfb5d23635c1be73c7c74 (image=docker.io/library/rust:1>
Nov 06 22:01:09 srv609913 pasta[1343]: Usage: /usr/bin/pasta [OPTION]... [COMMAND] [ARGS]...
[ snip -- output of pasta --help, probably ]
Nov 06 22:01:09 srv609913 pasta[1351]: Couldn't get any nameserver address
Nov 06 22:01:09 srv609913 conmon[1370]: conmon 1a07a444a63af583af1e <nwarn>: runtime stderr: unknown version specified
Nov 06 22:01:09 srv609913 conmon[1370]: conmon 1a07a444a63af583af1e <error>: Failed to create container: exit status 1
Nov 06 22:01:09 srv609913 podman[1276]: 2024-11-06 22:01:09.572790769 +0000 UTC m=+0.550994785 container remove 1a07a444a63af583af1e3a379c7b4cf19ef8e8d7d85cfb5d23635c1be73c7c74 (image=docker.io/library/rust:1>
Nov 06 22:01:09 srv609913 dev_relay[1276]: Error: OCI runtime error: crun: unknown version specified
Nov 06 22:01:09 srv609913 systemd[1049]: dev_relay.service: Main process exited, code=exited, status=126/n/a

Luap99 commented 2 weeks ago

Yeah, there are a lot of other changes; you need a newer crun and pasta, maybe more.

jack-avery commented 2 weeks ago

I'm going to go ahead and say I'm probably not able to test this further. I'm not the sole owner of the box, and I'm not sure how much more I can change before affecting the containers run by the other person. Sorry about this.