HealthCmd interval in quadlet not being followed + transient timers not cleaned up

lespea commented 5 months ago

Issue Description

With the latest update of podman (v5.1.0) I noticed that in my quadlet definitions the HealthInterval is not being followed but instead the HealthStartupInterval is. Moreover the transient .timer files are being left behind whenever the service is stop/restarted causing many error logs to fill be generated since the container is no longer running but the healthcmd continues to be retried (in my case every few seconds for every container).

Quadlet def:

[Unit]
Description=Podman test
After=network.target
Requires=network.target

[Container]

# General
Image=docker.io/nginx
AutoUpdate=registry
ContainerName=testing
Pull=Always
RunInit=true
Timezone=local

# Health
HealthCmd=sleep 2
HealthInterval=5m
HealthStartupCmd=sleep 5
HealthStartupInterval=2s
HealthStartupTimeout=30s
Notify=healthy

[Service]
Restart=always
RestartSec=10s
RestartSteps=5
RestartMaxDelaySec=5m

[Install]
WantedBy=multi-user.target default.target

Transient logs persisting:

> cd /run/systemd/transient

> ll
.rw-r--r-- 418 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.service
.rw-r--r-- 274 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.timer
.rw-r--r-- 440 root root  3 Jun 09:10 session-1.scope

> sudo systemctl restart test.service

> ll
.rw-r--r-- 418 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.service
.rw-r--r-- 274 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.timer
.rw-r--r-- 418 root root  3 Jun 09:39 21e23a6d475aec6aebb72ad62091d37b42ede0cf986cc98ef05543893d8c7111-startup-58efb645f187fff1.service
.rw-r--r-- 274 root root  3 Jun 09:39 21e23a6d475aec6aebb72ad62091d37b42ede0cf986cc98ef05543893d8c7111-startup-58efb645f187fff1.timer
.rw-r--r-- 440 root root  3 Jun 09:10 session-1.scope

> sudo systemctl stop test.service

> ll
.rw-r--r-- 418 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.service
.rw-r--r-- 274 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.timer
.rw-r--r-- 418 root root  3 Jun 09:39 21e23a6d475aec6aebb72ad62091d37b42ede0cf986cc98ef05543893d8c7111-startup-58efb645f187fff1.service
.rw-r--r-- 274 root root  3 Jun 09:39 21e23a6d475aec6aebb72ad62091d37b42ede0cf986cc98ef05543893d8c7111-startup-58efb645f187fff1.timer
.rw-r--r-- 440 root root  3 Jun 09:10 session-1.scope

Example of a transient service/timer

> bat 2f*
───────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       │ File: 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.service
───────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1   │ # This is a transient unit file, created programmatically via the systemd API. Do not edit.
   2   │ [Unit]
   3   │ Description=/usr/bin/podman healthcheck run 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6
   4   │ 
   5   │ [Service]
   6   │ LogLevelMax=5
   7   │ Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/bin"
   8   │ ExecStart=
   9   │ ExecStart="/usr/bin/podman" "healthcheck" "run" "2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6"
───────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
───────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       │ File: 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.timer
───────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1   │ # This is a transient unit file, created programmatically via the systemd API. Do not edit.
   2   │ [Unit]
   3   │ Description=/usr/bin/podman healthcheck run 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6
   4   │ 
   5   │ [Timer]
   6   │ OnUnitInactiveSec=2s
   7   │ AccuracySec=1s
   8   │ RemainAfterElapse=no
───────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Steps to reproduce the issue

Setup the quadlet as described in the description; reload the systemd daemon and start the service
Notice that the HealthStartupCmd is run, sleep 5
Notice that every 2s the HealthCmd is being run, sleep 2
Stop the service, notice that in journalctl every 2 seconds there is an error log for the startup timer/service since those containers no longer exist
Look in /var/run/systemd/transient/ to see the old timers/services
- You can also see how OnUnitInactiveSec=2s is in the timers which is the interval for the startup health check not the normal one

Describe the results you received

Timers removed on service reset/stop

Describe the results you expected

Initial health cmd runs it's cmd/interval then once health the normal cmd runs its cmds/interval. Also the checks should be removed whenever the container is restarted/shutdown.

podman info output

host:
  arch: amd64
  buildahVersion: 1.36.0
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: /usr/bin/conmon is owned by conmon 1:2.1.12-1
    path: /usr/bin/conmon
    version: 'conmon version 2.1.12, commit: e8896631295ccb0bfdda4284f1751be19b483264'
  cpuUtilization:
    idlePercent: 98.39
    systemPercent: 0.47
    userPercent: 1.15
  cpus: 16
  databaseBackend: sqlite
  distribution:
    distribution: arch
    version: unknown
  eventLogger: journald
  freeLocks: 2047
  hostname: hydra.lespea.org
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 6.9.3-zen1-1-zen
  linkmode: dynamic
  logDriver: journald
  memFree: 45088784384
  memTotal: 67316936704
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: /usr/lib/podman/aardvark-dns is owned by aardvark-dns 1.11.0-1
      path: /usr/lib/podman/aardvark-dns
      version: aardvark-dns 1.11.0
    package: /usr/lib/podman/netavark is owned by netavark 1.11.0-1
    path: /usr/lib/podman/netavark
    version: netavark 1.11.0
  ociRuntime:
    name: crun
    package: /usr/bin/crun is owned by crun 1.15-1
    path: /usr/bin/crun
    version: |-
      crun version 1.15
      commit: e6eacaf4034e84185fd8780ac9262bbf57082278
      rundir: /run/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: /usr/bin/pasta is owned by passt 2024_05_23.765eb0b-1
    version: |
      pasta 2024_05_23.765eb0b
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: false
    path: /run/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: /usr/bin/slirp4netns is owned by slirp4netns 1.3.1-1
    version: |-
      slirp4netns version 1.3.1
      commit: e5e368c4f5db6ae75c2fce786e31eef9da6bf236
      libslirp: 4.8.0
      SLIRP_CONFIG_VERSION_MAX: 5
      libseccomp: 2.5.5
  swapFree: 0
  swapTotal: 0
  uptime: 0h 31m 2.00s
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries: {}
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 412300083200
  graphRootUsed: 142776213504
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Supports shifting: "true"
    Supports volatile: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 2
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 5.1.0
  Built: 1717105841
  BuiltTime: Thu May 30 16:50:41 2024
  GitCommit: 4e9486dbc63c24bfe109066abbb54d5d8dc2489e-dirty
  GoVersion: go1.22.3
  Os: linux
  OsArch: linux/amd64
  Version: 5.1.0

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

Yes

Additional environment details

No response

Additional information

No response

Luap99 commented 5 months ago

@mheon PTAL

mheon commented 5 months ago

To be certain: did this work with Podman 5.0?

lespea commented 5 months ago

I've used a setup similar to this for a while, yes. I'm not 100% sure when it started occurring but I'm pretty sure it was right after 5.1.0. When I get home from work I can try rolling back to the previous version to double check that's when the errors started.

Luap99 commented 5 months ago

The only relevant change I remember going into 5.1 is https://github.com/containers/podman/pull/22589/commits/4fd84190b8858bb3b20b994e76954e75ba31cf9c

So maybe worth to try to revert this and test then

mheon commented 5 months ago

@lespea Does this reproduce outside of Quadlet? Something like podman run --health-cmd=... --health-internal=... using the same values as your previous script?

I'm hopeful it doesn't because we have CI that should catch such things

Luap99 commented 5 months ago

This reproduces podman run --name c1 --health-cmd true --health-interval 15s --health-start-period 30s --health-startup-cmd true --health-startup-interval 5s -d quay.io/libpod/testimage:20240123 sleep inf

Our CI doesn't check for leaked transient units, the healthchecks are running fine it is just the cleanup which is failing to remove the timer

Luap99 commented 5 months ago

Reverting your change makes it work again.

Luap99 commented 5 months ago

Problems is this code https://github.com/containers/podman/blob/c510959826cdc55e6a75c40b104a9d1aa28e3632/libpod/healthcheck.go#L282-L297

Your new code only uses one field for the unit name and createTimer overwrites the startup hc with the real hc name so removeTransientFiles then removes the real hc timer and thus leaks the startup hc timer.

lespea commented 5 months ago

Ugh sorry this has been an absolute insane week. Really appreciate the fast fix/release and I can confirm that v5.1.1 works as expected!

containers / podman