containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Quadlet: pod fails to start, but unit is reported as online #20667

ItalyPaleAle opened this issue 7 months ago (status: Open)

Issue Description

I have a Quadlet unit which starts a pod (for Traefik). The pod's spec contains a port that binds to a specific network interface on the host.
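
For illustration, the relevant pieces of such a setup might look like the sketch below (the file names and the address 192.0.2.10 are placeholders, not the reporter's actual configuration; hostIP is the standard Kubernetes ports field, which podman kube play honors for interface-specific bindings):

# /etc/containers/systemd/traefik.kube (hypothetical)
[Kube]
Yaml=traefik.yml

# traefik.yml (excerpt): hostIP pins the published port to one interface
ports:
- containerPort: 80
  hostPort: 80
  hostIP: 192.0.2.10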

Sometimes, systemd tries to start the Quadlet unit before the network interface is ready. The pod fails to start, but systemd reports the unit as active anyway.

Because systemd reports the unit as active, it is not restarted automatically.
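
For context, systemd's Restart= logic only triggers when a unit enters the failed state, so a directive like the following (a typical [Service] setting, shown for illustration rather than quoted from the reporter's configuration) has no effect while the unit is wrongly considered active:

[Service]
Restart=on-failure
RestartSec=5s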

Steps to reproduce the issue

  1. Create a Quadlet .kube unit whose pod spec publishes a port that binds to a specific network interface
  2. While the network interface isn't ready, start the systemd unit
  3. The pod fails to start, but the unit is reported as active

Describe the results you received

Output of systemctl status traefik.service:

● traefik.service - Traefik service
     Loaded: loaded (/etc/containers/systemd/traefik.kube; generated)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: active (running) since Sun 2023-11-12 18:39:16 UTC; 4h 43min ago
   Main PID: 1223 (conmon)
      Tasks: 1 (limit: 18798)
     Memory: 2.4M
        CPU: 571ms
     CGroup: /system.slice/traefik.service
             └─1223 /usr/bin/conmon --api-version 1 -c 3ea92eda1f949d138d4163ee3560e685430d584309d51ac26e627edce5b77760 -u 3ea92eda1f949d138d4163ee3560e685430d584309d51ac26e627edce5b77760 -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/3ea92eda1f949d138d4163ee3560e685430d584309d51ac26e627edce5b77760/userdata -p /run/containers/storage/overlay-containers/3ea92eda1f949d138d4163ee3560e685430d584309d51ac26e627edce5b77760/userdata/pidfile -n e099b62746ef-service --exit-dir /run/libpod/exits --full-attach -s -l k8s-file:/var/lib/containers/storage/overlay-containers/3ea92eda1f949d138d4163ee3560e685430d584309d51ac26e627edce5b77760/userdata/ctr.log --log-level warning --syslog --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/run/containers/storage/overlay-containers/3ea92eda1f949d138d4163ee3560e685430d584309d51ac26e627edce5b77760/userdata/oci-log --conmon-pidfile /run/containers/storage/overlay-containers/3ea92eda1f949d138d4163ee3560e685430d584309d51ac26e627edce5b77760/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg warning --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --network-config-dir --exit-command-arg "" --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /var/lib/containers/storage/volumes --exit-command-arg --db-backend --exit-command-arg boltdb --exit-command-arg --transient-store=false --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 3ea92eda1f949d138d4163ee3560e685430d584309d51ac26e627edce5b77760

Nov 12 18:39:16 pistacchio traefik[1052]: Pod:
Nov 12 18:39:16 pistacchio traefik[1052]: db62e6224ebd58bda991eb4d118ed4eb80fe2a1e938ac2447cb0ac091084f766
Nov 12 18:39:16 pistacchio traefik[1052]: Containers:
Nov 12 18:39:16 pistacchio traefik[1052]: f3970a296e8a731c652692aba704e06804bc0ace01cbe13bf62c945224a8a6b7
Nov 12 18:39:16 pistacchio traefik[1052]: 2d4a37c9bc108da06af13d3b34252994cf57b4b282cc6a4e5fe0caa2dfb2d52e
Nov 12 18:39:16 pistacchio traefik[1052]: starting container 61dfa2529de56741a870b6fcdc154723b7c079c9b11de8e1c5809df347cbfafe: cannot listen on the TCP port: listen tcp4 xx.xx.xx.xx:80: bind: cannot assign requested address
Nov 12 18:39:16 pistacchio traefik[1052]: starting container 2d4a37c9bc108da06af13d3b34252994cf57b4b282cc6a4e5fe0caa2dfb2d52e: a dependency of container 2d4a37c9bc108da06af13d3b34252994cf57b4b282cc6a4e5fe0caa2dfb2d52e failed to start: container state improper
Nov 12 18:39:16 pistacchio traefik[1052]: starting container f3970a296e8a731c652692aba704e06804bc0ace01cbe13bf62c945224a8a6b7: a dependency of container f3970a296e8a731c652692aba704e06804bc0ace01cbe13bf62c945224a8a6b7 failed to start: container state improper
Nov 12 18:39:16 pistacchio traefik[1052]: Error: failed to start 3 containers
Nov 12 18:39:16 pistacchio systemd[1]: Started traefik.service - Traefik service.

Describe the results you expected

If the pod fails to start, the unit should be in a failed state, so systemd can try restarting the pod.
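
Until that is fixed, one possible mitigation (not part of the original report; standard systemd ordering, assuming a wait-online service such as NetworkManager-wait-online.service is enabled) is to delay the unit until the network is configured, by adding to the Quadlet file:

[Unit]
Wants=network-online.target
After=network-online.target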

podman info output

- podman version 4.7.0
- Fedora CoreOS Stable amd64 (current version: 38.20231027.3.2)
- Systemd version 253.12-1.fc38

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

No

Additional environment details

No response

Additional information

No response

ygalblum commented 7 months ago

@ItalyPaleAle thanks for reporting this issue.

I was able to reproduce it in a simpler way and will look further into it. I think the root cause is not specific to Quadlet but rather lies in podman kube play; I need to investigate further.

For reference, I was able to reproduce this issue in the following way.

Manually run nginx and publish container port 80 on host port 8000:

podman run --name manual-nginx -d --rm -p 8000:80 docker.io/library/nginx:latest

Generate the nginx.yml:

podman generate kube manual-nginx > nginx.yml

Edit the file to update the pod and container names.

# Save the output of this file and use kubectl create -f to import
# it into Kubernetes.
#
# Created with podman-4.7.2
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  containers:
  - args:
    - nginx
    - -g
    - daemon off;
    image: docker.io/library/nginx:latest
    name: nginx
    ports:
    - containerPort: 80
      hostPort: 8000

Save nginx.yml under ~/.config/containers/systemd.

Create an nginx.kube file:

[Kube]
Yaml=nginx.yml

Reload the daemon:

systemctl --user daemon-reload
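
At this point Quadlet has generated a transient nginx.service from the .kube file; the generated unit can be inspected with the standard command:

systemctl --user cat nginx.service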

Start the service:

systemctl --user start nginx.service

Check its status:

systemctl --user status nginx.service

You can see that the unit is reported as running, but the containers failed to start.
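
To confirm the mismatch from Podman's side, listing the containers shows they are not actually running (standard command; the exact output will vary):

podman ps -a --pod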

ygalblum commented 7 months ago

As I suspected, the root cause of this issue is in kube play. According to this, Podman notifies READY=1 regardless of whether the containers started successfully. As a result, systemd thinks the service is running while in fact it has failed.
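
A sketch of how the notification could instead be gated on startup success (illustrative Go using the github.com/coreos/go-systemd/v22/daemon package, which Podman itself links; this shows the pattern, not Podman's actual code path):

package main

import (
	"fmt"

	"github.com/coreos/go-systemd/v22/daemon"
)

// startAndNotify sends sd_notify READY=1 only if starting the workload
// succeeded. In the bug described above, READY=1 is effectively sent
// unconditionally, so systemd marks the unit active even though the
// containers never started.
func startAndNotify(start func() error) error {
	if err := start(); err != nil {
		// Propagate the error instead of notifying readiness, so systemd
		// sees the unit fail and can apply its Restart= policy.
		return fmt.Errorf("not notifying systemd: %w", err)
	}
	// SdNotify sends READY=1 over NOTIFY_SOCKET; the discarded boolean
	// reports whether a notification was actually sent.
	_, err := daemon.SdNotify(false, daemon.SdNotifyReady)
	return err
}

func main() {
	// Simulate the bind failure seen in the logs above.
	err := startAndNotify(func() error {
		return fmt.Errorf("cannot listen on the TCP port: bind: cannot assign requested address")
	})
	if err != nil {
		fmt.Println("startup failed:", err)
	}
}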

In addition, I can see that the service container is still running and conmon does not return.

@vrothberg I remember that https://github.com/containers/podman/pull/18671 aimed to address such cases. But could it be that it fails to do so when the containers fail to start?

vrothberg commented 7 months ago

@ygalblum that sounds plausible. I am a bit underwater at the moment and cannot find time to debug.