containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

support User= in systemd for running rootless services #12778

Closed Gchbg closed 1 year ago

Gchbg commented 2 years ago

Is this a BUG REPORT or FEATURE REQUEST?

/kind bug

Description

I want to have a systemd system service that runs a rootless container under an isolated user, but systemd rejects the sd_notify call and terminates the service.

Got notification message from PID 15150, but reception only permitted for main PID 14978

A similar problem was mentioned in #5572, which seems to have been closed without a resolution.

Happy to help track this down.

Steps to reproduce the issue:

  1. Start with a Debian testing system. Create a system user with an empty home dir, and enable lingering:
groupadd -g 200 nginx
useradd -r -s /usr/sbin/nologin -l -b /var/lib -M -g nginx -u 200 nginx
usermod -v 165536-231071 -w 165536-231071 nginx
mkdir -m 770 /var/lib/nginx
chown nginx:nginx /var/lib/nginx
loginctl enable-linger nginx
  2. Use this unit file, adapted from podman generate systemd --new:
❯ cat /etc/systemd/system/nginx.service
[Unit]
Description=Nginx
Wants=network-online.target
After=network-online.target

[Service]
WorkingDirectory=/var/lib/nginx
User=nginx
Group=nginx
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=no
TimeoutStopSec=70
Type=notify
NotifyAccess=all
ExecStartPre=/bin/rm -f %T/%N.ctr-id
ExecStart=/usr/bin/podman run --cidfile=%T/%N.ctr-id --replace --rm -d --sdnotify=conmon --cgroups=no-conmon --name nginx nginx:mainline
ExecStop=/usr/bin/podman stop --cidfile=%T/%N.ctr-id -i
ExecStopPost=/usr/bin/podman rm --cidfile=%T/%N.ctr-id -f -i
KillMode=none

[Install]
WantedBy=default.target

❯ sudo systemctl daemon-reload
  3. Start the unit:
❯ sudo systemctl start nginx

Describe the results you received:

Jan 09 14:54:00 Cubert systemd[1]: /etc/systemd/system/nginx.service:24: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
Jan 09 14:54:00 Cubert systemd[1]: Starting Nginx...
Jan 09 14:54:00 Cubert systemd[14978]: Started podman-15150.scope.
Jan 09 14:54:00 Cubert podman[15150]: Resolving "nginx" using unqualified-search registries (/etc/containers/registries.conf)
Jan 09 14:54:00 Cubert podman[15150]: Trying to pull docker.io/library/nginx:mainline...
Jan 09 14:54:03 Cubert podman[15150]: Getting image source signatures
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a0bcbecc962ed2552e817f45127ffb3d14be31642ef3548997f58ae054deb5b2
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a2abf6c4d29d43a4bf9fbb769f524d0fb36a2edab49819c1bf3e76f409f953ea
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a9edb18cadd1336142d6567ebee31be2a03c0905eeefe26cb150de7b0fbc520b
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:589b7251471a3d5fe4daccdddfefa02bdc32ffcba0a6d6a2768bf2c401faf115
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:186b1aaa4aa6c480e92fbd982ee7c08037ef85114fbed73dbb62503f24c1dd7d
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:b4df32aa5a72e2a4316aad3414508ccd907d87b4ad177abd7cbd62fa4dab2a2f
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:589b7251471a3d5fe4daccdddfefa02bdc32ffcba0a6d6a2768bf2c401faf115
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a0bcbecc962ed2552e817f45127ffb3d14be31642ef3548997f58ae054deb5b2
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a9edb18cadd1336142d6567ebee31be2a03c0905eeefe26cb150de7b0fbc520b
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:b4df32aa5a72e2a4316aad3414508ccd907d87b4ad177abd7cbd62fa4dab2a2f
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:a2abf6c4d29d43a4bf9fbb769f524d0fb36a2edab49819c1bf3e76f409f953ea
Jan 09 14:54:03 Cubert podman[15150]: Copying blob sha256:186b1aaa4aa6c480e92fbd982ee7c08037ef85114fbed73dbb62503f24c1dd7d
Jan 09 14:54:12 Cubert podman[15150]: Copying config sha256:605c77e624ddb75e6110f997c58876baa13f8754486b461117934b24a9dc3a85
Jan 09 14:54:12 Cubert podman[15150]: Writing manifest to image destination
Jan 09 14:54:12 Cubert podman[15150]: Storing signatures
Jan 09 14:54:12 Cubert podman[15150]:
Jan 09 14:54:12 Cubert podman[15150]: 2022-01-09 14:54:12.101247642 +0200 EET m=+11.607938154 container create 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a (image=docker.io/library/nginx:mainline, name=nginx, maintainer=NGINX Docker Maintainers <docker-maint@nginx.com>, PODMAN_SYSTEMD_UNIT=nginx.service)
Jan 09 14:54:12 Cubert systemd[14978]: Started libcrun container.
Jan 09 14:54:12 Cubert podman[15150]: 2022-01-09 14:54:00.536382139 +0200 EET m=+0.043073791 image pull  nginx:mainline
Jan 09 14:54:12 Cubert systemd[1]: user@200.service: Got notification message from PID 15150, but reception only permitted for main PID 14978
Jan 09 14:54:12 Cubert podman[15150]: 2022-01-09 14:54:12.141137063 +0200 EET m=+11.647827815 container init 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a (image=docker.io/library/nginx:mainline, name=nginx, PODMAN_SYSTEMD_UNIT=nginx.service, maintainer=NGINX Docker Maintainers <docker-maint@nginx.com>)
Jan 09 14:54:12 Cubert systemd[1]: user@200.service: Got notification message from PID 15150, but reception only permitted for main PID 14978
Jan 09 14:54:12 Cubert podman[15150]: 2022-01-09 14:54:12.145611861 +0200 EET m=+11.652302766 container start 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a (image=docker.io/library/nginx:mainline, name=nginx, PODMAN_SYSTEMD_UNIT=nginx.service, maintainer=NGINX Docker Maintainers <docker-maint@nginx.com>)
Jan 09 14:54:12 Cubert podman[15150]: 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
Jan 09 14:54:12 Cubert conmon[15215]: 10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
Jan 09 14:54:12 Cubert conmon[15215]: 10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
Jan 09 14:54:12 Cubert conmon[15215]: /docker-entrypoint.sh: Configuration complete; ready for start up
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: using the "epoll" event method
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: nginx/1.21.5
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: built by gcc 10.2.1 20210110 (Debian 10.2.1-6)
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: OS: Linux 5.15.0-2-amd64
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 524288:524288
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: start worker processes
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: start worker process 26
Jan 09 14:54:12 Cubert systemd[14978]: Started podman-15271.scope.
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: signal 3 (SIGQUIT) received, shutting down
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 26#26: gracefully shutting down
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 26#26: exiting
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 26#26: exit
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: signal 17 (SIGCHLD) received from 26
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: worker process 26 exited with code 0
Jan 09 14:54:12 Cubert conmon[15215]: 2022/01/09 12:54:12 [notice] 1#1: exit
Jan 09 14:54:12 Cubert podman[15299]: 2022-01-09 14:54:12.393064442 +0200 EET m=+0.052274069 container remove 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a (image=docker.io/library/nginx:mainline, name=nginx, PODMAN_SYSTEMD_UNIT=nginx.service, maintainer=NGINX Docker Maintainers <docker-maint@nginx.com>)
Jan 09 14:54:12 Cubert podman[15271]: 7c7de83a412558d9ef53592734d3a52df9eecf331f696acfcdaac0ce33cf4c2a
Jan 09 14:54:12 Cubert systemd[14978]: podman-15150.scope: Consumed 7.547s CPU time.
Jan 09 14:54:12 Cubert systemd[1]: nginx.service: Failed with result 'protocol'.
Jan 09 14:54:12 Cubert systemd[1]: Failed to start Nginx.

Describe the results you expected:

Nginx runs until the end of time.

Output of podman version:

Version:      3.4.4
API Version:  3.4.4
Go Version:   go1.17.5
Built:        Thu Jan  1 02:00:00 1970
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: 'conmon: /usr/bin/conmon'
    path: /usr/bin/conmon
    version: 'conmon version 2.0.25, commit: unknown'
  cpus: 1
  distribution:
    distribution: debian
    version: unknown
  eventLogger: journald
  hostname: Cubert
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 200
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 200
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
  kernel: 5.15.0-2-amd64
  linkmode: dynamic
  logDriver: journald
  memFree: 1015083008
  memTotal: 2041786368
  ociRuntime:
    name: crun
    package: 'crun: /usr/bin/crun'
    path: /usr/bin/crun
    version: |-
      crun version 0.17
      commit: 0e9229ae34caaebcb86f1fde18de3acaf18c6d9a
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/200/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.0.1
      commit: 6a7b16babc95b6a3056b33fb45b74a6f62262dd4
      libslirp: 4.6.1
  swapFree: 0
  swapTotal: 0
  uptime: 8h 1m 8.23s (Approximately 0.33 days)
plugins:
  log:
  - k8s-file
  - none
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /var/lib/nginx/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/lib/nginx/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 0
  runRoot: /run/user/200/containers
  volumePath: /var/lib/nginx/.local/share/containers/storage/volumes
version:
  APIVersion: 3.4.4
  Built: 0
  BuiltTime: Thu Jan  1 02:00:00 1970
  GitCommit: ""
  GoVersion: go1.17.5
  OsArch: linux/amd64
  Version: 3.4.4

Package info (e.g. output of apt list podman):

podman/testing,now 3.4.4+ds1-1 amd64 [installed]

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

Yes and yes.

Additional environment details (AWS, VirtualBox, physical, etc.):

Machine is a VM.

mheon commented 2 years ago

This is a limitation on the systemd side. They will only accept notifications, or PID files, that are created by or sent by root, for security reasons - even if the User and Group of the unit file are explicitly set to start the process as a non-root user. Their recommendation was to start the container as a user service of the user in question via systemctl --user. There have been a few other issues about this, I'll try and dig them up.

eriksjolund commented 2 years ago

Previous discussion: https://github.com/containers/podman/discussions/9642 It contains links to some issues.

Gchbg commented 2 years ago

Thank you both. For now I've worked around it by managing the service under the user's systemd, which is clunky to say the least. I don't understand systemd's security argument: if the process is run as a given user, why would systemd not allow that user's process to send sd_notify? Who else could? But I guess this is no flaw of Podman.

#9642 mentions some code changes that need to happen to Podman for sd_notify; what are those? And have they progressed since March?

I guess you could close this issue or use it to track progress.

vrothberg commented 2 years ago

#9642 mentions some code changes that need to happen to Podman for sd_notify; what are those? And have they progressed since March?

Yes, there is some progress. The main PID is now communicated via sd_notify, but there are still some remaining issues. For instance, %t resolves to root's runtime dir even when User=foo is set.

vrothberg commented 2 years ago

I think the next big thing to tackle is finding a way to lift the User= setting. While the process in ExecStart itself is run as the specified User/Group, the systemd specifiers (e.g., %t, %U, etc.) still resolve to root's values.
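
For illustration, a minimal sketch (hypothetical unit) of that specifier behavior: in a system unit, specifiers expand against the service manager (PID 1), not against the unit's User=.

[Service]
User=nginx
ExecStart=/bin/echo %t %U

This logs "/run 0" rather than "/run/user/200 200", which is why generated paths like --cidfile=%t/%n.ctr-id end up in root-owned locations.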

Gchbg commented 2 years ago

[...] The main PID is now communicated via sd_notify [...]

But even that is rejected by systemd, as seen in the logs above.

vrothberg commented 2 years ago

I fear there's not much Podman can do at the moment.

wc7086 commented 2 years ago

Only after solving this problem can Podman become truly rootless.

So I have to keep using the root account for now.

svdHero commented 2 years ago

Is there a quick overview of what the current best approach or workaround is for starting Podman containers with systemd as a specific non-root user?

Furthermore, if a container is run as root, is there a workaround to change the ownership of files and directories created inside the container (in a bound volume) to a specific host user?

wc7086 commented 2 years ago

Furthermore, if a container is run as root, is there a workaround to change the ownership of files and directories created inside the container (in a bound volume) to a specific host user?

Use -e PUID=useruid -e PGID=usergid.

Use id username to check the UID and GID.

vrothberg commented 2 years ago

Is there a quick overview of what the current best approach or workaround is for starting Podman containers with systemd as a specific non-root user?

The services need to be started and managed as the specific non-root user. Using the User= directive does not work yet.

Gchbg commented 2 years ago

Is there a quick overview of what the current best approach or workaround is for starting Podman containers with systemd as a specific non-root user?

For the moment my workaround is to run such containers in a systemd --user. This means that for every system service I want to run as a rootless container, I need to create a separate system user, enable linger, and run a separate systemd --user instance for that user.
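
Roughly, the setup looks like this (a hedged sketch; the user name, unit name, and paths are examples, and the unit file is one generated by podman generate systemd --new):

# as root: create a dedicated user and start its user manager at boot
useradd -r -m -s /usr/sbin/nologin svc-nginx
loginctl enable-linger svc-nginx

# install the generated unit into the user manager's search path
mkdir -p ~svc-nginx/.config/systemd/user
cp container-nginx.service ~svc-nginx/.config/systemd/user/
chown -R svc-nginx: ~svc-nginx/.config

# talk to that user's systemd instance
sudo su -l svc-nginx -s /bin/sh -c 'XDG_RUNTIME_DIR="/run/user/$(id -u)" DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus" systemctl --user enable --now container-nginx.service'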

It works but it's clunky. E.g. restarting Nginx is:

sudo su -l nginx -s /bin/sh -c 'XDG_RUNTIME_DIR="/run/user/$(id -u)" DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus" systemctl --user restart nginx'

and running a command inside such a container might be something like:

sudo su -l nextcloud -s /bin/sh -c 'XDG_RUNTIME_DIR="/run/user/$(id -u)" DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/$(id -u)/bus" podman exec -u www-data -w /var/www/html nextcloud ./occ status'

Inside these rootless containers root is mapped to the system user, which is a different uid for each service. If something inside the containers runs as non-root, that gets mapped to a high-numbered host uid by default. However with some magic on the host you can map a specific non-root uid in the container to a host uid of your choice, which can then be mapped to a different non-root uid in a different container running under a different user.

I should probably document my setup one of these days...

eriksjolund commented 2 years ago

@Gchbg If you are running a recent systemd version (for instance by running Fedora 35), I think you could run

sudo systemd-run --machine=nginx@ --quiet --user --collect --pipe --wait systemctl --user restart nginx

No need to set DBUS_SESSION_BUS_ADDRESS and XDG_RUNTIME_DIR.

svdHero commented 2 years ago

@wc7086

Furthermore, if a container is run as root, is there a workaround to change the ownership of files and directories created inside the container (in a bound volume) to a specific host user?

use -e PUID=useruid -e PGID=usergid

use id username check UID and GID

Is that -e as in the podman run option --env for environment variables?

@vrothberg

Is there a quick overview of what the current best approach or workaround is for starting Podman containers with systemd as a specific non-root user?

The services need to be started and managed as the specific non-root user. Using the User= directive does not work yet.

How does that relate to what @Gchbg and @eriksjolund wrote above? Do I have to run several instances of systemd, or is there another way?

For systemd beginners like me, it is quite difficult to understand the various layers of abstraction and user permissions between systemd, host processes, and containers. It would be really helpful to have a complete example in the podman generate systemd docs that shows how to start a container or pod under a specific user at boot time.

After all, I would assume that this is the use case for 80% of users: run some container service that is restarted automatically when the machine boots and is as restricted as possible (by means of user permissions).

wc7086 commented 2 years ago

@wc7086

Furthermore, if a container is run as root, is there a workaround to change the ownership of files and directories created inside the container (in a bound volume) to a specific host user?

Use -e PUID=useruid -e PGID=usergid. Use id username to check the UID and GID.

Is that -e as in the podman run option --env for environment variables?

I got it wrong; modifying the UID and GID via environment variables requires support in the image's entrypoint.sh.

See https://docs.docker.com/engine/security/userns-remap/; most of the Docker documentation applies to Podman.
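
A minimal sketch of that entrypoint pattern (this is the PUID/PGID convention popularized by linuxserver.io images; the "app" user, the /data volume, and the su-exec helper are all assumptions about the image):

#!/bin/sh
# entrypoint.sh (sketch): remap the image's "app" user to the requested IDs,
# then drop privileges before exec'ing the real command.
PUID="${PUID:-1000}"
PGID="${PGID:-1000}"

groupmod -o -g "$PGID" app   # -o permits a non-unique GID
usermod -o -u "$PUID" app    # -o permits a non-unique UID
chown -R app:app /data       # fix ownership of the bind-mounted volume

exec su-exec app:app "$@"    # or gosu / setpriv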

grooverdan commented 2 years ago

I think the next big thing to tackle is finding a way to lift the User= setting. While the process in ExecStart itself is run as the specified User/Group, the systemd specifiers (e.g., %t, %U, etc.) still resolve to root's values.

With the %t cidfile removed in #13236, what are the remaining requirements? Does it matter if RequiresMountsFor=%t/containers uses the user's %t rather than root's?

vrothberg commented 2 years ago

So far, https://github.com/containers/podman/issues/13236 is only an issue. To be sure it's working, we need a pull request :)

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

benyaminl commented 2 years ago

[...] The main PID is now communicated via sd_notify [...]

But even that is rejected by systemd, as seen in the logs above.

For now I use tmux to run the systemd user service from rootless Podman. It works even after I detach or close the SSH connection, because it keeps the user logged in 🤣

Gchbg commented 2 years ago

For now I use tmux to run the systemd user service from rootless Podman. It works even after I detach or close the SSH connection, because it keeps the user logged in 🤣

Could you please describe this in more detail? I'm curious how it compares to my workaround.

benyaminl commented 2 years ago

For now I use tmux to run the systemd user service from rootless Podman. It works even after I detach or close the SSH connection, because it keeps the user logged in 🤣

Could you please describe this in more detail? I'm curious how it compares to my workaround.

It's just a simple workaround: by keeping tmux running I stay logged in, so the systemd user service keeps running, as simple as that. It's a silly approach, but it works for me for now.

Anyway, I think loginctl (lingering) should close this issue. I talked with folks on /r/podman, but it requires root first to allow a user service to keep running in the background after boot.

runiq commented 2 years ago

Just a quick heads-up: The commandline from https://github.com/containers/podman/issues/12778#issuecomment-1026163660:

sudo systemd-run --machine=nginx@ --quiet --user --collect --pipe --wait systemctl --user restart nginx

can be simplified to:

sudo systemctl --user -M nginx@ restart nginx

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

rhatdan commented 2 years ago

This looks like it is fixed in the current release. Reopen if I am mistaken.

jklaiho commented 2 years ago

@rhatdan I've been following this issue and just installed Podman 4.2.0 from source on an Ubuntu 22.04 system. The discussion went to a lot of places, but: should User= and Group= actually work now? I couldn't manage to get them working.

As a minimal example, I'm using the hashicorp/http-echo image. Here's the unit file generated for it using podman generate systemd --name --new --no-header --container-prefix "" hw, saved as /etc/systemd/system/hw.service:

[Unit]
Description=Podman gallant_mahavira.service
Documentation=man:podman-generate-systemd(1)
Wants=network-online.target
After=network-online.target
RequiresMountsFor=%t/containers

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
ExecStartPre=/bin/rm -f %t/%n.ctr-id
ExecStart=/usr/bin/podman run \
     --cidfile=%t/%n.ctr-id \
     --cgroups=no-conmon \
     --rm \
     --sdnotify=conmon \
     -d \
     -p 5678:5678 hashicorp/http-echo "-text=hello world"
ExecStop=/usr/bin/podman stop --ignore --cidfile=%t/%n.ctr-id
ExecStopPost=/usr/bin/podman rm -f --ignore --cidfile=%t/%n.ctr-id
Type=notify
NotifyAccess=all

[Install]
WantedBy=default.target

This runs fine, with the cosmetic issue that it doesn't stop on SIGTERM and systemd has to kill it with SIGKILL after a short delay. I don't know if that has anything to do with Podman. (Something to look at, perhaps.)

If I add a non-root User= and Group= to the [Service] section and change nothing else, the service fails because that user doesn't have permission to access files under %t, which points to /run. This is to be expected.

If I then get rid of %t and instead change the --cidfile path (and all references to it) to /tmp/%n.ctr-id, I'd expect this to work since now we have write access, but it doesn't. What happens is this:

$ systemctl start hw  # a pause of 15-20 seconds follows before the below msg
Job for hw.service failed because the service did not take the steps required by its unit configuration.
See "systemctl status hw.service" and "journalctl -xeu hw.service" for details.

$ journalctl -u hw.service  # timestamps/hostname cleaned up from output
systemd[1]: Starting Podman gallant_mahavira.service...
systemd[1]: hw.service: Failed with result 'protocol'.
systemd[1]: Failed to start Podman gallant_mahavira.service.
systemd[1]: hw.service: Scheduled restart job, restart counter is at 1.
systemd[1]: Stopped Podman gallant_mahavira.service.
systemd[1]: Starting Podman gallant_mahavira.service...
systemd[1]: hw.service: Failed with result 'protocol'.
systemd[1]: Failed to start Podman gallant_mahavira.service.
...etc, a lot of restart attempts before I finally call systemctl stop hw...

Not reopening, since I'm not sure if this is still the same issue, or a new one. What things (supposedly) work with v4.2.0 in the context of this issue that didn't work in v4.1.1?

mheon commented 2 years ago

I think this would have to be a systemd change, not a Podman change. This may need to be reopened, as I'm unaware of any related fix merging in systemd.

vrothberg commented 2 years ago

Let's reopen, this is still not working.

rprodan commented 2 years ago

Any progress on this ticket?

vrothberg commented 2 years ago

No. This needs to be done on the systemd side and I am not sure there is a ticket.

rprodan commented 2 years ago

No. This needs to be done on the systemd side and I am not sure there is a ticket.

In this case, rootless containers started with Podman can then only be run as a systemd user service?

vrothberg commented 2 years ago

In this case, rootless containers started with Podman can then only be run as a systemd user service?

Yes. It works just fine using systemctl --user ..., which will run the service/unit as the current user.

But using the User= primitive will not work.

vrothberg commented 2 years ago

I changed the title to better reflect the issue.

yangm97 commented 2 years ago

@vrothberg let's aim higher, DynamicUser=true support 🙂

coandco commented 2 years ago

Wouldn't DynamicUser=true mean having to redownload the container image every time the service is started?

rhatdan commented 2 years ago

If you set up the image in an additional image store with the correct permissions, you would not need to.
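
For reference, a sketch of that configuration in /etc/containers/storage.conf (the store path is an example; the read-only additional store must be populated ahead of time and be readable by the service users):

[storage.options]
additionalimagestores = [
  "/usr/lib/containers/storage",
]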

yangm97 commented 2 years ago

One of the ways it could be done is by also setting StateDirectory=.
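
An untested sketch of that direction (unit name and paths are hypothetical; DynamicUser= allocates a transient user at start and StateDirectory= gives it a writable /var/lib subtree, but whether Podman can actually run this way is exactly what is unresolved here):

[Service]
DynamicUser=yes
StateDirectory=mysvc
Environment=HOME=/var/lib/mysvc
ExecStart=/usr/bin/podman --root /var/lib/mysvc/storage run --rm --name mysvc docker.io/library/nginx:mainline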

sjpb commented 2 years ago

"Interestingly" User= directives work fine in RockyLinux 8.6 when podman's using cgroups v1 (which is the default in RL8.6). Something like[^1] the below appears to work perfectly as far as I can tell - note podman is an unprivileged user. As you can see it was generated by an older podman generate systemd; newer versions seem to add a lot of --cidfile things which are broken anyway when rootless but seem to be to be unnecessary given we know/define the container name (or maybe I've misunderstood that)?.

However, if I change this system to use cgroups v2, the same unit file fails with the error message from the first message above. Which is problematic, as another container needs cgroups v2.

# mysql.service

[Unit]
Description=Podman container mysql.service
Documentation=man:podman-generate-systemd(1)
Wants=network.target
After=network-online.target
RequiresMountsFor=/var/lib/state/mysql /etc/sysconfig/mysqld

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=always
ExecStart=/usr/bin/podman run \
    --sdnotify=conmon --cgroups=no-conmon \
    --detach --replace --name mysql --restart=no \
    --user mysql \
    --volume /var/lib/state/mysql:/var/lib/mysql:U \
    --publish 3306:3306 \
    -e MYSQL_ROOT_PASSWORD=${MYSQL_INITIAL_ROOT_PASSWORD} \
    mysql:8.0.30
ExecStop=/usr/bin/podman stop --ignore mysql -t 10
ExecStopPost=/usr/bin/podman rm --ignore -f mysql
SuccessExitStatus=143 SIGTERM
KillMode=none
Type=notify
NotifyAccess=all
LimitNOFILE=65536
LimitMEMLOCK=infinity
User=podman
Group=podman
TimeoutStartSec=180

[^1]: I've removed some mysql options and stuff to do with volume mount points

NickSica commented 2 years ago

"Interestingly" User= directives work fine in RockyLinux 8.6 when podman's using cgroups v1 (which is the default in RL8.6). Something like1 the below appears to work perfectly as far as I can tell - note podman is an unprivileged user. As you can see it was generated by an older podman generate systemd; newer versions seem to add a lot of --cidfile things which are broken anyway when rootless but seem to be to be unnecessary given we know/define the container name (or maybe I've misunderstood that)?.

However if I change this system to use cgroups v2 then the same unit file fails with the error message from the 1st message above. Which is problematic, as another container needs cgroups v2.

# mysql.service

[Unit]
Description=Podman container mysql.service
Documentation=man:podman-generate-systemd(1)
Wants=network.target
After=network-online.target
RequiresMountsFor=/var/lib/state/mysql /etc/sysconfig/mysqld

[Service]
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=always
ExecStart=/usr/bin/podman run \
    --sdnotify=conmon --cgroups=no-conmon \
    --detach --replace --name mysql --restart=no \
    --user mysql \
    --volume /var/lib/state/mysql:/var/lib/mysql:U \
    --publish 3306:3306 \
    -e MYSQL_ROOT_PASSWORD=${MYSQL_INITIAL_ROOT_PASSWORD} \
    mysql:8.0.30
ExecStop=/usr/bin/podman stop --ignore mysql -t 10
ExecStopPost=/usr/bin/podman rm --ignore -f mysql
SuccessExitStatus=143 SIGTERM
KillMode=none
Type=notify
NotifyAccess=all
LimitNOFILE=65536
LimitMEMLOCK=infinity
User=podman
Group=podman
TimeoutStartSec=180

Footnotes

  1. I've removed some mysql options and stuff todo with volume mount points

Do you get inotify and dbus-daemon errors with this? Mine runs but spits those kinds of errors out.

sjpb commented 2 years ago

Yes! But they don't seem to affect anything.

hmoffatt commented 1 year ago

This is a limitation on the systemd side. They will only accept notifications, or PID files, that are created by or sent by root, for security reasons - even if the User and Group of the unit file are explicitly set to start the process as a non-root user. Their recommendation was to start the container as a user service of the user in question via systemctl --user. There have been a few other issues about this, I'll try and dig them up.

This doesn't seem quite right; I used User= in a test service using python-sdnotify (https://github.com/bb4242/sdnotify) and it launched as the specified user and systemd received the notification OK.

However, when I run Podman 4.3.1 in such a case, I too get the permission error:

Dec 05 23:05:20 bullseye systemd[1]: Starting <name>...
Dec 05 23:05:20 bullseye systemd[375]: Started podman-7639.scope.
Dec 05 23:05:20 bullseye systemd[1]: user@1002.service: Got notification message from PID 7639, but reception only permitted for main PID 375
Dec 05 23:05:20 bullseye podman[7639]: b121039d9c1573e73a97e6ce2d904803e9cc5d55a0c5696dcaee850cb599fc6e

I'm not sure what triggers the Started podman-7639.scope line, but I don't see it with the Python test service.

miwagner1 commented 1 year ago

Thoughts on this unit file?

[Unit]
Description=Emby Podman Container
BindsTo=user@1012.service
After=user@1012.service

[Service]
User=emby
Group=media
Restart=on-failure
ExecStartPre=/usr/bin/rm -f /home/emby/%n-pid /home/emby/%n-cid
ExecStartPre=-/usr/bin/podman rm emby
ExecStart=/usr/bin/podman run --conmon-pidfile /home/emby/%n-pid --cidfile /home/emby/%n-cid \
          --name=emby --rm --cgroup-manager=systemd \
          -e TZ="$TZ" \
          -p 8096:8096 -p 8920:8920 \
          -v /opt/docker/storage/emby:/config \
          -v /media/media/:/media \
          emby/embyserver
ExecStop=/usr/bin/sh -c "/usr/bin/podman rm -f `cat /home/emby/%n-cid`"
KillMode=none
Type=forking
PIDFile=/home/emby/%n-pid

[Install]
WantedBy=multi-user.target

From https://unix.stackexchange.com/questions/590347/run-systemd-service-as-another-user-with-logind-session

vrothberg commented 1 year ago

I'm not sure what triggers Started podman-7639.scope, but I don't see this in the Python test service.

Most likely because of Podman joining a user namespace.

eriksjolund commented 1 year ago

I get the same result as @hmoffatt .

Here is a minimal test of sd_notifyf() on Fedora 37 with systemd 251. (Podman was not used in the test.)

Test result: systemd receives the notification without problem even if it comes from a process that is not running as root.

Details

  1. sudo dnf install systemd-devel gcc
  2. sudo useradd test1
  3. Create the file /etc/systemd/system/a.service with these contents:
    [Service]
    User=test1
    Group=test1
    ExecStart=/usr/local/bin/testnotify
    Type=notify
  4. Create the file /tmp/main.c with these contents:
    
    #include <unistd.h>
    #include <systemd/sd-daemon.h>
    int main() {
      sleep(10);
      sd_notifyf(0, "READY=1\n"
                   "STATUS=Processing requests...\n"
                   "MAINPID=%lu",
                   (unsigned long) getpid());
      sleep(3600);
      return 0;
    }
  5. cd /tmp && gcc -o testnotify main.c -l systemd
  6. sudo cp /tmp/testnotify /usr/local/bin/testnotify
  7. sudo chmod 755 /usr/local/bin/testnotify
  8. sudo systemctl daemon-reload
  9. sudo systemctl start a.service

Meanwhile in another shell

# systemctl status a.service
● a.service
     Loaded: loaded (/etc/systemd/system/a.service; static)
     Active: activating (start) since Sat 2023-03-18 12:55:30 CET; 3s ago
   Main PID: 6952 (testnotify)
      Tasks: 1 (limit: 8716)
     Memory: 300.0K
        CPU: 2ms
     CGroup: /system.slice/a.service
             └─6952 /usr/local/bin/testnotify

Mar 18 12:55:30 asus systemd[1]: Starting a.service...
# systemctl status a.service
● a.service
     Loaded: loaded (/etc/systemd/system/a.service; static)
     Active: active (running) since Sat 2023-03-18 12:55:40 CET; 87ms ago
   Main PID: 6952 (testnotify)
     Status: "Processing requests..."
      Tasks: 1 (limit: 8716)
     Memory: 300.0K
        CPU: 3ms
     CGroup: /system.slice/a.service
             └─6952 /usr/local/bin/testnotify

Mar 18 12:55:30 asus systemd[1]: Starting a.service...
Mar 18 12:55:40 asus systemd[1]: Started a.service.
# 

For the first 10 seconds the service is in the activating state, but then it is active.

eriksjolund commented 1 year ago

Maybe the new systemd directive OpenFile= could be used to circumvent the ownership problem with PID files?

I mentioned the same idea here: https://github.com/containers/podman/discussions/17789#discussioncomment-5352418
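
For reference, OpenFile= (available since systemd 253) makes the manager, running as root, open a file and pass it to the service as an inherited file descriptor, so the unprivileged service never has to create a root-trusted PID file itself. A hypothetical sketch (path and fd name are made up; the service would read the fd via sd_listen_fds_with_names()):

[Service]
User=test
OpenFile=/run/myservice/conmon.pid:conmon-pid:read-only
ExecStart=/usr/local/bin/myservice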

vrothberg commented 1 year ago

Thanks for sharing! That's definitely worth exploring. Did somebody test it already?

ghost commented 1 year ago

Hello @vrothberg 👋

I tested the OpenFile directive without success.

As @hmoffatt and @eriksjolund mentioned, it seems to be possible to notify with very simple examples. This simple unit does work OK (systemd 253):

[Service]
User=nobody
ExecStart=sh -c "sleep 1 && systemd-notify --ready"
Type=notify

Can you be more specific about why it's an issue with Podman? Is it because of the forking?

vrothberg commented 1 year ago

Can you be more specific about why it's an issue with Podman? Is it because of the forking?

See https://github.com/containers/podman/issues/12778#issuecomment-1008945410.

eriksjolund commented 1 year ago

Can you be more specific about why it's an issue with Podman? Is it because of the forking?

It seems that part of the problem is setting the conmon PID as the MAINPID.

Quote from the systemd Git commit:

Let's be more restrictive when validating PID files and MAINPID=
messages: don't accept PIDs that make no sense, and if the configuration
source is not trusted, don't accept out-of-cgroup PIDs. A configuration
source is considered trusted when the PID file is owned by root, or the
message was received from root.

I tried to use OpenFile= to set MAINPID in a test (without using Podman), but it didn't work. (In the logs there was no mention of the MAINPID being read from the file.) Some files related to the test: https://github.com/eriksjolund/test-systemd-mainpid-openfile/

Then I tried another test (also without using Podman) where I managed to set the MAINPID by using ExecStartPost with a leading + before the path to the executable. (Such a command is run as root.)

ExecStartPost=+/usr/bin/mytest_notifymainpid

mytest_notifymainpid source code contains

    std::string msg = std::format("MAINPID={}\n", mainpid);
    sd_pid_notify(senderpid, 0, msg.c_str());

senderpid is the PID of the program that I started with:

ExecStart=/usr/bin/mytest_notifyready_and_then_sleep

An untested idea: let Podman send READY=1 and then wait for the program in ExecStartPost (/usr/bin/mytest_notifymainpid) to finish before continuing. (Waiting could maybe be achieved with some sort of trigger file.)

Output from journalctl

Jun 06 08:18:32 localhost.localdomain systemd[1]: test2.service: Got notification message from PID 4820 (MAINPID=4839, READY=1)
Jun 06 08:18:32 localhost.localdomain systemd[1]: test2.service: New main PID 4839 does not belong to service, but we'll accept it as the request to change it came from a privileged process.
Jun 06 08:18:32 localhost.localdomain systemd[1]: test2.service: Supervising process 4839 which is not our child. We'll most likely not notice when it exits.

eriksjolund commented 1 year ago

It seems to work.

I tried out an echo server that listens on TCP port 908.

$ echo hello | socat  -t 60 - tcp4:127.0.0.1:908
hello
$

The echo server replied hello.

The file /etc/systemd/system/echo.socket contains:

[Unit]
Description=echo server

[Socket]
ListenStream=0.0.0.0:908

[Install]
WantedBy=default.target

The port number is smaller than 1024. An unprivileged user does not normally have the privileges to listen on such a port, as I didn't modify /proc/sys/net/ipv4/ip_unprivileged_port_start:

$ cat /proc/sys/net/ipv4/ip_unprivileged_port_start
1024

The file /etc/systemd/system/echo.service contains:

[Unit]
Description=Podman container-echo.service
Wants=network-online.target
After=network-online.target
#RequiresMountsFor=%t/containers

[Service]
PAMName=login
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
TimeoutStopSec=70
User=test
ExecStart=/usr/bin/podman run \
        --cidfile=/var/tmp/%n.ctr-id \
        --conmon-pidfile /var/tmp/conmon-pidfile \
        --cgroups=no-conmon \
        --rm \
        --sdnotify=conmon \
        --replace \
        --name echo \
        --network none ghcr.io/eriksjolund/socket-activate-echo
ExecStartPost=+/var/tmp/notify-mainpid /var/tmp/conmon-pidfile
ExecStop=/usr/bin/podman stop \
        --ignore -t 10 \
        --cidfile=/var/tmp/%n.ctr-id
ExecStopPost=/usr/bin/podman rm \
        -f \
        --ignore -t 10 \
        --cidfile=/var/tmp/%n.ctr-id
Type=notify
NotifyAccess=all

[Install]
WantedBy=default.target

A summary of the proof-of-concept demo

As soon as libpod/container_internal.go has sent READY=1, systemd will start the executable /var/tmp/notify-mainpid as root, because the service was configured with

ExecStartPost=+/var/tmp/notify-mainpid /var/tmp/conmon-pidfile

notify-mainpid is a little program I wrote as a proof of concept:

#include <systemd/sd-daemon.h>
#include <cstdio>
#include <cstdlib>
#include <format>
#include <fstream>
#include <string>

int main(int argc, char *argv[]) {
  if (argc != 2) {
    fprintf(stderr, "error: incorrect number of arguments\n");
    return 1;
  }
  // argv[1] is the conmon PID file; read the PID that should become MAINPID.
  std::ifstream mainpid_stream(argv[1]);
  pid_t mainpid;
  mainpid_stream >> mainpid;
  // SYSTEMD_EXEC_PID identifies the podman process (the current MAINPID).
  char *podmanpidstr = getenv("SYSTEMD_EXEC_PID");
  pid_t podmanpid = atoi(podmanpidstr);
  // Send the notification on behalf of the podman process, telling systemd
  // that MAINPID should be the conmon PID.
  std::string msg = std::format("MAINPID={}\nREADY=1", mainpid);
  sd_pid_notify(podmanpid, 0, msg.c_str());
  return 0;
}

notify-mainpid sends a notification message on behalf of the podman process (the current MAINPID) and notifies systemd that MAINPID should be equal to the conmon PID.

I created a branch, https://github.com/eriksjolund/podman/tree/issue-12778-proof-of-concept-sdnotify-conmon, where I put the code. There is room for a lot of improvement, for example replacing the racy 5-second-delay solution with something else.

This demo was for --sdnotify=conmon. Doing somthing similar for --sdnotify=container will be more complicated as more synchronization will have to take place. Maybe OpenFile= could be used to improve security and synchronization.