Better settings for running podman containers under systemd

alexlarsson commented 2 years ago

I'm interested in getting the best possible integration between systemd and podman when running containers from "podman run" in the ExecStart line of a systemd unit. This affects multiple things, for instance the output of podman systemd generate --new, the output of quadlet, and the template service unit file for play kube support.

My goals are:

Make the unit file easy to read (minimal copy/paste scaffolding)
Make startup efficient (minimal number of execs, etc)
Robust shutdowns in case of races
Allow systemd maximum insight into the container

So, here are a list of things I think we could do better:

Use --cgroup=split (currently unset in kube template, and set to no-conmon in systemd generate).

For example, if I use systemd generate --new to create foo.service i get this from systemd-cgls:

-.slice
├─system.slice (#60)
| | ...
│ ├─foo.service (#12748)
│ │ → trusted.invocation_id: 1be92eda65e44d76a51b800e1327f4a5
│ │ └─ 6720 /usr/bin/conmon --api-version 1 -c 5d476e3704141d03c898fa0d7f81e688…
| ...
└─machine.slice (#1022)
  → trusted.invocation_id: 8c0b9e872e674c41b0aa10d70c29f2d5
  └─libpod-5d476e3704141d03c898fa0d7f81e688c2c64caa5d66fa2b50433ea884b6fdf3.scope … (#12822)
    → trusted.delegate: 1
    → trusted.invocation_id: 719df90757144078b3c78349b94bc1d6
    └─container (#12889)
      └─ 6723 sleep 100000

Whereas with quadlet (that uses cgroup=split) i get:

-.slice
├─system.slice (#60)
| | ...
│ ├─quadlet-minimal.service … (#21605)
│ │ → trusted.delegate: 1
│ │ → trusted.invocation_id: 80fd019d600647a1af14bf89f943e64f
│ │ ├─libpod-payload-a2fb81d0d42259207831fa764c38e4e4d7f3ef345e52cf423a55596cac1b356b (#21739)
│ │ │ ├─ 13365 /run/podman-init -- sleep 60
│ │ │ └─ 13387 sleep 60
│ │ └─runtime (#21672)
│ │   └─ 13360 /usr/bin/conmon --api-version 1 -c a2fb81d0d42259207831fa764c38e…
| ...

In this setup the cgroups are inside the systemd created cgrops which means that systemd is aware of all the processes running in the container and can properly manage them. It also means systemd-originating cgroup limits are correctly applied.

For this to work properly, the systemd unit should specify Delegate=yes, as per https://systemd.io/CGROUP_DELEGATION/.

Default to --sdnotify=conmon

This means that we don't report success to systemd until we actually spawn the container. This is already enabled in systemd generate, but not currently in the kubernetes template. However there needs to be a way to override this to --sdnotify=container in case your container itself supports sdnotify.

Robust shutdown

No matter how the container is shut down we need things to be robust and not leave any state around that is picked up by e.g. "podman ps". Systemd itself is generally not a problem here because it gets told by the kernel (via cgroups, if set up correctly) exactly what the current state is.

This is what systemd generate does for this (stripped down):

ExecStartPre=/bin/rm -f %t/%n.ctr-id
ExecStart=/usr/bin/podman run --cidfile=%t/%n.ctr-id \--rm -d fedora sleeep 999
ExecStop=/usr/bin/podman stop --ignore --cidfile=%t/%n.ctr-id
ExecStopPost=/usr/bin/podman rm -f --ignore --cidfile=%t/%n.ctr-id
Type=notify
NotifyAccess=all

Whereas quadlet does (stripped down):

ExecStartPre=-rm -f %t/%N.cid
ExecStart=/usr/bin/podman run --cidfile=%t/%N.cid --name=systemd-%N --replace --rm -d fedora sleep 999
ExecStopPost=-/usr/bin/podman rm -f -i --cidfile=%t/%N.cid
ExecStopPost=-rm -f %t/%N.cid
KillMode=mixed
Type=notify
NotifyAccess=all

See the "Robust container shutdown" section in https://github.com/containers/quadlet/blob/main/docs/ContainerSetup.md for more details of why I chose to set it up like this.

We can argue about the details, but one thing I'd like is for this to be less open-coded. By this I mean, we should not have to have an ExecStartPre to remove the .cid file, we should just get the right behaviour automatically by the podman run command. Same with the two StopPost commands quadlet has, it should just be combined into one command.

Logging

Quadlet currently uses --log-driver journald, since if you are using systemd it is likely that you want to use the journal and not podman logs. However, since https://github.com/containers/podman/pull/11390 we have access to --log-driver passthrough, which is an even better fit, and should always be used.

Default to using --init, or at least make it easy to enable

Most services actually don't properly function as pid1, and there is a large risk for things not properly reaping grandparents that die, so we collect a bunch of zombies. catatonit is really really small, having an instance of it in the container is not going to be a problem.

Various other things

Set SyslogIndentifier, because the default is otherwise the main exec name, which is always "podman" and this is not great.

Quadlet defaults to --pull=never, because it is maybe not great to start doing network pulls each time we boot. Not sure about this, it goes with the auto-update stuff.

Quadlets always sets up /tmp to a per-process tmpfs to mirror what typically happens with system services. Not sure if this is generally the right thing to do either.

vrothberg commented 2 years ago

Thanks, @alexlarsson! Let's put our heads together soon. I think we can use the PODMAN_SYSTEMD_UNIT env variable to automatically set these things.

Luap99 commented 2 years ago

Logging

Quadlet currently uses --log-driver journald, since if you are using systemd it is likely that you want to use the journal and not podman logs. However, since #11390 we have access to --log-driver passthrough, which is an even better fit, and should always be used.

This makes podman logs and attach non functional. We should not break such functionality by default. Also there doesn't seems to be a single passthrough test in our CI!!!

alexlarsson commented 2 years ago

@Luap99 "By default" I mean when running a container under systemd. It seems more likely in that case that you want to use the logging setup that you do for other types of systemd units. And passthrough has real advantages in this case, like the journal can look at the actual peer pid and see what the source of the log messages are, rather than saying "conmon" sent everything.

Luap99 commented 2 years ago

@Luap99 "By default" I mean when running a container under systemd. It seems more likely in that case that you want to use the logging setup that you do for other types of systemd units.

I am aware of that but I disagree. I run all my containers via systemd, yet I use podman logs everytime I have to read logs and not some journalctl command. I would argue that podman users are used to podman logs regardless how you run the container.

alexlarsson commented 2 years ago

I don't think that is the best default though, as it will perform worse and integrate worse with systemd. However, whichever way is the default it should be easy to pick the other.

vrothberg commented 2 years ago

I don't think that is the best default though, as it will perform worse and integrate worse with systemd. However, whichever way is the default it should be easy to pick the other.

Yes, it should be easy to chose. generate systemd can remain as is but the new generator can default to passthrough. Adding tests for that is crucial. Thanks for highlighting that, @Luap99 !

vrothberg commented 2 years ago

I am going to tackle a couple of issues now. @alexlarsson, please let me know when you progressed porting Quadlet. Ultimately, I want Quadlet and generate systemd to share the generator. There are still some community PRs open that touch pkg/systemd/generate that I want to wait for.

Cleaning up pkg/systemd/generate won't hurt.

alexlarsson commented 2 years ago

I almost have an initial port of quadlet done, give me a day more.

alexlarsson commented 1 year ago

So, here is the initial straight port of the C quadlet to golang: https://github.com/containers/quadlet/pull/41

It passes the same testcase, although I haven't (yet) actually tried it as a system generator. Its somewhat large, at 1.7M (stripped) vs the C one (63K), but the C one links against 2.3M libc and 2.3M other libs, so the total loaded size depends on what other stuff in the system is using.

The next step is trying to figure out how this best goes into podman, and how the two can be moved closer to each other.

What I would like is to extend this code to also support using a yaml file for specifying the podman commandline instead of the [container] group in the file. Either by having a yaml=path in the .container file, or by having a yaml file with the same name next to the .container file.

Other nice goals is to reuse code between this and the other systemd generators. Maybe the next step is for @vrothberg to have a look at this code and see what makes sense?

On top of that, the current generated podman cmdline is just what quadlet does. As discussed in this issue we might want to add features to core podman to make the generated code less complex.

I also think that we might want to reconsider some defaults in the quadlet conversion to make it match more what e.g. systemd-generate does. For example, defaulting to remapping users was maybe not a great idea.

vrothberg commented 1 year ago

Thanks, Alex! I will be traveling next week. Let's put our heads together the following week with @rhatdan et al.

goochjj commented 1 year ago

Moving quadlet towards golang, and making some of these options configurable through Go templates would be pretty cool.

rhatdan commented 1 year ago

That's the goal. We want to formalize the relationship and start encouraging people to use quadlet for running systemd based OCI containers.

goochjj commented 1 year ago

I've been considering this more and I'm not sure I feel about it.

I've done a lot of tricky things with systemD units - including dependencies, after/before stuff, depending on volume creation or other volume mounts, adding ExecStartPre=/usr/bin/systemctl is-active somedependency lines, etc. I'm not sure the format of /etc/containers/systemd/ will provide enough flexibility.

It seems to me like you're trying to make the command line portion more extendable, while also doing things like setting up defaults for notify and other things.

I guess my concern is the command line issues - is it because the command line is just... well..complicated? If so, can that be unwound by a wrapper script instead? Perhaps using environment variables.

I guess it feels kinda like a code-smell to create a new declarative language (/etc/containers/systemd/ files) to run through a fixed golang/C program (which is hardcoded with defaults) and then generate... another declarative language.

Similarly, we'd have to deal with the install section, unless the goal is to have dependencies not expressed by quadlet in /etc/systemd/system/ in *.wants/ folders or dropins - which then feels funny because the scripts are then split across directories.

Even if the golang program takes an approach (like, say Jinja templates) with golang templates - if I modify the template, it nullifies your goal of "we want to help users do this by having reasonable defaults we release with podman, and can update so all users get them" because then there has to be SOME sort of merge, whether that's a traditional .rpmorig/.rpmdist approach or *.d/ directory.

I'm not sure I have a suggestion. I'm guessing in MY case, I'll just continue to do it myself, and that's fine, and I may check in on quadlet/whateverthisbecomes to get appropriate defaults in the future to merge into my units.

alexlarsson commented 1 year ago

I'm not sure what you mean by "a new declarative language". The .container files are just systemd unit files. Only the "container" group is handled specially, everything else is essentially passed as is, which means you can use any complex systemd unit features you want, like dependencies, etc. All the features you mention can just be added as-is to the .container files.

I'm not sure what your issue is with the install section. The generator parses it already and creates symlinks in .wants dirs (etc) as needed.

The goal is to make it easy to create optimal systemd integrated podman service unit files with the full features of both podman and systemd without having to know a lot of details about the technical internals. Obviously you don't have to use it if you don't want to.

In addition we want to extend this to allow the ability to specify the container details with kubernetes yaml instead of using the container group in the unit file.

goochjj commented 1 year ago

Let me rephrase, I've only looked at it for about an hour :-D. Had to compile the C version for Flatcar and get it arranged, figure out how to run meson in a container, then I had to move it from /usr/local/etc (where it defaulted) to /etc, and then I looked at the generated file. So I didn't spend a whole lot of time.

w.r.t install section, some of my stuff is setup as

myservice.service (runs /bin/true) myservice-net.service -> PartOf myservice myservice-redis.service -> PartOf myservice myservice-thething.service -> PartOf myservice

I think this corresponds to myservice-net myservice-redis myservice-thething in /etc/containers/systemd/ and keeping myservice.service in the /etc/systemd/system folder.

Similarly I'll want to check things in /etc/containers/systemd/ to see if someservice@.service type units work.

I will do more testing. If you'd prefer, I can test more with the golang version (vs the C version). LIke I said I haven't made up my mind, and my opinion is just one, so I'll try to convert more of my configs over to quadlet and see how things go, and be more specific in the future, deal?

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

rhatdan commented 1 year ago

@vrothberg @Luap99 @alexlarsson @goochjj What should we do with this issue?

vrothberg commented 1 year ago

Leave it open. Many things are not yet addressed.

hillar commented 1 year ago

So if we take podman generate systemd --sdnotify=container ... notifyproxy already proxies ready and errors.

However there is more: watchdog,state,journaling

? Should i open new feature request conmon proxies sd_notify

if  sdnotify is container, then allow the OCI runtime to proxy the socket into the container to receive sd_notify messages

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

rhatdan commented 1 year ago

@alexlarsson @vrothberg Any update?

vrothberg commented 1 year ago

I think for Quadlet this work is done. generate systemd received some updates but there's some difference still. Since we want users to migrate over to Quadlet, it may be a good thing to keep things as they are now.

edsantiago commented 1 week ago

Podman has moved to quadlet

containers / podman

Better settings for running podman containers under systemd #15686