coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

systemd-oomd for Fedora CoreOS #840

Open dustymabe opened 3 years ago

dustymabe commented 3 years ago

Describe the enhancement

One Fedora 34 change was to enable systemd-oomd by default. My original understanding was that swap was required for this to work, but that no longer appears to be the case.

In an effort to keep differences from Fedora proper to a minimum, we should consider enabling systemd-oomd.

darkmuggle commented 3 years ago

Per the meeting, @dustymabe and I will be investigating this. This topic will be revisited at the next meeting.

dustymabe commented 3 years ago

Also @JaimeMagiera, who will be bringing it to the OKD working group.

@dghubble might have some input as well.

dustymabe commented 3 years ago

My action for this week:

systemd-oomd does not operate on systems that aren't running cgroups v2, though it handles this gracefully by skipping startup via ConditionControlGroupController=v2:

https://github.com/systemd/systemd-stable/blob/37c4cfde0ce613f0f00544d3f4e2e72bf93d9c76/units/systemd-oomd.service.in#L16
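
For reference, the relevant bit of the unit file looks roughly like this (an abbreviated excerpt; see the link above for the full unit):

[Unit]
Description=Userspace Out-Of-Memory (OOM) Killer
Documentation=man:systemd-oomd.service(8)
ConditionControlGroupController=v2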

Also of interest: The defaults for Fedora are set in the systemd-oomd-defaults package. So we'd want to install that if we decide to enable this feature.
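
A quick way to check for it on a running system (plain rpm queries, nothing FCOS-specific):

rpm -q systemd-oomd-defaults    # is the package installed?
rpm -ql systemd-oomd-defaults   # which oomd config files does it deliver?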

travier commented 3 years ago

So far, the OKD WG does not think this is useful for OKD as this might create issues: the kubelet would think a random process crashed and would restart the pod/application. @LorbusChris will reach out to G. Scrivano for more input.

See discussion in https://github.com/openshift/okd/discussions/663 and notes in https://hackmd.io/YJBn04R5TDi5Sm9XbOGwZA.

LorbusChris commented 3 years ago

@giuseppe do you think systemd-oomd is useful in the case where most workloads are owned by the kubelet (i.e. OpenShift/OKD)?

giuseppe commented 3 years ago

@LorbusChris I am not really sure as I've not played with it, but I share the same concern as @travier: in an OOM situation, both the kubelet and systemd-oomd will try to kill processes and end up stepping on each other's toes.

dustymabe commented 3 years ago

Let's get kubernetes upstream (or individual downstreams) to exclude the specific systemd units they want from systemd-oomd consideration.
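
FWIW, systemd already has a knob for exactly this: ManagedOOMPreference= in systemd.resource-control(5). A minimal sketch of what a downstream could ship; the drop-in path and choice of unit here are hypothetical:

# Hypothetical drop-in, e.g. /etc/systemd/system/kubelet.service.d/oomd-preference.conf
[Service]
# "avoid" deprioritizes this cgroup as a kill candidate; "omit" excludes it entirely
ManagedOOMPreference=avoid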

travier commented 3 years ago

Let's get kubernetes upstream (or individual downstreams) to exclude the specific systemd units they want from systemd-oomd consideration.

I don't think that this is the behavior we want here. We probably don't want to exclude all pods from being killed, as systemd-oomd would then target system daemons that we probably don't want killed either.

jlebon commented 3 years ago

Probably more appropriate to just have k8s/OKD disable systemd-oomd entirely instead?

travier commented 3 years ago

Probably more appropriate to just have k8s/OKD disable systemd-oomd entirely instead?

That's what the OKD folks are planning to do.

travier commented 3 years ago

systemd-oomd is enabled in degraded mode (no swap) in Fedora Cloud:

systemd-oomd[510]: Swap is currently not detected; memory pressure usage will be degraded

and also enabled (with swap with default filesystem layout) in Fedora Server.
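
As an aside, you can dump systemd-oomd's current view of things (including whether it sees swap) with oomctl, along these lines:

oomctl dump    # prints swap/memory pressure state and the cgroups being monitored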

dghubble commented 3 years ago

I can't see how systemd-oomd would be useful on a Kubernetes node, where the kubelet already makes workload eviction decisions. I guess the question is: is there a case for having the kubelet manage workloads and systemd-oomd manage non-Kubernetes processes? I think the two would clash or cause confusion depending on the timing of their actions.

dustymabe commented 3 years ago

We discussed this in the community meeting today. A few relevant bits:

12:54:02      dustymabe | #info after talking to OKD, they plan to disable
                        | systemd-oomd. We still need to talk to typhoon. In order to
                        | get more information from actual users we might end up
                        | enabling systemd-oomd or swap-on-zram+systemd-oomd in our
                        | next stream to get feedback on any potential issues.

Also:

12:55:18      dustymabe | #action dustymabe to add butane config to #840 to show
                        | how to enable systemd-oomd

dustymabe commented 3 years ago

I can't see how systemd-oomd would be useful on a Kubernetes node, where the kubelet already makes workload eviction decisions. I guess the question is: is there a case for having the kubelet manage workloads and systemd-oomd manage non-Kubernetes processes? I think the two would clash or cause confusion depending on the timing of their actions.

Thanks @dghubble for the Typhoon perspective. Since my suggestion in https://github.com/coreos/fedora-coreos-tracker/issues/840#issuecomment-857832415 wasn't a good one, an alternative may be to have FCOS disable systemd-oomd by default if you're running Kubernetes.

Might be able to achieve this with an ExecCondition like:

# Drop-ins live under /etc/systemd/system/<unit>.d/ and need a section header
mkdir -p /etc/systemd/system/systemd-oomd.service.d
cat <<EOF > /etc/systemd/system/systemd-oomd.service.d/disable-on-kubernetes.conf
[Service]
ExecCondition=/bin/bash -xc '/usr/bin/systemctl is-enabled --quiet kubelet.service && exit 1 || exit 0'
EOF
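
To sanity-check the drop-in, reload and try starting the unit; on a host where kubelet.service is enabled, the start should be skipped because the condition fails (rough sketch):

systemctl daemon-reload
systemctl start systemd-oomd.service
systemctl status systemd-oomd.service   # should show the start was skipped when kubelet is enabled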

Either way let's see how the exploratory testing that was proposed in the community meeting goes.

cgwalters commented 3 years ago

I think this is an upstream Kubernetes + systemd issue, not specific to FCOS (or OKD/OpenShift). It probably makes the most sense to track it in upstream Kubernetes. The conclusion may simply be to add a recommendation that systems installing kubelet also disable/mask systemd-oomd.service. If you can install kubelet.service you can do that too.

I don't see the need for us to carry a FCOS-specific thing to look for a hardcoded kubelet.service that doesn't exist by default.

dghubble commented 3 years ago

I'm happy to disable systemd-oomd.service in Typhoon to keep the OS from needing to have an ExecCondition for it. Seems ok to do now that every channel is on F34 and the unit can be present, in case FCOS would like to enable it.

cgwalters commented 3 years ago

I don't see the need for us to carry a FCOS-specific thing to look for a hardcoded kubelet.service that doesn't exist by default.

To elaborate, though: I do see the value in the idea of trying to be compatible in this space. But if we're saying FCOS is independent of Kubernetes, then our special OS-side hack isn't going to help someone who is e.g. doing something custom with podman or docker across upgrades.

And so far no one has argued for making this one a "provisioning discontinuity" like cgroupsv2 (right?).

Another aspect to this is that having oomd enabled by default is unlikely to be instantly fatal: some systems may start failing in some cases on upgrades (admittedly in a way that may be hard to debug), but unlike cgroupsv2 it's not going to instantly break things that aren't ready for it.

So I think the strawman is:

  • Get the message out that oomd is coming and ensure higher level FCOS consumers have prepared if necessary
  • Enable by default (but maybe delay another Fedora release?)

darkmuggle commented 3 years ago

Fedora enables systemd-oomd for all variants[1], and Server uses the default profile. During my deep dive on the question of OOM, my conclusion is that enabling swap by default would give us the bigger bang for the buck. The misconception about swap is that it's a place to hold memory (or expand memory), but really swap is a place where inactive pages can be flushed to disk, analogous to a file cached in memory. When a system is under memory pressure, it will try to free pages, and not having swap means that the pressure can result in OOMs quickly. systemd-oomd can terminate a badly behaving program, but without swap, we lose some burstability.

I think we should enable it by default, but without some swap buffer, we're going to make entire pods subject to being killed on burst -- systemd-oomd kills the whole cgroup, so people could have their well-behaved DB nuked because their $APP went a little crazy.

[1] https://src.fedoraproject.org/rpms/fedora-release/blob/rawhide/f/90-default.preset
[2] https://src.fedoraproject.org/rpms/fedora-release/tree/rawhide

dustymabe commented 3 years ago

@cgwalters So I think the strawman is:

  • Get the message out that oomd is coming and ensure higher level FCOS consumers have prepared if necessary
  • Enable by default (but maybe delay another Fedora release?)

Yeah, I think that's the idea. We'll initially do a round of exploratory testing amongst ourselves (I'll post a butane config in this issue). Subsequently, maybe let it soak in next for a while before moving to testing and on to stable.

@darkmuggle During my deep dive on the question of OOM, my conclusion is that enabling swap by default would give us the bigger bang for the buck.

Yeah I think we should start to revive the swap-on-zram discussion since we said we'd revisit it again. I'll probably start a new ticket for that.

travier commented 3 years ago

I took another look at what systemd-oomd does and I agree that we should probably start the process of enabling it by default, at least on new nodes, with a transition period like we are doing for countme/cgroupsv2.

Looks like we need to seriously consider enabling swap on zram too.

cgwalters commented 3 years ago

Yeah I think we should start to revive the swap-on-zram discussion since we said we'd revisit it again. I'll probably start a new ticket for that.

Right, swap-on-zram is (AIUI) much more predictable than traditional swap-on-block.

(Because I am old, I remember seeing boxed software in Micro Center for (I think it was) a Windows 95 add-on that was literally compressed RAM... oh wow, right, the Internet never forgets. I guess CPUs have gotten so much faster that for some workloads it can make sense to burn a small portion of CPU to gain density.)

dustymabe commented 3 years ago

12:55:18      dustymabe | #action dustymabe to add butane config to #840 to show
                        | how to enable systemd-oomd

A bit of an oversight on our part, but I just noticed it's already enabled in FCOS (brought in with systemd defaults in f34). So if you're running cgroups v2 (default for newly deployed nodes), you're already running it.

What we don't have is the systemd-oomd-defaults package that delivers some of the configuration.

The below config will enable swap-on-zram and install the systemd-oomd-defaults package.

variant: fcos
version: 1.3.0
storage:
  files:
    - path: /etc/systemd/zram-generator.conf
      mode: 0644
      contents:
        inline: |
          # This config file enables a /dev/zram0 device with the default settings
          [zram0]
systemd:
  units:
    - name: rpm-ostree-install.service
      enabled: true
      contents: |
        [Unit]
        Description=Layer rpms
        # We run after `systemd-machine-id-commit.service` to ensure that
        # `ConditionFirstBoot=true` services won't rerun on the next boot.
        After=systemd-machine-id-commit.service
        # Pull in network-online (not just order after it) so the package download can work
        Wants=network-online.target
        After=network-online.target
        ConditionPathExists=!/var/lib/rpm-ostree-install.stamp

        [Service]
        Type=oneshot
        RemainAfterExit=yes
        ExecStart=/usr/bin/rpm-ostree install --allow-inactive systemd-oomd-defaults
        ExecStart=/bin/touch /var/lib/rpm-ostree-install.stamp
        ExecStart=/bin/systemctl --no-block reboot

        [Install]
        WantedBy=multi-user.target

Please test! Testing either with or without zram (just remove the /etc/systemd/zram-generator.conf file from the butane config) would be nice.
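
For anyone testing: the config above transpiles to Ignition with the butane CLI (the filenames here are placeholders):

butane --pretty --strict oomd-test.bu > oomd-test.ign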

dustymabe commented 3 years ago

@dustymabe Yeah I think we should start to revive the swap-on-zram discussion since we said we'd revisit it again. I'll probably start a new ticket for that.

Created the ticket for that discussion: https://github.com/coreos/fedora-coreos-tracker/issues/859

dghubble commented 3 years ago

Maybe I've missed something, but it doesn't seem enabled by default yet.

$ rpm-ostree status
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; periodically polling for updates (last checked Thu 2021-06-10 22:02:40 UTC)
Deployments:
● ostree://fedora:fedora/x86_64/coreos/testing
                   Version: 34.20210529.2.0 (2021-06-01T19:23:21Z)
                BaseCommit: d7ad41d882de1a9b5652d29ea69b0aedb83e5dec66cb4ce379ff651af14536ee
              GPGSignature: Valid signature by 8C5BA6990BDB26E19F2A1A801161AE6945719A39
           LayeredPackages: qemu-user-static

  ostree://fedora:fedora/x86_64/coreos/testing
                   Version: 34.20210529.2.0 (2021-06-01T19:23:21Z)
                    Commit: d7ad41d882de1a9b5652d29ea69b0aedb83e5dec66cb4ce379ff651af14536ee
              GPGSignature: Valid signature by 8C5BA6990BDB26E19F2A1A801161AE6945719A39
$ systemctl status systemd-oomd
○ systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer
     Loaded: loaded (/usr/lib/systemd/system/systemd-oomd.service; disabled; vendor preset: disabled)
     Active: inactive (dead)
       Docs: man:systemd-oomd.service(8)
$ grep cgroup /proc/filesystems
nodev   cgroup
nodev   cgroup2

dustymabe commented 3 years ago

Maybe I've missed something, but it doesn't seem enabled by default yet.

You're right. I think the system I checked on was one I was using to explore systemd-oomd a few weeks back.

cgwalters commented 3 years ago

You're right. I think the system I checked on was one I was using to explore systemd-oomd a few weeks back.

And that knowledge went...out of your memory eh?

dustymabe commented 3 years ago

Totally evicted!

dustymabe commented 3 years ago

Maybe I've missed something, but it doesn't seem enabled by default yet.

You're right. I think the system I checked on was one I was using to explore systemd-oomd a few weeks back.

I think something else that added to my confusion is that the butane config I added in https://github.com/coreos/fedora-coreos-tracker/issues/840#issuecomment-859094939 did work even though I didn't explicitly enable systemd-oomd. Just adding the systemd-oomd-defaults package makes it start for some reason, though it's not clear to me how.

dustymabe commented 3 years ago

Just adding the systemd-oomd-defaults package makes it start for some reason, though it's not clear to me how.

@miabbott pointed out to me that adding configuration (like ManagedOOMMemoryPressure=kill) to any systemd unit will activate systemd-oomd. We'll need to mask it in cases where we don't want it to be run.
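
For context, a sketch of what that kind of configuration looks like; the unit and path here are hypothetical (systemd-oomd-defaults ships its own drop-ins):

# /etc/systemd/system/myapp.service.d/oomd.conf (illustrative)
[Service]
# Any ManagedOOM*= setting like this causes systemd to activate systemd-oomd
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%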

travier commented 3 years ago

Just adding the systemd-oomd-defaults package makes it start for some reason, though it's not clear to me how.

@miabbott pointed out to me that adding configuration (like ManagedOOMMemoryPressure=kill) to any systemd unit will activate systemd-oomd. We'll need to mask it in cases where we don't want it to be run.

OK, this is a point for enabling (or not masking) it by default, as potentially expected functionality will break if we mask the unit.

dustymabe commented 3 years ago

OK, this is a point for enabling (or not masking) it by default, as potentially expected functionality will break if we mask the unit.

Yeah, though we should find and fix those bugs in systemd/oomd if they exist. Users should be able to mask it without side effects.
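
For reference, masking (rather than merely disabling) is what blocks that implicit activation:

sudo systemctl mask --now systemd-oomd.service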

dustymabe commented 3 years ago

We discussed this in the community meeting today.

@jdoss is working on doing some testing and we'll hear back from him next week.

We did make a small decision:

  * AGREED: since oomd works better with swap, let's tie the swaponzram
    proposal and the oomd proposals together. If we do one, we do the
    other.  (dustymabe, 16:50:58)

but we also decided to take a step back and discuss single node versus kubernetes defaults briefly first: https://github.com/coreos/fedora-coreos-tracker/issues/880

andygeorge commented 3 years ago

@jdoss is working on doing some testing

(below hosts are both on testing-devel/builds/34.20210620.20.0)

We recently ran into some memory leaks in our core app that caused the app (and eventually the fcos host) to go unresponsive. oom-killer did do some work, but did not kill the container causing the memory leak, which prevented our app from recovering automatically.

no swap-on-zram or systemd-oomd-defaults: [screenshot]

We just tested this against another host with swap-on-zram enabled + systemd-oomd-defaults installed (using @dustymabe's config; thank you for that unit!): swap-on-zram worked great and systemd-oomd (w/ defaults) killed our offending container as hoped, without any adverse effects on our app.

swap-on-zram enabled + systemd-oomd-defaults installed: [screenshot]

...with a very small/handy entry that we'll be able to easily use for logging/alerting to catch the next memory leak (maybe this will be our last! ha ha):

[core@community (community.testhost.com) ~]$ journalctl -u systemd-oomd -f
-- Journal begins at Thu 2021-06-24 14:06:29 UTC. --
Jun 24 14:18:27 community.testhost.com systemd[1]: Starting Userspace Out-Of-Memory (OOM) Killer...
Jun 24 14:18:28 community.testhost.com systemd[1]: Started Userspace Out-Of-Memory (OOM) Killer.
Jun 24 16:35:32 community.testhost.com systemd-oomd[3455]: Killed /machine.slice/machine-libpod_pod_5cd150410f0f8fd2825bd8232e212ec151f1aaa9e1899762efa5012ebdc522f4.slice/libpod-a20b33b96087e0ce7637845abde6efedee3bd6ac233f4ba98f1eadc4512a51f5.scope/container due to swap used (1854169088) / total (2057302016) being more than 90.00%

From our perspective, this is definitely a nicer experience with those two enabled.
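
(A crude version of that alerting hook could be a periodic journal grep; purely illustrative:)

journalctl -u systemd-oomd.service --since=-15min --no-pager | grep Killed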

cmurf commented 2 years ago

Maybe it's time to revisit this now?

jdoss commented 2 years ago

I agree. We should enable it by default.

miabbott commented 2 years ago

Related: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/CRREDQUPPJYWVRMA4DOKYU2KZZLKC4D5/

kannon92 commented 8 months ago

I wanted to play around with installing systemd-oomd on FCOS. Has anyone found any useful documentation on what is needed for that?

I've actually been making a bit of progress investigating how one could use systemd-oomd, swap, and kubernetes on Fedora. But I'm not really sure how one installs this on FCOS.

doh! https://github.com/coreos/fedora-coreos-tracker/issues/840#issuecomment-859094939