Per the meeting, @dustymabe and I will be investigating this. This topic will be revisited at the next meeting.
Also @JaimeMagiera, who will be bringing it to the OKD working group.
@dghubble might have some input as well.
My action for this week:
systemd-oomd does not operate on systems that aren't running cgroups v2, though it handles that gracefully by skipping startup with ConditionControlGroupController=v2.
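For anyone who wants to check a given host, a couple of standard commands show both the cgroup setup and that condition (just a sketch; output details vary by release):

stat -fc %T /sys/fs/cgroup                              # prints "cgroup2fs" on a cgroups v2 (unified) host
systemctl cat systemd-oomd.service | grep Condition     # shows the ConditionControlGroupController=v2 line
systemctl status systemd-oomd.service --no-pager        # reports the skipped start condition on cgroups v1 hosts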
Also of interest: the defaults for Fedora are set in the systemd-oomd-defaults package, so we'd want to install that if we decide to enable this feature.
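If it's useful for evaluating that, the package contents and the resulting settings can be inspected with ordinary tooling; a sketch, assuming the package is already layered on the host being checked:

rpm -ql systemd-oomd-defaults                  # list the oomd policy drop-ins the package ships
systemctl show -- -.slice | grep ManagedOOM    # see whether any ManagedOOM* settings end up on the root slice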
So far, the OKD WG does not think this is useful for OKD as this might create issues: the kubelet would think a random process crashed and would restart the pod/application. @LorbusChris will reach out to G. Scrivano for more input.
See discussion in https://github.com/openshift/okd/discussions/663 and notes in https://hackmd.io/YJBn04R5TDi5Sm9XbOGwZA.
@giuseppe do you think systemd-oomd is useful in the case where most workloads are owned by the kubelet (i.e. OpenShift/OKD)?
@LorbusChris I am not really sure as I've not played with it but I share the same concern as @travier: this could create confusion as both the kubelet and systemd-oomd in an OOM situation will try to kill processes and end up stepping on each other's toes.
Let's get kubernetes upstream (or individual downstreams) to exclude the certain systemd units they want from systemd-oomd consideration.
I don't think that this is the behavior we want here. We probably don't want to exclude all pods from being killed as systemd-oomd would then target system daemons that we probably don't want to have killed either.
Probably more appropriate to just have k8s/OKD disable systemd-oomd entirely instead?
That's what OKD folks are planning to do
systemd-oomd is enabled in degraded mode (no swap) in Fedora Cloud:
systemd-oomd[510]: Swap is currently not detected; memory pressure usage will be degraded
and also enabled (with swap, given the default filesystem layout) in Fedora Server.
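For anyone comparing those setups, oomctl (which ships with systemd-oomd) dumps what the daemon is actually monitoring, so the degraded/no-swap case is easy to spot; a sketch using standard commands:

systemctl is-active systemd-oomd    # confirm the daemon is running at all
swapon --show                       # empty output is what triggers the "degraded" message above
oomctl                              # dump monitored cgroups, swap usage, and pressure limits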
I can't see how systemd-oomd would be useful on a Kubernetes node, where Kubelet already makes workload eviction decisions. I guess the question is, is there a case for having kubelet manage workloads and systemd-oomd manage non-Kubernetes processes. I think the two would clash or cause confusion depending on the timing of their actions.
We discussed this in the community meeting today. A few relevant bits:
12:54:02 dustymabe | #info after talking to OKD, they plan to disable
| systemd-oomd. We still need to talk to typhoon. In order to
| get more information from actual users we might end up
| enabling systemd-oomd or swap-on-zram+systemd-oomd in our
| next stream to get feedback on any potential issues.
Also:
12:55:18 dustymabe | #action dustymabe to add butane config to #840 to show
| how to enable systemd-oomd
I can't see how systemd-oomd would be useful on a Kubernetes node, where Kubelet already makes workload eviction decisions. I guess the question is, is there a case for having kubelet manage workloads and systemd-oomd manage non-Kubernetes processes. I think the two would clash or cause confusion depending on the timing of their actions.
Thanks @dghubble for the Typhoon perspective. Since my suggestion in https://github.com/coreos/fedora-coreos-tracker/issues/840#issuecomment-857832415 wasn't a good one, an alternative may be to have FCOS default to disabling systemd-oomd if you're running Kubernetes.
Might be able to achieve this with an ExecCondition like:
mkdir -p /etc/systemd/system/systemd-oomd.service.d
cat <<EOF > /etc/systemd/system/systemd-oomd.service.d/disable-on-kubernetes.conf
[Service]
ExecCondition=/bin/bash -xc '/usr/bin/systemctl is-enabled --quiet kubelet.service && exit 1 || exit 0'
EOF
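A note on the mechanics, since it's easy to trip over: an ExecCondition= command that exits non-zero causes the unit to be skipped rather than failed, which is why the snippet inverts the is-enabled result. After writing a drop-in like that, something along these lines should confirm it's picked up:

systemctl daemon-reload
systemctl cat systemd-oomd.service    # the drop-in file should now be shown after the main unit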
Either way let's see how the exploratory testing that was proposed in the community meeting goes.
I think this is an upstream Kubernetes + systemd issue, not specific to FCOS (or OKD/OpenShift). It probably makes the most sense to track it in upstream Kubernetes. The conclusion may simply be to add a recommendation that systems installing kubelet also disable/mask systemd-oomd.service. If you can install kubelet.service, you can do that too.
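For reference, the disable/mask route is a one-liner that whatever layer installs kubelet could ship (a sketch; unit name as packaged in Fedora):

sudo systemctl mask --now systemd-oomd.service    # stop it and prevent any future activation
systemctl is-enabled systemd-oomd.service         # should now report "masked"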
I don't see the need for us to carry a FCOS-specific thing to look for a hardcoded kubelet.service that doesn't exist by default.
I'm happy to disable systemd-oomd.service in Typhoon to keep the OS from needing to have an ExecCondition for it. Seems ok to do now that every channel is on F34 and can be present, in case FCOS would like to enable it.
I don't see the need for us to carry a FCOS-specific thing to look for a hardcoded kubelet.service that doesn't exist by default.
To elaborate though, I do see the value in the idea of trying to be compatible in this space. But if we're saying FCOS is independent of Kubernetes, then our special OS-side hack isn't going to help someone who is e.g. doing something custom with podman or docker across upgrades.
And so far no one has argued for making this one a "provisioning discontinuity" like cgroupsv2 (right?).
Another aspect to this is that having oomd enabled by default is unlikely to be instantly fatal - some systems may start failing in some cases on upgrades (admittedly in a way that may be hard to debug) but unlike cgroupsv2 it's not going to instantly break things that aren't ready for it.
So I think the strawman is:
Fedora enables systemd-oomd for all variants[1] and the server uses the default profile. During my deep dive on the question of OOM, my conclusion is that enabling swap by default would give us the bigger bang for the buck. The misconception about swap is that it's a place to hold memory (or expand memory), but really swap is a place where inactive pages can be flushed to disk, analogous to a file cached in memory. When a system is under memory pressure, it will try to free pages, and not having swap means that the pressure can quickly result in OOMs. systemd-oomd can terminate a badly behaving program, but without swap, we lose some burst-ability.
I think we should enable it by default, but without some swap buffer, we're going to make entire pods subject to being killed on burst -- systemd-oomd kills the cgroup, so people could have their well-behaved DB nuked because their $APP went a little crazy.
[1] https://src.fedoraproject.org/rpms/fedora-release/blob/rawhide/f/90-default.preset
[2] https://src.fedoraproject.org/rpms/fedora-release/tree/rawhide
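A quick way to see how [1] plays out on an installed system, if anyone wants to double-check their own host (these are the standard preset locations and commands):

grep -r systemd-oomd /usr/lib/systemd/system-preset/    # whether the Fedora preset enables the unit
systemctl is-enabled systemd-oomd.service               # what that actually resulted in on this host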
@cgwalters So I think the strawman is:
- Get the message out that oomd is coming and ensure higher level FCOS consumers have prepared if necessary
- Enable by default (but maybe delay another Fedora release?)
Yeah I think that's the idea. We'll initially do a round of exploratory testing amongst ourselves (I'll post a butane config in this issue). Subsequently maybe let it soak in next for a while before moving to testing and on to stable.
@darkmuggle During my deep dive on the question of OOM, my conclusion is that enabling swap by default would give us the bigger bang for the buck.
Yeah I think we should start to revive the swap-on-zram discussion since we said we'd revisit it again. I'll probably start a new ticket for that.
I took another look at what systemd-oomd does and I agree that we should probably start the process of enabling it by default, at least on new nodes, with a transition period like we are doing for countme/cgroupsv2.
Looks like we need to seriously consider enabling swap on zram too.
Yeah I think we should start to revive the swap-on-zram discussion since we said we'd revisit it again. I'll probably start a new ticket for that.
Right, swap-on-zram is (AIUI) much more predictable than traditional swap-on-block.
(Because I am old, I remember seeing boxed software in Micro Center for (I think it was) a Windows 95 addon that was literally exactly compressed RAM...oh wow, right the Internet never forgets. I guess CPUs have gotten so much faster for some workloads it can make sense to burn a small portion of CPU to gain density)
12:55:18 dustymabe | #action dustymabe to add butane config to #840 to show how to enable systemd-oomd
A bit of an oversight on our part, but I just noticed it's already enabled in FCOS (brought in with systemd defaults in f34). So if you're running cgroups v2 (default for newly deployed nodes), you're already running it.
What we don't have is the systemd-oomd-defaults package that delivers some of the configuration.
The below config will enable swap-on-zram and install the systemd-oomd-defaults package.
variant: fcos
version: 1.3.0
storage:
  files:
    - path: /etc/systemd/zram-generator.conf
      mode: 0644
      contents:
        inline: |
          # This config file enables a /dev/zram0 device with the default settings
          [zram0]
systemd:
  units:
    - name: rpm-ostree-install.service
      enabled: true
      contents: |
        [Unit]
        Description=Layer rpms
        # We run after `systemd-machine-id-commit.service` to ensure that
        # `ConditionFirstBoot=true` services won't rerun on the next boot.
        After=systemd-machine-id-commit.service
        After=network-online.target
        ConditionPathExists=!/var/lib/rpm-ostree-install.stamp

        [Service]
        Type=oneshot
        RemainAfterExit=yes
        ExecStart=/usr/bin/rpm-ostree install --allow-inactive systemd-oomd-defaults
        ExecStart=/bin/touch /var/lib/rpm-ostree-install.stamp
        ExecStart=/bin/systemctl --no-block reboot

        [Install]
        WantedBy=multi-user.target
Please test! Either with or without zram (just remove the /etc/systemd/zram-generator.conf file in the butane config) would be nice.
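If it helps anyone testing, here's roughly what I'd check after the install unit reboots the node (standard commands; exact output will of course differ per host):

swapon --show                       # with zram-generator.conf in place, /dev/zram0 should show up as swap
zramctl                             # zram device details (size, compression algorithm)
rpm -q systemd-oomd-defaults        # confirm the package got layered
systemctl status systemd-oomd.service --no-pager
oomctl                              # dump what systemd-oomd is monitoring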
@dustymabe Yeah I think we should start to revive the swap-on-zram discussion since we said we'd revisit it again. I'll probably start a new ticket for that.
Created the ticket for that discussion: https://github.com/coreos/fedora-coreos-tracker/issues/859
Maybe I've missed something, it doesn't seem enabled by default yet.
$ rpm-ostree status
State: idle
AutomaticUpdatesDriver: Zincati
DriverState: active; periodically polling for updates (last checked Thu 2021-06-10 22:02:40 UTC)
Deployments:
● ostree://fedora:fedora/x86_64/coreos/testing
Version: 34.20210529.2.0 (2021-06-01T19:23:21Z)
BaseCommit: d7ad41d882de1a9b5652d29ea69b0aedb83e5dec66cb4ce379ff651af14536ee
GPGSignature: Valid signature by 8C5BA6990BDB26E19F2A1A801161AE6945719A39
LayeredPackages: qemu-user-static
ostree://fedora:fedora/x86_64/coreos/testing
Version: 34.20210529.2.0 (2021-06-01T19:23:21Z)
Commit: d7ad41d882de1a9b5652d29ea69b0aedb83e5dec66cb4ce379ff651af14536ee
GPGSignature: Valid signature by 8C5BA6990BDB26E19F2A1A801161AE6945719A39
$ systemctl status systemd-oomd
○ systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer
Loaded: loaded (/usr/lib/systemd/system/systemd-oomd.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Docs: man:systemd-oomd.service(8)
$ grep cgroup /proc/filesystems
nodev cgroup
nodev cgroup2
Maybe I've missed something, it doesn't seem enabled by default yet.
You're right. I think the system I checked on was one I was using to explore systemd-oomd a few weeks back.
You're right. I think the system I checked on was one I was using to explore systemd-oomd a few weeks back.
And that knowledge went...out of your memory eh?
Totally evicted!
Maybe I've missed something, it doesn't seem enabled by default yet.
You're right. I think the system I checked on was one I was using to explore systemd-oomd a few weeks back.
I think something else that added to my confusion is that the butane config I added in https://github.com/coreos/fedora-coreos-tracker/issues/840#issuecomment-859094939 did work even though I didn't explicitly enable systemd-oomd. Just adding the systemd-oomd-defaults package makes it start for some reason, though it's not clear to me how.
Just adding the systemd-oomd-defaults package makes it start for some reason, though it's not clear to me how.
@miabbott pointed out to me that adding configuration (like ManagedOOMMemoryPressure=kill) to any systemd unit will activate systemd-oomd. We'll need to mask it in cases where we don't want it to be run.
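For anyone following along, this is the shape of that per-unit opt-in (example.service is a made-up placeholder here; ManagedOOMMemoryPressure= itself is the standard systemd.resource-control(5) setting):

sudo mkdir -p /etc/systemd/system/example.service.d
sudo tee /etc/systemd/system/example.service.d/oomd.conf <<'EOF'
[Service]
# Opt this unit's cgroup in to pressure-based killing by systemd-oomd
ManagedOOMMemoryPressure=kill
EOF
sudo systemctl daemon-reload

Since any unit can opt in this way, masking systemd-oomd.service really is the only reliable opt-out once such settings are present.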
Just adding the systemd-oomd-defaults package makes it start for some reason, though it's not clear to me how.
@miabbott pointed out to me that adding configuration (like ManagedOOMMemoryPressure=kill) to any systemd unit will activate systemd-oomd. We'll need to mask it in cases where we don't want it to be run.
OK, this is a point for enabling (or not masking) by default as potentially expected functionality will break if we mask the unit.
Yeah, though we should find and fix those bugs in systemd/oomd if they exist. Users should be able to mask it without side effects.
We discussed this in the community meeting today.
@jdoss is working on doing some testing and we'll hear back from him next week.
We did make a small decision:
* AGREED: since oomd works better with swap, let's tie the swaponzram proposal and the oomd proposals together. If we do one, we do the other. (dustymabe, 16:50:58)
but we also decided to take a step back and discuss single node versus kubernetes defaults briefly first: https://github.com/coreos/fedora-coreos-tracker/issues/880
@jdoss is working on doing some testing
(below hosts are both on testing-devel/builds/34.20210620.20.0)
We recently ran into some memory leaks in our core app that caused the app (and eventually the fcos host) to go unresponsive. oom-killer did do some work, but did not kill the container causing the memory leak, which prevented our app from recovering automatically.
no swap-on-zram or systemd-oomd-defaults:
We just tested this against another host with swap-on-zram enabled + systemd-oomd-defaults installed (using @dustymabe's config; thank you for that unit!): swap-on-zram worked great and systemd-oomd (w/ defaults) killed our offending container as hoped, without any adverse effects on our app.
swap-on-zram enabled + systemd-oomd-defaults installed:
...with a very small/handy entry that we'll be able to easily use for logging/alerting to catch the next memory leak (maybe this will be our last! ha ha):
[core@community (community.testhost.com) ~]$ journalctl -u systemd-oomd -f
-- Journal begins at Thu 2021-06-24 14:06:29 UTC. --
Jun 24 14:18:27 community.testhost.com systemd[1]: Starting Userspace Out-Of-Memory (OOM) Killer...
Jun 24 14:18:28 community.testhost.com systemd[1]: Started Userspace Out-Of-Memory (OOM) Killer.
Jun 24 16:35:32 community.testhost.com systemd-oomd[3455]: Killed /machine.slice/machine-libpod_pod_5cd150410f0f8fd2825bd8232e212ec151f1aaa9e1899762efa5012ebdc522f4.slice/libpod-a20b33b96087e0ce7637845abde6efedee3bd6ac233f4ba98f1eadc4512a51f5.scope/container due to swap used (1854169088) / total (2057302016) being more than 90.00%
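Side note for anyone tuning this: the 90.00% in that log line is systemd-oomd's default SwapUsedLimit=, and it (along with the pressure limits) can be adjusted in oomd.conf. A sketch, assuming the usual systemd drop-in convention; see oomd.conf(5) for the exact option names on your systemd version:

sudo mkdir -p /etc/systemd/oomd.conf.d
sudo tee /etc/systemd/oomd.conf.d/50-thresholds.conf <<'EOF'
[OOM]
# Start killing the biggest swap consumers once swap is this full (default 90%)
SwapUsedLimit=80%
# Pressure limit/duration used for units with ManagedOOMMemoryPressure=kill
DefaultMemoryPressureLimit=60%
DefaultMemoryPressureDurationSec=30s
EOF
sudo systemctl restart systemd-oomd.service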
From our perspective, this is definitely a nicer experience with those two enabled.
Maybe it's time to revisit this now?
I agree. We should enable it by default.
I wanted to play around with installing systemd-oomd on FCOS. Has anyone found any useful documentation on what is needed for that?
I've actually been making a bit of progress investigating how one could use systemd-oomd, swap, and kubernetes on Fedora. But I'm not really sure how one installs this on FCOS.
doh! https://github.com/coreos/fedora-coreos-tracker/issues/840#issuecomment-859094939
Describe the enhancement
One Fedora 34 change was for enabling systemd-oomd by default. My original understanding was that swap was required for this to work, but that no longer appears to be the case. In an effort to keep differences from Fedora proper to a minimum we should consider enabling systemd-oomd.
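For anyone who wants to try it on an existing node before any default changes land, enabling it manually is just (the unit has shipped with systemd since F34):

sudo systemctl enable --now systemd-oomd.service
systemctl status systemd-oomd.service --no-pager

though, as discussed above, you'll likely also want the systemd-oomd-defaults package layered for it to do anything useful.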