coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

F33 feature/change proposal SwapOnZRAM by default #509

Closed cmurf closed 3 years ago

cmurf commented 4 years ago

This will be proposed for Fedora 33 all editions and spins https://fedoraproject.org/wiki/Changes/SwapOnZRAM

I'd like to make the case that some swap, especially if it's fast, is better than no swap. In the no swap case, if the system comes under any memory pressure, it must resort to reclaiming file pages, because without swap it can't evict even inactive anonymous pages. Such a system starts to do a kind of swap thrashing, which isn't really swap related since no swap exists; it's a churn of reading file pages on demand that then almost immediately get dropped out of memory again. Whereas if there were a swap-on-ZRAM device, those inactive anonymous pages would get evicted and compressed, freeing up memory and avoiding the reclaim of file pages.

Other than totally opting out, I think there are two options Fedora CoreOS could consider:

  1. include zram-generator, but not include a configuration file. Without a configuration file, there is no setup of a swap-on-zram device.

  2. include both zram-generator and configuration; but you could have a different configuration than the default if you want to go even more minimalist than proposed. e.g. maybe use a fraction of 20% RAM instead of 50%.

Note that Fedora IoT has been using swap-on-ZRAM for some time and they're defaulting to 50% RAM which is the same as this proposal.

cmurf commented 4 years ago

It's in the change proposal, but just to put a fine point on it here: when the /dev/zram0 device is created with whatever size, this is not a preallocation. It doesn't actually consume memory right away. There is about 0.1% overhead to create it, but otherwise the memory is dynamically allocated and deallocated based on demand.
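
To put a rough number on that overhead, here is a back-of-the-envelope sketch in Python. It assumes the 0.1% figure is relative to the configured device size (the comment above does not say what the percentage refers to), and the function name is made up for illustration.

```python
def zram_idle_overhead_mib(device_size_mib: float,
                           overhead_fraction: float = 0.001) -> float:
    """Estimate memory used by an empty zram device.

    Assumption: the ~0.1% overhead is a fraction of the configured
    device size. Pages stored later cost extra, on demand.
    """
    return device_size_mib * overhead_fraction


# A 4 GiB device that holds no pages yet.
print(f"{zram_idle_overhead_mib(4096):.1f} MiB")
```

Under that assumption, even a multi-GiB device costs only a few MiB until pages are actually swapped into it.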

cgwalters commented 4 years ago

Note that Kubernetes explicitly fails if swap is enabled: https://github.com/kubernetes/kubernetes/issues/53533

Of course this swap isn't really the same as other swap, particularly if you're doing swap on any kind of rotational storage (but hopefully no one is doing that anymore).

Personally, I think zram is a convenient and cheap approach in some scenarios, but what we really want is to make the operating system behave more like iOS/Android by default and actively evict (i.e. kill) applications (and yes, this requires intelligence in the framework and apps). Scheduling applications more intelligently is basically what Kubernetes is doing.

cmurf commented 4 years ago

From my reading, Android has been using swap-on-zram for a while already, as has Chromium/Chrome OS. It's not consistently deployed by OEMs, I guess.

https://source.android.com/devices/tech/perf/low-ram

cmurf commented 4 years ago

"Swap can make a system slower to OOM kill". I don't know if this concern is why cloud environments tend to not have swap configured. But oomd2 and likely future systemd-oomd depends on various PSI metrics including swap pressure, i.e. swap needs to exist to do this. I think using zram based swap for this is an open question; I did ask some upstream kernel cgroupsv2 folks about it, and they kinda shrugged and said it all depends, and may even need to be made dynamic based on the workload.

lucab commented 4 years ago

We covered this in the last meeting. There were several thumbs-up on both not having swap by default (i.e. current status) and allowing users to opt-in swap-on-zram (i.e. the F33 way, minus the always-on default).

The general flow that we could be targeting is: 1) write zram conf via Ignition 2) write formatting + mounting units via Ignition 3) let the zram-generator create the devices 4) let the other units format and enable the swap-on-zram

For this to work, we assume that FCOS can generally just follow the vanilla Fedora approach here. The only point of contention/customization would be around vendor defaults. zram-generator does not currently support the whole set of fragments and overlays like other systemd components, and I opened https://github.com/systemd/zram-generator/issues/29 to push that forward.

@cmurf I didn't see the change-ticket for this F33 feature, but IMHO it would be nice to bring up https://github.com/systemd/zram-generator/issues/29 as a soft-blocker there.

PS: I'm deliberately ignoring the whole "is swap default better on or off" discussion here. I'd like to keep this ticket focused on the swap-on-zram topic.

cmurf commented 4 years ago

@lucab The feature is still brand new, so it doesn't have a change tracking bug yet and hasn't yet been approved by FESCo. I agree with the approach in zram-generator#29 but leave it up to CoreOS folks to decide whether to ship a missing /usr config to indicate disabled by default, or if you want Ignition to drop an empty file into /etc by default to indicate it.

lucab commented 4 years ago

For reference, the discussion here brought to light CVE-2020-10781 (local DoS, fix upcoming).

dustymabe commented 4 years ago

Nice work Luca!

lucab commented 4 years ago

https://github.com/systemd/zram-generator/pull/33 added support for configuration fragments. If/once this proposal lands in Fedora, we just need to add a vendor fragment to disable the default configuration, and then document how people can opt back in.

cmurf commented 4 years ago

I think what you'd do is install zram-generator package, and not install zram-generator-defaults package. The generator will be present but do nothing. And then you can opt in by any means of creating /etc/systemd/zram-generator.conf that you wish.

dustymabe commented 4 years ago

Now that I know a little bit more about how swap on zram works, I think we could consider enabling it by default after getting some real world experience with it. It could also be something we do at a later time (opt-in for now, default later). It would be nice if it could give us some wins in environments with fewer resources.

cmurf commented 4 years ago

Fedora IoT has been enabling it since the start. These defaults are a bit more conservative considering the 4G cap, which are subject to change. I've talked to a few kernel fs/storage/mm/cgroups folks about it and it's definitely better than no swap. Eviction at 50% efficacy (based on compression ratio estimate) is not as good as 100% efficacy using disk-based swap; but is still better than 0% which causes anonymous pages to be pinned to memory, and increases the chance of otherwise unnecessary reclaim. So even without swap you can get "swap like" behavior, and repetitive reclaim is expensive.
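
The efficacy comparison above can be sketched as simple arithmetic. This is only an illustration: the 2:1 compression ratio is the estimate behind the "50%" figure, real ratios depend on the workload, and the function and backend names are hypothetical.

```python
def net_memory_freed_mib(evicted_mib: float, backend: str,
                         compression_ratio: float = 2.0) -> float:
    """Memory actually freed by evicting inactive anonymous pages."""
    if backend == "none":
        # No swap: anonymous pages stay pinned, nothing is freed.
        return 0.0
    if backend == "disk":
        # Disk-based swap: evicted pages leave RAM entirely.
        return float(evicted_mib)
    if backend == "zram":
        # zram: the compressed copy still occupies RAM.
        return evicted_mib - evicted_mib / compression_ratio
    raise ValueError(f"unknown backend: {backend}")


for backend in ("none", "zram", "disk"):
    print(backend, net_memory_freed_mib(1024, backend))
```

With the assumed 2:1 ratio, evicting 1024 MiB to zram frees half of it, which sits between the no-swap and disk-swap cases.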

dustymabe commented 4 years ago

A bit more information: The configuration for the zram-generator has a setting:

# The maximum amount of memory (in MiB). If the machine has more RAM
# than this, zram device will not be created.
#
# "host-memory-limit = none" may be used to disable this limit. This
# is also the default.
host-memory-limit = 9048

So hosts with more than $host-memory-limit RAM will see no change if we were to implement this. I think currently we've accepted that including the zram-generator package (pending any security or bug fixes that are found) is something we want to do.
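
A minimal sketch of how that limit interacts with the sizing discussed earlier in this thread (50% of RAM, 4G cap). This is not the generator's actual code: the function and the `zram_fraction` / `max_zram_size_mib` parameter names are assumptions chosen to mirror those numbers.

```python
from typing import Optional


def zram_device_size_mib(host_ram_mib: float,
                         host_memory_limit_mib: Optional[float] = None,
                         zram_fraction: float = 0.5,
                         max_zram_size_mib: float = 4096) -> Optional[float]:
    """Return the zram device size in MiB, or None if no device is created.

    host_memory_limit_mib=None models "host-memory-limit = none".
    """
    if host_memory_limit_mib is not None and host_ram_mib > host_memory_limit_mib:
        # The machine has more RAM than the limit: skip the device.
        return None
    # Otherwise take a fraction of RAM, capped at the maximum size.
    return min(host_ram_mib * zram_fraction, max_zram_size_mib)


print(zram_device_size_mib(2048, host_memory_limit_mib=4096))
print(zram_device_size_mib(16384, host_memory_limit_mib=9048))
print(zram_device_size_mib(16384))
```

Under these assumptions, a 2 GiB host under a 4096 MiB limit gets a device sized at half its RAM, a 16 GiB host under the 9048 MiB limit gets none, and with no limit the cap applies.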

The real question is:

lucab commented 4 years ago

@dustymabe I'd like to answer "yes/maybe" here, but my understanding is that all higher level orchestration systems (e.g. k8s, nomad, etc.) basically assume a "no". Their memory accounting and scheduling logic usually does not cover the swap case, as it makes the logic way more complex (a hierarchy of memory pools with different access properties); see Colin's first comment. In short, I doubt we have free will on the default value here right now, similarly to the cgroupsv1 case.

dustymabe commented 4 years ago

I think what you're saying is reasonable.

dustymabe commented 3 years ago

ok so it seems like we are leaning towards included but not enabled by default. We have two options for that that I see:

  1. Include both zram-generator-defaults and zram-generator packages. Place override at /etc/systemd/zram-generator.conf to disable.

This means in order to enable the defaults the user just deletes the /etc/systemd/zram-generator.conf file. Documentation is slightly easier.

  2. Include just the zram-generator package.

In order to enable zram you'd need to create a file at /etc/systemd/zram-generator.conf with at least [zram0] in it. In this case documentation is slightly longer and probably needs to explain the contents of the file briefly, which might be desirable anyway.

dustymabe commented 3 years ago

> ok so it seems like we are leaning towards included but not enabled by default.

Though one thing we could do is create our own FCOS config with host-memory-limit = 4096 so it would only be enabled on systems with less than 4GiB of ram by default (or some other ram value we deem appropriate).

bgilbert commented 3 years ago

> Though one thing we could do is create our own FCOS config with host-memory-limit = 4096 so it would only be enabled on systems with less than 4GiB of ram by default (or some other ram value we deem appropriate).

It seems like it could surprise users if we enable a potentially Kubernetes-breaking feature only on machines with certain amounts of RAM.

cmurf commented 3 years ago

Maybe ask Kubernetes users if noswap actually manifests well in practice? It's leaving a lot on the table to either insist on significant overprovision of memory to avoid both the need for page eviction and reclaim, or suffer with reclaim which can be worse than incidental paging, especially when using a memory based swap. The noswap by default policy translates into an expectation to throw more memory at such setups (and pay for it).

I think it's better to optimize for the general purpose use cases CoreOS is targeting, rather than papering over a Kubernetes design oversight. That is, they fail on swap because they haven't worked out how swap should look with the semantics of guaranteed pods, not because there's some real technical limitation of swap.

dustymabe commented 3 years ago

We discussed this in the meeting today.

13:05:25      dustymabe | #agreed We'll include the zram-generator package for now, which will allow
                        | users to drop down a config file to enable swaponzram. Additionally we'll
                        | add docs to show users how to do this. In the future we'll re-evaluate if
                        | creating a swaponzram device by default, is the right thing for us to do.

dustymabe commented 3 years ago

The fix for this went into testing stream release 32.20201018.2.0. Please try out the new release and report issues.

dustymabe commented 3 years ago

The fix for this went into stable stream release 32.20201018.3.0.