Open awesomekling opened 1 year ago
Yesterday, on Linux, I ran a program that gobbled up too much RAM and it crashed my window manager, killing every application I had open. Quite a disruptive experience. This is clearly a failing of Linux. Will SerenityOS do better?
That's a really good point, IMO. Services should still worry about OOMs because it's relatively easy to do so there (they're connection-based applications, which means you can just reject new connections in an OOM situation). That would mean WindowServer wouldn't crash but instead reject new clients when the system runs out of memory, for example.
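To make the "reject new connections" idea concrete, here is a minimal sketch in standard C++. Everything in it is hypothetical (ConnectionGate, AcceptResult, the byte budget and the per-client cost are made-up names and numbers, not WindowServer's actual API); it only shows a service refusing new clients once admitting one would eat into headroom reserved for the clients it already has.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names throughout; this is not how WindowServer is implemented.
enum class AcceptResult { Accepted, RejectedLowMemory };

class ConnectionGate {
public:
    // budget_bytes: memory the service may use in total (assumed to be known).
    // reserve_bytes: headroom kept free for requests from existing clients.
    // per_client_bytes: rough estimate of what one new client costs.
    ConnectionGate(std::size_t budget_bytes, std::size_t reserve_bytes, std::size_t per_client_bytes)
        : m_per_client(per_client_bytes)
    {
        assert(reserve_bytes <= budget_bytes);
        m_admission_limit = budget_bytes - reserve_bytes;
    }

    // Refuse the new client instead of crashing later on its behalf.
    AcceptResult try_accept()
    {
        // m_used never exceeds m_admission_limit, so this cannot underflow.
        if (m_per_client > m_admission_limit - m_used)
            return AcceptResult::RejectedLowMemory;
        m_used += m_per_client;
        ++m_clients;
        return AcceptResult::Accepted;
    }

    void disconnect()
    {
        assert(m_clients > 0);
        --m_clients;
        m_used -= m_per_client;
    }

    std::size_t client_count() const { return m_clients; }

private:
    std::size_t m_admission_limit { 0 };
    std::size_t m_per_client { 0 };
    std::size_t m_used { 0 };
    std::size_t m_clients { 0 };
};
```

With a budget of 100, a reserve of 20 and a per-client cost of 30, the first two clients are admitted and the third is rejected until someone disconnects. A real service would measure memory rather than count estimates, so treat the accounting here as a placeholder.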
Maybe crashing can be avoided? Is it feasible to reserve some amount of RAM to keep services running, but for GUI applications just freeze them until more RAM is available? This would allow the user to choose which app to close, or to just wait until some process finishes.
> Yesterday, on Linux, I ran a program that gobbled up too much RAM and it crashed my window manager, killing every application I had open. Quite a disruptive experience. This is clearly a failing of Linux. Will SerenityOS do better?
Yes, it should be (relatively) easy to do better by simply organizing processes into priority tiers like iOS does with its jetsam property system.
Vital system services like window management, clipboard, etc, should be prioritized as such, which would prevent bloated apps from taking them down.
IMO the problem on Linux isn't overcommit itself, but rather that the OOM killer doesn't have good metadata to inform its decisions, and apps don't get informed/updated about memory limits in a structured way.
> That's a really good point, IMO. Services should still worry about OOMs because it's relatively easy to do so there (they're connection-based applications which means you can just reject new connections in an OOM situation). That would mean WindowServer wouldn't crash but instead reject new clients when the system runs out of memory, for example.
Note that WindowServer still has the issue that connected clients might ask for additional resources to be allocated. "Make me a full-screen window" for example, and now we have a bunch of new allocations that may fail.
That said, I totally agree that we should do what we can to keep WindowServer up, including rejecting new clients when resources are low, responding to memory pressure notifications by evicting caches etc, and ofc setting WindowServer to a high priority that means the kernel will kill almost anyone else first.
That sounds good overall. We can't do much about existing connections causing OOMs, unless you want to make WindowServer requests fallible (which is kind of what we're trying to battle against here in the first place :sweat_smile:). Existing connections can be dropped in that scenario I suppose.
Do you think we should also separate services into these "tiers"? Right now we don't really handle connection drops in IPC connections so we would need to harden that area before letting lower priority services crash. For instance, ClipboardServer is probably OK to OOM-kill since the blast radius would be smaller than WindowServer, assuming the necessary hardening is in place.
Yes, I'm thinking the tiers would really be some kind of priority number, e.g in the 1-100 range, where 100 would be "keep alive at all costs" and 1 is "someone wants to mmap and I need more pages, you are now dead".
This would allow us to establish importance order between our services, since as you say, some services are far more important than others.
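As a rough illustration of the priority-number idea, the sketch below picks a victim from a list of processes by choosing the lowest number on the 1-100 scale. The ProcessInfo struct, the oom_priority field and pick_oom_victim() are hypothetical names made up for this example, and the priorities in the usage note are arbitrary; nothing here reflects an existing kernel interface.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical bookkeeping; field names are assumptions for illustration.
struct ProcessInfo {
    std::string name;
    int oom_priority; // 1 = "you are now dead", 100 = "keep alive at all costs"
};

// Returns the process that should be killed first, or nothing if the list is empty.
std::optional<ProcessInfo> pick_oom_victim(std::vector<ProcessInfo> const& processes)
{
    if (processes.empty())
        return std::nullopt;

    auto victim = std::min_element(
        processes.begin(), processes.end(),
        [](ProcessInfo const& a, ProcessInfo const& b) {
            return a.oom_priority < b.oom_priority;
        });

    // The scale from the discussion above is 1-100.
    assert(victim->oom_priority >= 1 && victim->oom_priority <= 100);
    return *victim;
}
```

Given WindowServer at 100, ClipboardServer at 60 and a bloated app at 1, the bloated app is chosen. Ties go to whichever process appears first in the list; a real policy would probably also weigh how much memory killing each candidate would free.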
Another thing I would like to borrow from Apple platforms is the process clean/dirty flags where a process can mark itself as clean, meaning "I have no important in-memory state and I can be killed whenever if you need my resources". This would allow e.g Clipboard to write out its state to a file, and then mark itself clean. A bit like purgeable memory but for entire processes.
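A minimal sketch of the clean/dirty bookkeeping from the process's own point of view, in standard C++. ClipboardModel and its methods are hypothetical, and a std::ostream stands in for the state file; how the flag would actually be reported to the kernel is left out because no such interface is described here.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical model of a service that can declare itself safe to kill.
class ClipboardModel {
public:
    // Any change to in-memory state makes the process dirty again.
    void set_contents(std::string data)
    {
        m_contents = std::move(data);
        m_dirty = true;
    }

    // Persist the state, and only mark the process clean if the write succeeded.
    bool flush_to(std::ostream& out)
    {
        out << m_contents;
        out.flush();
        if (!out)
            return false;
        m_dirty = false;
        return true;
    }

    // Clean means: "no important in-memory state, kill me whenever".
    bool is_clean() const { return !m_dirty; }

private:
    std::string m_contents;
    bool m_dirty { false };
};

// Usage example with an in-memory stream standing in for the state file.
inline void clipboard_example()
{
    ClipboardModel clipboard;
    std::ostringstream fake_file;

    clipboard.set_contents("copied text");
    assert(!clipboard.is_clean());

    bool flushed = clipboard.flush_to(fake_file);
    assert(flushed && clipboard.is_clean());
    assert(fake_file.str() == "copied text");
}
```

The point of the ordering in flush_to() is that the clean flag is never set before the data is known to be written, so a kill at any moment loses nothing that the process had promised was saved.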
Rick Byers from the Google Chrome team had an interesting thread about changing the mental model around this: https://twitter.com/RickByers/status/1689715401725235200
(Disclaimer: Not a SerenityOS developer, feel free to take or leave this.)
> That said, I totally agree that we should do what we can to keep WindowServer up, including rejecting new clients when resources are low, responding to memory pressure notifications by evicting caches etc, and ofc setting WindowServer to a high priority that means the kernel will kill almost anyone else first.
Could other processes benefit from some sort of “memory is low, please free up resources” notification? That way processes can aggressively cache data in memory for performance when memory is plentiful, but if memory becomes low, the caches are dropped before anything gets killed.
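Something like the following is what I have in mind, sketched in standard C++. PressureNotifier, MemoryPressure and ThumbnailCache are hypothetical names; in a real system the notification would presumably come from the kernel or a system service rather than from an in-process object, so this only shows the subscriber side of the pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical pressure levels; a real system may have more than two.
enum class MemoryPressure { Normal, Low };

// Stand-in for whatever mechanism delivers "memory is low" to a process.
class PressureNotifier {
public:
    using Handler = std::function<void(MemoryPressure)>;

    void subscribe(Handler handler) { m_handlers.push_back(std::move(handler)); }

    void notify(MemoryPressure level) const
    {
        for (auto const& handler : m_handlers)
            handler(level);
    }

private:
    std::vector<Handler> m_handlers;
};

// Caches freely while memory is plentiful, drops everything under pressure.
class ThumbnailCache {
public:
    // The cache must outlive the notifier's use of the handler, because the
    // handler captures `this`. Copying is disabled for the same reason.
    explicit ThumbnailCache(PressureNotifier& notifier)
    {
        notifier.subscribe([this](MemoryPressure level) {
            if (level == MemoryPressure::Low)
                m_entries.clear();
        });
    }
    ThumbnailCache(ThumbnailCache const&) = delete;
    ThumbnailCache& operator=(ThumbnailCache const&) = delete;

    void put(std::string key, std::vector<unsigned char> pixels)
    {
        assert(!key.empty());
        m_entries[std::move(key)] = std::move(pixels);
    }

    bool contains(std::string const& key) const { return m_entries.count(key) != 0; }
    std::size_t size() const { return m_entries.size(); }

private:
    std::unordered_map<std::string, std::vector<unsigned char>> m_entries;
};
```

Anything in the cache can be recomputed, so dropping it costs time but never data, which is what makes it a good first thing to give up before any process gets killed.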
The obsession with ErrorOr, TRY() and malloc(8) possibly failing is making our codebase ugly and annoying to work on.
This is basically my fault, since I started the trend by falling in love with TRY() in the kernel and then bringing it to userland without properly considering the cost/benefit.
Meticulous OOM handling and propagation is vital in some contexts, such as:
Gfx::Bitmap)
However, I don't believe that we're gaining anything by obsessively worrying about every tiny heap allocation possibly failing.
Also, in complex libraries that are essentially virtual machines like LibWeb, LibJS, and LibPDF, making a meaningful recovery from a tiny OOM is basically impossible. We are better off crashing the program at that point (and keeping the crash contained UI-wise by way of process separation).
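To illustrate the split being argued for, here is a hedged sketch in plain standard C++ rather than the AK types (so no ErrorOr or TRY()). The Bitmap struct, try_create_bitmap(), make_window_title() and open_image_window() are hypothetical stand-ins: the large, arbitrarily sized allocation is fallible and the caller checks it, while the tiny allocation uses ordinary throwing new, so a failure there ends the process (an uncaught std::bad_alloc reaches std::terminate).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <memory>
#include <new>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a large pixel buffer; not the real Gfx::Bitmap.
struct Bitmap {
    std::size_t width;
    std::size_t height;
    std::unique_ptr<std::uint32_t[]> pixels;
};

// Large or arbitrarily sized allocation: worth handling failure explicitly.
std::optional<Bitmap> try_create_bitmap(std::size_t width, std::size_t height)
{
    // Reject sizes whose byte count would overflow size_t.
    constexpr std::size_t max_pixels = std::numeric_limits<std::size_t>::max() / sizeof(std::uint32_t);
    if (width != 0 && height > max_pixels / width)
        return std::nullopt;

    // nothrow new returns nullptr instead of throwing on failure.
    std::unique_ptr<std::uint32_t[]> pixels(new (std::nothrow) std::uint32_t[width * height]());
    if (!pixels)
        return std::nullopt;
    return Bitmap { width, height, std::move(pixels) };
}

// Tiny allocation: no error plumbing. If this runs out of memory, the
// process crashes, and process separation keeps the damage contained.
std::string make_window_title(std::string const& app_name)
{
    return app_name + " - Untitled";
}

// The caller only has one failure path to think about: the big buffer.
bool open_image_window(std::size_t width, std::size_t height)
{
    auto bitmap = try_create_bitmap(width, height);
    if (!bitmap)
        return false; // e.g. show "image too large" to the user

    std::string title = make_window_title("ImageViewer");
    assert(!title.empty());
    return true;
}
```

The caller ends up with exactly one error path, for the allocation where recovery is meaningful, instead of a TRY() on every line.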