kswapd0 going crazy

simonschnake commented 8 years ago

I haven't had the chance to test GalliumOS yet, but this was always a problem with the earlier Ubuntu versions from HugeGreenBug. When the RAM is full, the kernel process kswapd0 starts using one core at up to 100%. Maybe you have already fixed this problem, or it's specific to my Toshiba Leon Chromebook, but if not, it would be great if you could look into it. Sincerely, Sim

ghost commented 8 years ago

Please confirm that the issue still exists in GalliumOS when you get the chance and I'll reopen it.

hugegreenbug commented 8 years ago

A lot of additional optimizations have gone into GalliumOS compared to the Ubuntu 15.04 based distro on distroshare. I have tested in 2GB and 4GB configurations and haven't noticed unusual kswapd behavior.

reynhout commented 8 years ago

Reopening with new data. Lots of information at: https://www.reddit.com/r/GalliumOS/comments/404o4e/kswapd0_at_99_observations_and_conclusion_kernel/

reynhout commented 8 years ago

Problem appears most commonly when running Chrome/ium, presumably due to the process-per-tab model, effectively running multiple large memory processes simultaneously. Has been repro'ed without Chrome/ium, but only with greatly reduced RAM enabled.

@hugegreenbug has suggested testing Chrome/ium in a cgroup to see if it contains the runaway.

Some kernel versions apparently do not exhibit the behaviour, but the GalliumOS current (4.1.14) is one of those that do. If we can identify a better candidate kernel, that might also be a path to pursue.

marcosfede commented 8 years ago

Downgraded to kernel 4.0.8-hgb and the issue is much less frequent. It still shows up on certain occasions under high memory pressure, but goes away much faster when closing Chrome tabs.

reynhout commented 8 years ago

Some more observations: I can repro the problem reliably under GalliumOS kernel 4.1.14 and 4.1.6, swappiness 10, when swap is enabled (tested with zram).

I cannot repro the problem at all under mainline kernel 3.19.current, nor GalliumOS kernel 4.0.8-hgb (which was never released as a stable package). @marcosfede: On 4.0.8, I did see some kswapd activity, but it never exceeded 30% or so, and it went right back down to near zero within a couple seconds without freeing any memory.

I was having trouble triggering it consistently under Chromium, so I made some test code to allocate 50-150MB at a time, scaling up and down under user control. This made it very easy and reliable to trigger.

In all cases, the system would absorb large allocations up to about 2.7GB (0.2GB already in use by the OS), but then any additional allocation larger than the current freemem size made kswapd jump to 99% in the span of a few seconds, and stay there until some chunk of memory was freed. As observed previously, it's not necessarily the size freed, and it's rarely the most-recently-created that needs freeing.

My 2GB C720 was completely functional at all times, no lag at all -- except for one time when I managed to completely wedge something: the window manager was hung of course, but there was also no response on TCP ports. Had to force reset that time.

Other interesting bits: zram is almost like magic, but could have done a much better job compressing multi-megabyte allocations of a single character (for the purposes of this test, I'm glad it didn't do better...reading /dev/random would be slow!). Malloc started returning errors at about 3.5(+0.2)GB total.

Next up: cgroups and other kernels. I'm guessing that cgrouping Chromium child pids will help only if the total memory use stays below swap thresholds. I don't know if it would be a good idea to cgroup kswapd0...in my measurements, normal usage stays well under 30% almost all the time, and containing it there would be fine (perhaps introduce occasional pauses)...but obviously it doesn't address the real problem, and I'm not sure what happens if kswapd has real work to do but is constrained by all the thrashing AND a cgroup limit. Perhaps no worse for kswapd than thrashing with the natural single-thread limit, and better for the user?

hugegreenbug commented 8 years ago

It could be that the scheduler patches in the galliumos kernel are making the issue worse. However, I have had reports of the issue in my old distroshare distros that were using 3.xx kernels and didn't have the patches.

An easy way to test the cgroup changes is with the systemd slices. The user slice is: /sys/fs/cgroup/memory/user.slice/ and there is a slice for each user. Yours is probably: /sys/fs/cgroup/memory/user.slice/user-1000.slice . In each slice directory, there is a memory.swappiness and a memory.limit_in_bytes file. You can try to adjust those and see if they make a difference.
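A minimal sketch for inspecting those two files, assuming a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory and a uid of 1000. The helper name `slice_show` is made up for illustration, and on cgroup v1 the limit file is named memory.limit_in_bytes:

```shell
# slice_show DIR: print the swappiness and memory limit of one cgroup directory.
# DIR defaults to the uid-1000 user slice (an assumption; adjust for your system).
slice_show() {
  dir=${1:-/sys/fs/cgroup/memory/user.slice/user-1000.slice}
  for f in memory.swappiness memory.limit_in_bytes; do
    if [ -r "$dir/$f" ]; then
      printf '%s: %s\n' "$f" "$(cat "$dir/$f")"
    else
      printf '%s: not available\n' "$f"
    fi
  done
}

# Example (changing a value needs root):
#   slice_show
#   echo 10 | sudo tee /sys/fs/cgroup/memory/user.slice/user-1000.slice/memory.swappiness
```

Reading before and after a change makes it easy to confirm that the write actually took effect in the slice you meant to adjust.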

If you boot the kernel with much less memory (e.g., 512M), I would guess that you will see the issue on every kernel. You can do that with the kernel parameter: mem=512M
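To confirm that the limit took effect after rebooting, something like the following could be used. It is only a sketch: the helper name `mem_total_mb` is hypothetical, and it just reads the standard MemTotal line from /proc/meminfo.

```shell
# mem_total_mb FILE: print MemTotal from a meminfo-style file, in whole MB.
# FILE defaults to /proc/meminfo; MemTotal is reported there in kB.
mem_total_mb() {
  awk '$1 == "MemTotal:" { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Example, after booting with mem=512M appended to the kernel command line:
#   grep -o 'mem=[^ ]*' /proc/cmdline
#   mem_total_mb
```

The reported total is normally somewhat below the nominal figure, since the kernel reserves part of the memory for itself.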

hugegreenbug commented 8 years ago

Actually, I don't think the slices work at all with the bfs scheduler. So, never mind.

hugegreenbug commented 8 years ago

@marcosfede I have been trying to reproduce the issue for a long time and I can't. I don't have a 2GB model though, but I have been reducing the memory of my 4GB to 2 with a kernel boot param. I was wondering, are you using the stock firmware?

marcosfede commented 8 years ago

@hugegreenbug Do you mean the regular seabios? Then yes, I'm using the stock firmware. I don't think this is a Gallium problem; it happens to me with every distro I've tried that uses a 4.x kernel, even plain Arch, with or without a swap partition. @reynhout could you share your code to trigger kswapd0 on demand for testing purposes?

hugegreenbug commented 8 years ago

@marcosfede I meant the whole coreboot rom. John Lewis has a full replacement coreboot rom for the c720. I have no idea if that will make a difference. I just find it strange that I can't reproduce it when I lower the memory to 2GB. I see kswapd increasing in cpu usage, but it goes back to normal immediately after I kill a process. If it was a kernel bug related to memory management with only 2GB of ram, then I would be able to reproduce it. It must be something else. We should probably trace the kernel to see what is going on.

I know it isn't specific to GalliumOS. Our goal is to make this the best Linux distro for ChromeOS devices. That includes trying to fix as many bugs as possible that affect ChromeOS devices.

reynhout commented 8 years ago

@hugegreenbug Do you think it's at all possible for the MMU to hide the problem from the kernel, when there's excess physical memory to absorb it? I know that the MMU operates autonomously, but I don't know if the kernel relies on hardware error flags for any of its internal memory management.

That kernel parm isn't the absolute limit, or else zram wouldn't work -- and there are also the kernel tunables that control when and how to fail allocation requests...but I'm not sure how far the request/alloc/return cycle really reaches toward the CPU/MMU.

Also, FWIW, my 2GB C720 only reports 1877MB total.

@marcosfede I'll attach the code to this ticket. It's a little C program that just allocates a block of memory and holds on to it until killed (or the sleep timer expires). And then a set of shell functions to start the allocators, kill them, etc:

## setup
gcc -o memalloc memalloc.c
source memf

## usage   
a 25     ## starts 25 memallocs (default 100MB each)
a 10 50  ## starts 10 memallocs, 50MB each
k        ## kills all running memallocs
c        ## counts running memallocs, but doesn't calculate sizes.

I watch the system state in top:

top -d 1
## (inside top, type "O" and then "COMMAND=kswapd" to watch just that process)

...can't attach these file types, so:

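/* memalloc.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
  int block_mb = 100;   /* default block size in MB */
  size_t bytes;
  char *buf;

  if (argc > 1) {
    block_mb = atoi(argv[1]);
  }

  /* compute the size in size_t so large blocks don't overflow int */
  bytes = (size_t)block_mb * 1024 * 1024;

  printf("allocing %dMB: ", block_mb);
  buf = malloc(bytes);
  if (! buf) {
    printf("FAILED!\n");
    exit(EXIT_FAILURE);
  }
  printf("ok\n");
  memset(buf, 'x', bytes);   /* touch every page so the memory is really used */
  sleep(180);                /* hold the block until killed or the timer expires */
  return 0;
}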

## memf
function a() { i=1; while [ $i -le ${1:-1} ]; do ./memalloc $2 & i=$((i+1)); done; }
function k() { killall memalloc; }
function c() { ps auxww | grep memalloc | grep -v grep | wc -l; }
function f() { while [ 1 ]; do echo; free -m; sleep 1; done; }
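Since `c` only counts processes, a companion helper could total their resident memory. This is a sketch rather than part of the original memf: the names `sum_rss_mb` and `s` are made up here, and it assumes a procps-style `ps` that accepts `-eo rss=,comm=`.

```shell
# sum_rss_mb: read "RSS_KB COMMAND" lines on stdin and print the total
# resident size, in whole MB, of lines whose command is exactly "memalloc".
sum_rss_mb() {
  awk '$2 == "memalloc" { kb += $1 } END { printf "%d\n", kb / 1024 }'
}

# s: total resident memory of all running memallocs.
s() { ps -eo rss=,comm= | sum_rss_mb; }
```

Note that RSS only covers pages currently resident in RAM, so anything already pushed out to swap (zram included) will not show up in the total.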

hugegreenbug commented 8 years ago

@reynhout I don't know enough about the mmu, the kernel, or the firmware to know the answer.

I'm not sure what you mean about the absolute limit. Do you mean physical memory + swap? The mem kernel boot parameter sets a limit on the physical memory the kernel uses. Zram creates a swap space that is 1.5 * physical memory size. I've also tried 1GB and 512M. With 512M, kswapd works a lot harder, but there is much less memory to work with. It still never gets to 99% cpu usage randomly.

reynhout commented 8 years ago

I mean the absolute limit when allocations start failing. I thought of zram because that limit isn't just the sum of kernel-configured RAM plus swap. The kernel gets a chance to approve or deny allocation requests by testing availability in some way (and also subject to those tunables). Zram is part of the kernel ultimately, so it might just be math based on the hard numbers plus some estimate of compression efficiency.

Anyway, I don't really know what I'm talking about here. Just speculating that a kernel configured limit might not trip the same management code as an error returned from the CPU/MMU, if that happens. And that the MMU might not set error flags because it has plenty of physical space free.

I'll try triggering the problem on kernel-configured 1024MB to see if there are any changes in behaviour.

hugegreenbug commented 8 years ago

@reynhout I see what you are saying now. I'm not sure how the kernel approves and denies requests. It is possible that I'm not testing the same thing.

reynhout commented 8 years ago

I was not able to trigger the problem with mem=512M and mem=1024M. I saw increased kswapd activity, but no instances where the process would hit 99%+ of CPU and stay there indefinitely. After removing the kernel limit, I was able to trigger it again at will. Here's a quick video https://galliumos.org/tmp/kswapd-demo.mp4

I only let it run for a few secs in the video, but I've previously let it ride for hours in that state (though only a couple of times). I've never seen it recover after it's been pegged for more than 10-15 seconds, until allocators are killed. I've killed 10 of them (1GB) without any change in kswapd, but then one more, seemingly random, certainly unpredictable, will quickly cause a full recovery to 0%.

hugegreenbug commented 8 years ago

I just found this: http://lkml.iu.edu//hypermail/linux/kernel/1601.2/03564.html. So, never mind about the firmware thing. I will make a test kernel with the recommended change.

hugegreenbug commented 8 years ago

@reynhout That's interesting. I must not have set the right amount then. As you said, you don't really have 2GB available.

reynhout commented 8 years ago

Awesome, that link definitely sounds like it's describing the same problem. From today, no less. Nice find! :)

hugegreenbug commented 8 years ago

Well, based on this issue and that kernel mailing list link, I did a diff on 4.0 compared to 4.0.8, instead of 4.0 and 4.1rc1. This was the only difference that seemed relevant: https://drive.google.com/file/d/0B6zPD2kAJoTJUGViSFd6dDVUc00/view?usp=sharing . Here is a test kernel with the patch: https://drive.google.com/file/d/0B6zPD2kAJoTJeXVVVnFpb0dzamc/view?usp=sharing . The recommended function from the lkml thread wasn't missing in 4.0.8 and I'm not sure that it would apply as it was for transparent huge pages which are supposed to be disabled by default, so I didn't include it in the patch.

hugegreenbug commented 8 years ago

I see that @reynhout couldn't reproduce the issue on 4.0.8, so it probably wasn't a good comparison.

hugegreenbug commented 8 years ago

That patch could still make a difference. If it doesn't, then there is too much that changed between 4.0 and 4.1.14 for me to guess. I would just put all of the 4.0 memory management code in 4.1.14.

marcosfede commented 8 years ago

Just tried your kernel; the issue is still present. How do I force the bootloader to ask me which kernel I want to boot on startup?

hugegreenbug commented 8 years ago

Ok, thanks. Press Esc right after seabios starts to boot from a device.

marcosfede commented 8 years ago

Some videos with different kernels, using @reynhout's code:
kernel 3.19.0.74-generic: no problems
test kernel @hugegreenbug provided
regular 4.1.6-galliumos

I also tried to test the 4.0.8-hgb I had used before, but wasn't able to find it. Maybe it was removed from the repos? Let me know if there's anything else I can test. Thanks :)

marcosfede commented 8 years ago

Never mind, found the 4.0.8-hgb, and now I'm unable to reproduce the bug! It seems that the problem is somewhere between 4.0.8 and 4.1.6 (video).

hugegreenbug commented 8 years ago

New test kernel: https://drive.google.com/file/d/0B6zPD2kAJoTJSGFQWG45UnhQb1U/view?usp=sharing . This kernel has the mm/vm* files from 4.0.

reynhout commented 8 years ago

Still seeing the same behaviour on kernel 4.1.14-kswapd-test_2. It might be slightly less easy to trigger, but not enough to be certain.

I went back and tested 4.0.8-hgb again to make sure I wasn't misreporting, but again 4.0.8-hgb is solid. kswapd clearly gets active, but never pegs.

Will try some more interstitial versions to see if I can learn anything.

hugegreenbug commented 8 years ago

Ok. I'll have another test kernel tonight with more from 4.0. I still can't reproduce it.

reynhout commented 8 years ago

4.0.9-mainline looks good, 4.1.0-mainline is bad. 4.0.9 is the last release of the 4.0 series, but I'll try the 4.1 release candidates to see if it's possible to narrow the diff range usefully.

reynhout commented 8 years ago

Looks like 4.1.0rc1-mainline also has the bug. The early 4.1 series kernels seem to be worse than the later 4.1 series -- when triggered in the early versions, kswapd doesn't always go back down even when all allocators are killed. I didn't test between 4.1.0 and 4.1.6, but that behaviour is improved in 4.1.6.

hugegreenbug commented 8 years ago

New test kernel with only the recommended fix from the kernel bug report: https://drive.google.com/file/d/0B6zPD2kAJoTJRkVwOS14OWYwRms/view?usp=sharing

reynhout commented 8 years ago

New kernel looks good! I ran about a dozen tests, and kswapd stayed under control in every case except one, which happened to be the first test. Since I was expecting the usual bad behaviour, I didn't give it much time after it hit 100%, but instead allocated a few more chunks...and kswapd immediately recovered!

On subsequent tests I was not able to get it to peg at all, though it did jump to 50-70% for a second or two. It always recovered on its own, no memory released nor allocated.

So it looks like this is fixed -- it's possible that in normal use it might still trigger occasionally, but it might just be "working" and it will either recover on additional allocation, or heal itself if given a little bit of time. This might or might not be the way it's expected to work/has always worked/etc...but it definitely is a huge improvement over where we were yesterday.

I'll do some more testing today, but @hugegreenbug you should reply on that kernel thread to let people know that this is (at least) a big part of the solution.

hugegreenbug commented 8 years ago

It sounds pretty normal. Ok, I'll reply. Thanks!

marcosfede commented 8 years ago

tested your last kernel, works great! I do see some kswapd0 activity but it appears pretty normal and it quickly goes back to 0. will this fix be included in future Gallium kernels? or should we wait until they fix this upstream? thanks for your time guys

hugegreenbug commented 8 years ago

@marcosfede That's great. Yes, it will be included in future GalliumOS kernels. I've just added the patch to the kernel in testing.

marcosfede commented 8 years ago

Did some more testing, opening lots of tabs with Chrome this time, instead of using the C program to allocate memory. While this kernel is much, much better in terms of kswapd0 cpu usage, I can still manage to trigger it sometimes. I loaded about 30 tabs until my memory couldn't handle it, and kswapd0 stayed at 40-60%. On one occasion it stayed at 100% even when I closed about half my tabs, until I closed another one and then it dropped to 0. In other tests, cpu usage is high but does not get to 100%, and it goes away when I free memory. Just writing this here to let everyone know that while this is not a complete fix for the problem, the kernel fix helps a LOT with this issue (at least for me).

hugegreenbug commented 8 years ago

@marcosfede Ok. Also /u/wobh is still experiencing the issue: https://www.reddit.com/r/GalliumOS/comments/404o4e/kswapd0_at_99_observations_and_conclusion_kernel/cz9ugta?context=3.

Since the patch seemed to make a difference, the issue could be due to transparent hugepages, as that is what the patch was for. I found that transparent hugepages were set to always be enabled. Here is a test kernel with them disabled: https://drive.google.com/file/d/0B6zPD2kAJoTJOUxOWDVQcUhyZFU/view?usp=sharing .

I also found in the documentation for transparent huge pages that the value in /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap could cause excessive swap: https://www.kernel.org/doc/Documentation/vm/transhuge.txt . If this test kernel proves to be successful, we could either keep transparent huge pages disabled or try to tune it.
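For anyone who wants to check or change the setting without a rebuilt kernel, here is a rough sketch. The helper name `thp_mode` is hypothetical; the sysfs file it reads is the one described in transhuge.txt, which marks the active mode with square brackets.

```shell
# thp_mode FILE: print the active mode from a transparent_hugepage "enabled"
# file, whose content looks like "[always] madvise never".
thp_mode() {
  file=${1:-/sys/kernel/mm/transparent_hugepage/enabled}
  sed -n 's/.*\[\(.*\)\].*/\1/p' "$file"
}

# Example (the write needs root and lasts until reboot):
#   thp_mode
#   echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```

The runtime switch is convenient for a quick comparison on a stock kernel, while booting with transparent_hugepage=never on the kernel command line would cover the whole session.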

reynhout commented 8 years ago

I've been able to trigger kswapd pegging CPU again, in AbiWord (no other user apps open except the Xfce terminal and top) by copy/pasting 3000 pages of text (write a sentence, select all, copy, down arrow, paste, select all, copy, down arrow, paste ...). It takes a while, because AbiWord is also consuming 99% CPU, and crashed a couple times. Eventually AbiWord settled down but kswapd stayed high for 20+ mins afterward, until AbiWord was closed.

Reopening the saved file in AbiWord was not a problem.

I've been trying to find a simpler test case, but so far failing. I can get kswapd to sit at 40% by allocating thousands of small chunks (0.25-2MB), up to the very limit of possible memory allocations...but no higher than 40%. My guess there is that everything is working properly, and zram is working hard to make things better but has such little work space it isn't making progress quickly.

I will test the new kernel as soon as I can make something interesting happen.

hugegreenbug commented 8 years ago

Kirill A. Shutemov asks if we could provide the values of /proc/sys/vm/min_free_kbytes before and after the patch we just tested.

reynhout commented 8 years ago

2GB C720, immediately after boot, with GalliumOS sysctl.conf as delivered:

4.1.14-galliumos (release): 5391
4.1.14-galliumos-kswapd-test_3: 67584
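To put those two readings in more familiar units, a small helper like the one below could be used. It is a sketch with a made-up name (`min_free_mb`); it simply divides the kB value by 1024 using integer arithmetic.

```shell
# min_free_mb FILE: print min_free_kbytes converted to whole MB.
# FILE defaults to /proc/sys/vm/min_free_kbytes.
min_free_mb() {
  file=${1:-/proc/sys/vm/min_free_kbytes}
  echo $(( $(cat "$file") / 1024 ))
}
```

By that conversion the release kernel keeps about 5MB in reserve (5391 kB), while the test_3 kernel keeps 66MB (67584 kB), roughly twelve times more.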

reynhout commented 8 years ago

I'm going to close this as "improved". The problem lies in upstream kernels, so we will continue to watch for fixes that we can backport or upgrade to.

Please open a new ticket referencing this one, if there is new actionable information.

marcosfede commented 8 years ago

some news and improvements on this thread https://bugzilla.kernel.org/show_bug.cgi?id=65201

hugegreenbug commented 8 years ago

I guess we could do that (i.e., change the memory to be less than physical memory) until they fix it. I'm a little concerned that different machines may report different amounts of physical memory even though they each have 2GB. I suppose a loss of a few MBs isn't a huge deal. So, we would need to find a number that works for all.

GalliumOS / galliumos-distro

kswapd0 going crazy #52