tytgatlieven opened this issue 1 year ago
Show cat /proc/buddyinfo before and after the problem occurs. I'm seeing similar behavior since some of the latest updates for kernel 6.1. While I don't get the exact same behavior, I'm seeing the kernel generating memory pressure, flushing cache and sending memory to swap despite multiple gigabytes of memory announced as free, resulting in IO thrashing.
buddyinfo shows how many free blocks of each size exist (column 1 = order 0 = 4k × 2^0, column 2 = order 1 = 4k × 2^1 = 8k, then 16k, 32k, ...). If the kernel cannot find a free block that can hold an allocation, it will swap data out or fail the allocation. This is because the kernel itself allocates physical pages, not virtual memory (thus it cannot split a big allocation across scattered free pages and merge them virtually into one contiguous range of memory: if something doesn't fit, it doesn't fit; this is called memory fragmentation).
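As a quick way to read those columns, here is a minimal sketch (assuming the usual 4 KiB base page size) that labels each buddyinfo count with the block size it refers to:

awk '{ printf "%s %s zone %-8s", $1, $2, $4; for (i = 5; i <= NF; i++) printf " %d x %dkB", $i, 4 * 2^(i-5); print "" }' /proc/buddyinfo

Each input column then shows up as "count x blocksize", e.g. "85 x 1024kB" for order 8.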
Disabling the memory cgroup controller (and using generational LRU instead) somewhat mitigated that for me; setting transparent huge pages to madvise also helped, but it is still not fully fixed.
What you want to see in buddyinfo is high numbers in the high-order columns. If it peaks in the low-order columns, the kernel should (and will) try to compact movable pages into larger blocks, thus defragmenting free space. This can take a few seconds, so you may want to take multiple snapshots of buddyinfo. If pages do not migrate to higher-order free space, you should check whether your system has a lot of non-movable pages or huge pages.
User-space pages are movable (because they are addressed indirectly through page table lookups). Buffers for hardware are usually not movable. Page cache is, I think, not movable either, but it is reclaimable.
Your error indicates it's trying to get an order-9 allocation (4k × 2^9 = 2M), so buddyinfo probably doesn't show anything left in the order-9 column and beyond.
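To put that order into bytes (assuming 4 KiB base pages), a quick shell check; the 6291456-byte vmalloc from the trace further down would need three such blocks:

echo $(( 4096 << 9 ))               # 2097152 bytes = 2 MiB per order-9 block
echo $(( 6291456 / (4096 << 9) ))   # 3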
Something bad is going on with memory allocations in the kernel since 6.1.
Sorry for the delay.
It is hard to accurately snapshot the buddyinfo before and after the event. I have run cat /proc/buddyinfo continuously while doing a tail -f on syslog, and this is the buddyinfo before and after the error. Is there a more accurate way of getting this info?
/proc/buddyinfo before:
Node 0, zone      DMA      0      0      0      0      0      0      0      0      1      1      2
Node 0, zone    DMA32     61     50    341    879    741    548    368    171     85     22      1
Node 0, zone   Normal  51878  58470  54194  56426  40574  24602  14169   8418   9425      0      0

/proc/buddyinfo after:
Node 0, zone      DMA      0      0      0      0      0      0      0      0      1      1      2
Node 0, zone    DMA32     62     48    341    877    740    549    367    171     84      7      0
Node 0, zone   Normal  60753  58348  54154  56373  40536  24567  14150   8407   9409      5      0
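One way to tie the snapshots more closely to the event (a minimal sketch; the log path and interval are arbitrary) is to record buddyinfo with timestamps in a loop and correlate them with the syslog timestamps afterwards:

while true; do
    date '+%F %T'
    cat /proc/buddyinfo
    sleep 1
done >> /var/tmp/buddyinfo.log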
That indicates that both before and after the event the memory is already very fragmented - and the fact that it is fragmented beforehand is probably why this happens in the first place.
Could you look at it after a fresh reboot, then watch how it develops while using the system? Maybe you can identify an action or workload on your system that is causing this behavior.
As a first countermeasure you could try disabling huge pages after a fresh reboot (while buddyinfo still shows low numbers on the left side and high numbers on the right side):
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
If this helps but you feel like you want to use huge pages (because they lower TLB cache misses and can increase performance for some workloads by up to 10%), try this as a next step (I am using these settings; they cause around 1 GB of unused memory on my desktop system under memory pressure when memory is partially fragmented, instead of 4-8 GB with huge pages always turned on):
echo 64 | sudo tee /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
echo 8 | sudo tee /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
echo 32 | sudo tee /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared
echo within_size | sudo tee /sys/kernel/mm/transparent_hugepage/shmem_enabled
echo defer+madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
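If these settings work out, one possible way to make them persistent across reboots is a tmpfiles.d drop-in (a sketch assuming a systemd-based distro; the file name /etc/tmpfiles.d/thp.conf is arbitrary):

sudo tee /etc/tmpfiles.d/thp.conf >/dev/null <<'EOF'
w /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none - - - - 64
w /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap - - - - 8
w /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared - - - - 32
w /sys/kernel/mm/transparent_hugepage/shmem_enabled - - - - within_size
w /sys/kernel/mm/transparent_hugepage/defrag - - - - defer+madvise
w /sys/kernel/mm/transparent_hugepage/enabled - - - - madvise
EOF

systemd-tmpfiles applies the file at boot; running sudo systemd-tmpfiles --create thp.conf applies it immediately.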
It tells transparent huge pages to only create huge pages for madvise memory regions (thus, when an application explicitly asks for it; bees does this for the hash table). It also tells the kernel to defer defragmenting huge pages for better latency (but this tends to delay seeing the immediate effects of bad memory layout). Depending on your workload, you may have better results with defrag = always at the cost of higher memory allocation latency (fine for servers, not so for desktops). within_size tells the kernel to use huge pages for shared memory only if the allocation is at least 2 MB. Again, you may have better results when setting it to never.
The max_ptes_{none,swap,shared} settings tell the defragger when to combine 4k pages into huge pages during compaction: only if no more than 64 of the 4k pages of a 2M page candidate are NOT YET allocated (max_ptes_none) will compaction combine those pages into one 2M page (sacrificing up to 64 × 4 = 256 kB of RAM). Similarly for swap: compaction to 2M will only occur if no more than 8 pages must be swapped in. And for shared: compaction to 2M will only occur if no more than 32 of the 4k pages would be unshared in the process.
You can cat each sysfs file to see the current and possible settings, so you can experiment with it.
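A quick way to dump them all (in multi-choice files the bracketed value is the active one; exact output will vary):

grep -r . /sys/kernel/mm/transparent_hugepage/ 2>/dev/null
# e.g.:
# /sys/kernel/mm/transparent_hugepage/enabled:always [madvise] never
# /sys/kernel/mm/transparent_hugepage/defrag:always defer [defer+madvise] madvise never
# /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none:64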
Background: 2M pages leave less, and more fragmented, free memory because the holes left for buddy allocation tend to be smaller. They also create more memory pressure, often causing the kernel to flush out cache early and create "seemingly free" memory which actually cannot be used because it is too fragmented. I think this is exactly what you initially described.
Using and tuning huge pages is a question of cost vs benefit: memory loads become up to 10% faster at the cost of reduced usable memory. If your system allocates memory in bad patterns, the cost easily becomes very high, in which case you may want to disable huge pages completely or identify the process which is causing it. Btrfs itself seems to spike buddy memory allocations quite often, which increases the cost of huge pages.
BTW: 2M pages cannot be swapped. They need to be broken back up into 4k pages for swapping. I'm not sure if the kernel does this by default or if there's a tunable for when this should happen.
@kakra My setup: 56 GB memory, btrfs filesystems.
Tests
I have disabled transparent huge pages immediately after a reboot and disabled most background processes (3 GB actual memory consumption, almost no fs usage): the issue persists on both filesystems
I have upgraded my kernel to 6.4.0-rc4: issue persists
I have reduced the bees db on my raid from 16GB to 4GB: issue persists
I have verified bees using a 4GB db on my ssd: issue persists
I did notice that a btrfs scrub makes buddyinfo behave similarly to bees, though no vmalloc errors occur
How can we determine if this is a regression in bees when using newer kernels or if it is the kernel's btrfs code used by crawl/dedup itself?
A new traceback:
May 31 14:16:37 ltytgat-desktop kernel: [ 1460.334233] warn_alloc: 52 callbacks suppressed
May 31 14:16:37 ltytgat-desktop kernel: [ 1460.334239] crawl_258_10978: vmalloc error: size 6291456, page order 9, failed to allocate pages, mode:0xcc2(GFP_KERNEL|GFP_HIGHMEM), nodemask=(null),cpuset=system-beesd.slice,mems_allowed=0
May 31 14:16:37 ltytgat-desktop kernel: [ 1460.334257] CPU: 11 PID: 12028 Comm: crawl_258_10978 Tainted: G OE 6.4.0-060400rc4-generic #202305281232
May 31 14:16:37 ltytgat-desktop kernel: [ 1460.334261] Hardware name: MSI MS-7760/X79A-GD45 Plus (MS-7760), BIOS V17.9 12/08/2014
May 31 14:16:37 ltytgat-desktop kernel: [ 1460.334263] Call Trace:
May 31 14:16:37 ltytgat-desktop kernel: [ 1460.334266]
How can we determine if this is a regression in bees when using newer kernels or if it is the kernel's btrfs code used by crawl/dedup itself?
By definition, user space software must never be able to cause kernel oopses or traces [1] - so this is a kernel regression. Does it work fine with an older kernel then?
[1]: bees does make some effort to work around such issues, though - but that doesn't make it bees's fault
It does work fine with kernel 6.2.13 (no kernel traces). The buddyinfo behaves identically to the buddyinfo in the non-working 6.3.1 case.
I do agree that userspace should never be able to cause an oops or trace, so indeed it should not be bees's fault.
Actually, I feel like memory fragmentation is becoming a bigger issue when running btrfs with each kernel cycle. I'm currently running 6.1 and see very high order-0 values in buddyinfo, and get oopses or IO thrashing - while it worked fine in the previous LTS kernel (and thus I never looked at buddyinfo). Using memory cgroups seems to worsen the problem, but that may be an effect of using bees and how cache ownership works in memory cgroups.
One of our servers running 6.1 had buddyinfo with order 0 in the millions - and increasing RAM for it only worsened the problem for some reason. This hasn't been an issue with the previous 5.19. With transparent hugepages completely turned off it now behaves mostly as expected but the order 0 numbers are still very high.
There's another metric you could look at: /proc/pagetypeinfo - but I'm not yet sure how to properly read that.
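For what it's worth, /proc/pagetypeinfo may need root on recent kernels, and its per-order columns are broken down by migrate type (Unmovable, Movable, Reclaimable, ...), which lines up with the movable/non-movable distinction above. A minimal way to snapshot it next to buddyinfo (the log path is arbitrary):

sudo sh -c 'date "+%F %T"; cat /proc/buddyinfo /proc/pagetypeinfo' >> /var/tmp/fraginfo.log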
I have noticed that even without bees running, vmalloc errors occur after some time due to other services. Hence, it becomes clear that this is either:
If I do an echo 1 > /proc/sys/vm/drop_caches, the low-order numbers jump up and are reduced after some time. The order-9 numbers increase, and the vmalloc errors go away for some time, until the order-9 numbers are reduced to 0 again.
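For reference, there is also an explicit compaction trigger (assuming the kernel is built with CONFIG_COMPACTION) that tries to rebuild high-order free blocks, with or without dropping caches first:

echo 1 | sudo tee /proc/sys/vm/drop_caches      # drop page cache (3 also drops dentries and inodes)
echo 1 | sudo tee /proc/sys/vm/compact_memory   # ask the kernel to compact all zones
cat /proc/buddyinfo                             # the order-9/10 columns should rise if compaction succeeded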
Does anybody have a contact in the BTRFS development community where this could be raised?
These are the errors I get:
Jun 13 15:00:30 ltytgat-desktop kernel: [110969.198519] kded5: vmalloc error: size 10485760, page order 9, failed to allocate pages, mode:0x400cc2(GFP_KERNEL_ACCOUNT|__GFP_HIGHMEM), nodemask=(null),cpuset=user.slice,mems_allowed=0
Jun 13 15:00:30 ltytgat-desktop kernel: [110969.198530] CPU: 7 PID: 147280 Comm: kded5 Tainted: G W OE 6.3.7-060307-generic #202306090936
Jun 13 15:00:30 ltytgat-desktop kernel: [110969.198532] Hardware name: MSI MS-7760/X79A-GD45 Plus (MS-7760), BIOS V17.9 12/08/2014
Jun 13 15:00:30 ltytgat-desktop kernel: [110969.198533] Call Trace:
Jun 13 15:00:30 ltytgat-desktop kernel: [110969.198535]
Jun 13 15:24:24 ltytgat-desktop kernel: [112402.962588] bash (161844): drop_caches: 3
This is a good find. Maybe hop over to IRC #btrfs then, there are quite a few btrfs devs active there.
The bug is:
v6.3-rc6: f349b15e183d mm: vmalloc: avoid warn_alloc noise caused by fatal signal
The fixes are:
v6.4: 95a301eefa82 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
v6.3.10: c189994b5dd3 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
The bug has been backported to LTS, but the fix has not:
v6.2.11: 61334bc29781 mm: vmalloc: avoid warn_alloc noise caused by fatal signal
v6.1.24: ef6bd8f64ce0 mm: vmalloc: avoid warn_alloc noise caused by fatal signal
v5.15.107: a184df0de132 mm: vmalloc: avoid warn_alloc noise caused by fatal signal
So kernel 6.3.10 and 6.4 are good to go, but now the LTS kernels are broken.
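A quick sanity check for a given machine (a version comparison only; it will not detect distro-specific backports of the fix):

uname -r
# 6.3.10+ and 6.4+ contain the fix; 5.15.x and 6.1.x LTS may carry the backported bug without it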
I was just going to mention that in 6.5-rc1 this issue seems to have been tackled, but then I saw the comment above from @Zygo.
Do I understand it correctly that in essence this was not an error, only an unfortunate kernel log message?
in essence this was not an error, only an unfortunate kernel log message?
Yes. The underlying error condition behind the message is expected, and the btrfs code already handles the error cases. The recent kernel code changes are all related to when the message should appear in the log.
There's still a problem with memory fragmentation, no matter the error log.
There's still a problem with memory fragmentation, no matter the error log.
Yes, that's issue #260. Let's keep #257 about the kernel message, and #260 about the thing that is triggering it.
But in https://github.com/Zygo/bees/issues/260#issuecomment-1627586574 you explicitly mention that this is a bug that was backported but the fix wasn't backported yet. Unless I got something wrong...
But in https://github.com/Zygo/bees/issues/260#issuecomment-1627586574 you explicitly mention that this is a bug that was backported but the fix wasn't backported yet.
Also here in https://github.com/Zygo/bees/issues/257#issuecomment-1624096960. That means current LTS kernels 5.15 and 6.1 now have the 'vmalloc error' kernel messages that were fixed in 6.4.
The kernel changes would not affect any VM behavior, other than emitting the log message or not.
Dear,
I upgraded my system from linux 6.2.13 to 6.3.1. This resulted in the error messages below in my logs. There are no crashes.
This has also been reported here
In the reply to the above it is mentioned that an out-of-memory condition could trigger the issue. In my setup I have about 30 GB of free RAM, so this shouldn't be the case.
Feel free to contact me for more info/tests.