cloudfoundry / bosh-linux-stemcell-builder

BOSH Ubuntu Linux stemcells

Memory Errors with stemcell 1.351+ #318

Closed max-soe closed 3 months ago

max-soe commented 4 months ago

Over the past few days we have seen memory errors with the newer stemcells (1.351+). Let's use this ticket to document all findings and decide how to mitigate the issue.

Slack discussion: https://cloudfoundry.slack.com/archives/C02HWMDUQ/p1707492827160649

ChrisMcGowan commented 4 months ago

Some of this is in the Slack thread above, but here is our timeline, versions, and observations.

Tech details: IaaS is AWS GovCloud, BOSH directors at 280.0.14 and stemcell 1.351. Diego-cells are m5.2xlarge, using a memory overallocation of 50 GB vs the physical RAM of 32 GB. The platform has used this configuration for a few years now without issue.

Back on Jan 13/14th we started to see "Memory cgroup out of memory" OOM errors in /var/log/kern.log on diego-cells. At the time we were on stemcell 1.329, which was Linux kernel 6.2.0-39, and cf-deployment v35.3. This deployment combo was done on Dec 28th of 2023.

Jan 23rd we moved to stemcell 1.340, which was still on Linux kernel 6.2.0-39, and cf-deployment v37.0.0. The memory cgroup out-of-memory errors continued in kern.log. At this time we had no reported issues from our users.

Jan 30th we moved to stemcell 1.351, which was now Linux kernel 6.5.0-15, and cf-deployment v37.2.0. The memory cgroup out-of-memory errors continued in kern.log but increased. At the same time our users started to see OOM errors while staging their apps, as well as an increase in running instances crashing with OOM errors - some users were seeing "Instance became unhealthy: Liveness check unsuccessful: failed to make TCP connection to <silk cidr addr>:8080: dial tcp <silk cidr addr>:8080: connect: connection refused (out of memory); process did not exit" in cf events. These crash events seemed to be spread out, not localized to specific diego-cells, and happened on apps using various buildpacks. Looking at Prometheus, the metric firehose_value_metric_bbs_crashed_actual_lr_ps increased as well.

Feb 1st we kept the same versions as Jan 30th but increased diego-cell capacity by 10%. No change in errors or reported issues.

Feb 7th we deployed cf-deployment v37.3.0 but were still on stemcell 1.351. No change in the elevated kern.log errors or user-reported OOM issues.

Feb 8th we deployed stemcell 1.360, which was Linux kernel 6.5.0-17 and was supposed to contain the fix noted here. This deployment was also done with cf-deployment v37.4.0. Again, no change in the amount of kern.log errors and staging errors, but crashing instances increased, then leveled out, and are still elevated in firehose_value_metric_bbs_crashed_actual_lr_ps.

On Feb 13th we updated our deployment to increase the staging memory limit from the default 1024 to 2048. We also expanded our diego-cell capacity another 10%. Stemcell and cf-deployment versions stayed the same as Feb 8th. The kern.log errors stayed the same. For users that did a cf restage of their apps, most stopped getting OOM errors, but a few still happen. We are still looking closer, but the consistently failing one is based on the node-js buildpack. The number of instances still crashing did drop, looking at the firehose_value_metric_bbs_crashed_actual_lr_ps metric, but is still elevated from before the event. When looking at one of the new crashes, we found it was on a newly added diego-cell which only had 10 running instances on it, and this app instance ran and crashed twice on this cell with the same kern.log errors.

During this whole event, looking at Prometheus metrics from BOSH, the diego-cell average physical memory usage stayed around 50-65% - typical of what we have seen in the past. No metrics, log entries, or BOSH HM events indicate any of the diego-cells ran out of physical RAM. Prometheus metrics on diego-cell allocated vs available memory capacity showed we were between 50-70% of allocated in use. The first capacity add of cells was to lower that amount closer to 50%. The second add was to see if some additional cushion would help - it didn't.

The change of the stemcell kernel from the 6.2 series to 6.5 seems to have made the problem a lot worse. Increasing staging memory is a band-aid for staging apps, but does nothing for the increase in running instances crashing with OOM errors.

What we have not narrowed down yet is why the kern.log errors started on Jan 13/14 when we were still on stemcell 1.329 using kernel 6.2.0-39. The bug fix from above noted having to roll back to kernel 6.2.0-35, so maybe 6.2.0-39 had a bug as well that simply took longer to manifest? Going back over 6 months, there are zero hits on these errors in our log archive. Another CF user reported that rolling back to stemcell 1.340 removed most of their issues, but going back to 1.340 from 1.360 re-opens at least 1 high, 2 high/med, and 4 med/high CVEs according to the release notes.

PlamenDoychev commented 4 months ago

Dear Colleagues,

CF foundation versions used: cf-deployment v37, Linux stemcell v1.351; also tested on v1.360.

From the Cloud Foundry side we noticed the following symptom affecting CF apps: during the CF staging process, a large number of apps (using different buildpacks) fail with OOM.

   Exit status 137 (out of memory)
   Cell f8e8a121-82a1-4d0a-a98e-32aa29ae483d stopping instance 23a3b93a-2ce2-4470-89e5-7c89fffd3508
   Cell f8e8a121-82a1-4d0a-a98e-32aa29ae483d destroying container for instance 23a3b93a-2ce2-4470-89e5-7c89fffd3508
Error staging application: StagingError - Staging error: staging failed
FAILED

Based on our investigation, we validated that the current staging container memory limit of 1024 MB is no longer sufficient to stage applications which previously staged successfully.

In order to work around the issue, we:

  1. Noticed that in some cases increasing the requested app memory solves the issue. E.g. a dummy hello-world application that usually requires 500 MB was consistently failing to stage; increasing the requested memory to 1 GB allowed it to stage. Unfortunately this solution does not work in general and isn't acceptable for customers.
  2. Had a plan to adjust the general configuration for the staging container size: https://github.com/cloudfoundry/capi-release/blob/96fda367a817aaccbfc4c735db0ab81882066a5c/jobs/cloud_controller_ng/spec#L927. This would most likely have hidden the problem, but the main issue would still be in place.
  3. Decided to downgrade the stemcell to 1.340 as the last known good. This largely resolved the staging issues.

schindlersebastian commented 4 months ago

same here with 1.360:

Exit status 137 (out of memory)
StagingError - Staging error: staging failed
FAILED

As @ChrisMcGowan and @PlamenDoychev mentioned, increasing the memory as a workaround solves the issue...

ChrisMcGowan commented 4 months ago

Any new updates from any of the working groups, or new band-aids folks have found?

Stemcell 1.379 was released the other day, but the kernel bump to 6.5.0-18 is minor, so I'm not expecting much, if any, change to the issue. We still plan to roll that stemcell out into production.

Just for reference for folks rolling back to 1.340 or older stemcells: the switch of the kernel from 6.2.X to 6.5.X was because the 6.2.X kernel is now EOL - see: https://ubuntu.com/about/release-cycle#ubuntu-kernel-release-cycle. Rolling back to something EOL and losing patched CVEs is a deal breaker for us.

cunnie commented 4 months ago

FYI, Lakin and I have discovered that the OOMs occur with total_cache bumping up against the 1GB memory limit of the cgroup. It appears to be a problem with cache eviction not working properly. We have 25 OOMs on an 8-core 16 GB 1.379 vSphere Diego cell, and the total cache ranges from 925 MiB to 972 MiB. On the earlier stemcells, the total footprint (including cache) rarely exceeds 500 MiB.
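For anyone who wants to check this on their own cells, here is a minimal sketch of the kind of inspection we've been doing (the garden container cgroup path below is an assumption; substitute the memory cgroup of the container or staging task you're looking at):

# sketch: inspect cache vs. RSS and the limit for one container's memory cgroup (v1)
CGROUP=/sys/fs/cgroup/memory/garden/some-container-handle   # hypothetical path
grep -E 'total_cache|total_rss' "$CGROUP"/memory.stat
cat "$CGROUP"/memory.limit_in_bytes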

cunnie commented 4 months ago

Synopsis: We're still troubleshooting the OOM problem.

cunnie commented 4 months ago

Synopsis:

cunnie commented 4 months ago

FAQ: OOM Errors on New Jammy Stemcells During Staging

The most recent set of stemcells (Jammy 1.351+) has introduced intermittent OOM (out-of-memory) failures when staging Cloud Foundry applications. Though intermittent, these errors have disrupted user updates and triggered several open issues. We believe it’s a kernel bug. Until the issue is fixed, we recommend users pin to an earlier stemcell (1.340) or increase the staging memory limit for their applications.

What’s the error that users are seeing?

Before running an application on Cloud Foundry, the application must be “staged” (choosing the appropriate buildpack (Ruby, Golang, etc.), compiling, resolving dependencies). When a user runs “cf push” or “cf restage”, staging would fail with “Error staging application: StagingError - Staging error: staging failed”. When viewing /var/log/kern.log on the Diego cell where the app was staged, one would see the error “Memory cgroup out of memory: Killed process …” along with a stack trace.

How Can I Avoid OOMs on my Foundation?

One way is to pin the stemcell to Jammy 1.340 and not upgrade past that. If that’s not possible, bump the staging RAM limit from 1 GiB to 2 GiB or higher, depending on your staging footprint. Specifically, modify dea_next.staging_memory_limit_mb. Current default is 1024; we recommend bumping it to 2048 or 4096.
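For cf-deployment operators using BOSH ops files, here is a minimal sketch of what that override might look like (the "api" instance group name and the property path are assumptions based on the cloud_controller_ng job spec; adjust them to match your manifest):

cat > bump-staging-memory.yml <<'EOF'
# sketch: raise the staging container memory limit from the 1024 MiB default to 2048 MiB
- type: replace
  path: /instance_groups/name=api/jobs/name=cloud_controller_ng/properties/dea_next?/staging_memory_limit_mb
  value: 2048
EOF
bosh -d cf deploy cf-deployment.yml -o bump-staging-memory.yml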

Will Increasing the Staging RAM Limit Adversely Affect the Foundation?

We doubt increasing the staging RAM limit will have a negative impact unless the user is in the habit of restaging all their applications at the same time. The staging cycle is short-lived, and though staging an app will reserve a greater amount of RAM, that RAM will be released when the staging cycle completes within a few minutes.

What Stemcells are Affected?

The Linux stemcells based on Canonical’s Ubuntu Jammy Jellyfish release are affected, from version 1.351 (released January 29, 2024) through 1.390 (the current release). That coincides with Canonical’s introduction of the 6.5 Linux kernel (prior stemcells had the 6.2 Linux kernel).

Which IaaSes are Affected?

We have seen the problem on vSphere, GCP, and AWS, and we suspect it occurs on all IaaSes.

What’s Causing the Error?

We believe that the error is caused by a poor interaction between the Linux 6.5 kernel and v1 cgroups; the Linux 6.5 kernel was introduced with the 1.351 stemcell. Specifically, we suspect the introduction of Multi-Gen LRU, which is enabled by default in Ubuntu's 6.5 kernel (see the investigation later in this thread).
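A quick way to check whether Multi-Gen LRU is enabled on a given Diego cell (a sketch; "cf" and "diego-cell" are the usual cf-deployment names, adjust as needed):

bosh -d cf ssh diego-cell/0 -c 'cat /sys/kernel/mm/lru_gen/enabled'
# 0x0007 = Multi-Gen LRU fully enabled (Ubuntu 6.5 kernels); 0x0000 = disabled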

Which Applications are Affected?

Golang apps, and we have reports of NodeJS and Java apps as well.

What's Being Done to Fix the Error?

We're planning to roll back the kernel from 6.5 to 5.15. In the meantime, we're pursuing a fix with Canonical; we're also considering bumping the default staging memory limit.

matthewruffell commented 3 months ago

2/29 We were able to enhance our OOM-replication for Canonical to troubleshoot — on a regular Jammy VM, we were able to OOM almost every time. We accomplished this by lowering the memory limit from 1GiB to 512MiB. This makes it easier for Canonical to debug.

Hi @cunnie, I ran main.go on 5.15, 6.2 and 6.5, and found that under 512MiB we OOM every time; there is no scenario in which this is enough memory to compile main.go. I also made an unbounded cgroup to see how much memory is consumed at the peak, and found it to be somewhere around 1-1.1 GB.

$ uname -rv
5.15.0-97-generic #107-Ubuntu SMP Wed Feb 7 13:26:48 UTC 2024
$ cat /sys/fs/cgroup/memory/system.slice/512mb/memory.max_usage_in_bytes 
536883200
$ cat /sys/fs/cgroup/memory/system.slice/unbounded/memory.max_usage_in_bytes 
1091538944

$ uname -rv
6.2.0-39-generic #40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2
$ cat /sys/fs/cgroup/memory/system.slice/512mb/memory.max_usage_in_bytes
536920064
$ cat /sys/fs/cgroup/memory/system.slice/unbounded/memory.max_usage_in_bytes
1083924480

$ uname -rv
6.5.0-21-generic #21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb  9 13:32:52 UTC 2
$ cat /sys/fs/cgroup/memory/system.slice/512mb/memory.max_usage_in_bytes
537112576
$ cat /sys/fs/cgroup/memory/system.slice/unbounded/memory.max_usage_in_bytes
1180794880

Have you managed to compile main.go within a 512MiB limit on any kernel?

Do you have an example workload that used to work under 6.2, but now fails on 6.5?

I am also reviewing all commits in cgroup v1 between 6.2 and 6.5. Support for v1 does indeed exist under jammy, but v2 will always be better tested as it is more widely used, since it has been the new default for a few years now.

Thanks, Matthew

jpalermo commented 3 months ago

Hey @matthewruffell

I was just trying to reproduce again and had some trouble getting it to build on 6.2, which is weird because I'm pretty sure I'm using the exact same steps I was doing the other day.

I did get it to pass on 6.2 and fail on 6.5 when turning GOMAXPROCS down to 4 from the 16 we were using the other day.

$ uname -rv
6.2.0-39-generic #40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2
$ cat /sys/fs/cgroup/memory/system.slice/oom-test/memory.limit_in_bytes
536870912
$ cat /sys/fs/cgroup/memory/system.slice/oom-test/memory.memsw.limit_in_bytes
536870912
$ GOMAXPROCS=4 ../go/bin/go build -mod vendor -ldflags="-s -w" -a .
$

$ uname -rv
6.5.0-1014-gcp #14~22.04.1-Ubuntu SMP Sat Feb 10 04:57:00 UTC 2024
$ cat /sys/fs/cgroup/memory/system.slice/oom-test/memory.limit_in_bytes
536870912
$ cat /sys/fs/cgroup/memory/system.slice/oom-test/memory.memsw.limit_in_bytes
536870912
$ GOMAXPROCS=4 ../go/bin/go build -mod vendor -ldflags="-s -w" -a .
github.com/jackc/pgtype: /workspace/go/pkg/tool/linux_amd64/compile: signal: killed
github.com/redis/go-redis/v9: /workspace/go/pkg/tool/linux_amd64/compile: signal: killed
$
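For anyone else trying to reproduce: roughly how a cgroup like the one above can be set up on a stock Jammy VM with the cgroup v1 memory controller mounted (a sketch, not necessarily the exact steps we used; memory.memsw.* only exists when swap accounting is enabled):

sudo mkdir -p /sys/fs/cgroup/memory/system.slice/oom-test
# 512 MiB hard limit, and the same limit for memory+swap
echo $((512*1024*1024)) | sudo tee /sys/fs/cgroup/memory/system.slice/oom-test/memory.limit_in_bytes
echo $((512*1024*1024)) | sudo tee /sys/fs/cgroup/memory/system.slice/oom-test/memory.memsw.limit_in_bytes
# move the current shell (and its children) into the cgroup, then build from the source directory
echo $$ | sudo tee /sys/fs/cgroup/memory/system.slice/oom-test/cgroup.procs
GOMAXPROCS=4 ../go/bin/go build -mod vendor -ldflags="-s -w" -a .
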
jpalermo commented 3 months ago

We tested on the 5.15 non-hwe kernel and were not able to reproduce the issue there. So it doesn't appear to be introduced by something that got back-ported to 5.15.

mymasse commented 3 months ago

Is anything going to be done for when we are back on a 6.5 kernel?

jpalermo commented 3 months ago

If it is a cgroups v1 & 6.5 kernel problem, the current plan is to move the Noble stemcell to cgroups v2, so it should not be impacted there.

It's also likely the problem will get fixed in the 6.5 kernel at some point, it's just a question of when.

jpalermo commented 3 months ago

The candidate stemcells with the 5.15 kernel seem to work just as well as the 6.5 ones. The current plan, unless people find problems, is to publish them on Monday.

The version numbers are obviously not what will be released, that's just the way this part of the pipeline works.

We did some testing against public IaaSes to verify VM types work. We did not do an exhaustive test of all types, but we generally found that VM types that work with the current Jammy seem to work fine with the 5.15 version too.

https://storage.googleapis.com/bosh-core-stemcells-candidate/google/bosh-stemcell-210.892-google-kvm-ubuntu-jammy-go_agent.tgz
https://storage.googleapis.com/bosh-core-stemcells-candidate/aws/bosh-stemcell-210.892-aws-xen-hvm-ubuntu-jammy-go_agent.tgz
https://storage.googleapis.com/bosh-core-stemcells-candidate/azure/bosh-stemcell-210.892-azure-hyperv-ubuntu-jammy-go_agent.tgz
https://storage.googleapis.com/bosh-core-stemcells-candidate/vsphere/bosh-stemcell-210.892-vsphere-esxi-ubuntu-jammy-go_agent.tgz
https://storage.googleapis.com/bosh-core-stemcells-candidate/openstack/bosh-stemcell-210.892-openstack-kvm-ubuntu-jammy-go_agent-raw.tgz
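If you want to try one of these on a test director, uploading directly by URL should work; pick the one for your IaaS, e.g.:

bosh upload-stemcell https://storage.googleapis.com/bosh-core-stemcells-candidate/google/bosh-stemcell-210.892-google-kvm-ubuntu-jammy-go_agent.tgz
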
matthewruffell commented 3 months ago

I think the 5.15 kernel is a good workaround for the time being. I am still looking into the 6.5 cgroups v1 issue, but it is tricky: I can sometimes get the reproducer to build correctly on 6.5 with the 512MiB RAM limit, which makes bisecting difficult since the failure is not deterministic.

I will write back once I have some more information to share. I will spend today, and early next week looking into this.

Thanks, Matthew

jpalermo commented 3 months ago

Ubuntu-jammy 1.404 has been cut with the 5.15 kernel. Can people experiencing the problem confirm if this resolves it?

matthewruffell commented 3 months ago

Hi @jpalermo @cunnie,

I have been reading all changes between 6.2 and 6.5 related to cgroups and memory control groups.

(in Linus Torvalds linux repository)

$ git log --grep "memcg" v6.2..v6.5 $ git log --grep "memcontrol" v6.2..v6.5 $ git log --grep "mm: memcontrol:" v6.2..v6.5

I came up with the following shortlist of commits which touch memory control within cgroups, and especially cache eviction:

One line shortlist (Hash, Subject): https://paste.ubuntu.com/p/dqFPP6kPHr/

Full git log: https://paste.ubuntu.com/p/Hd2jqnw6Qw/

Now, there are about 70 commits of interest, and I can't guarantee any of them are the culprit without testing them against a consistent 100% reproducer, which we don't currently have, so the following is only a theory for now.

The feature that stands out is the major work done on Multi-Generational Least Recently Used (Multi-Gen LRU).

Documentation: https://docs.kernel.org/mm/multigen_lru.html https://docs.kernel.org/next/admin-guide/mm/multigen_lru.html

Explanations: https://lwn.net/Articles/851184/ https://lwn.net/Articles/856931/

News articles: https://www.phoronix.com/news/MGLRU-In-Linux-6.1 https://www.phoronix.com/news/Linux-MGLRU-memcg-LRU

LRU is a caching concept: Least Recently Used. The "old" implementation, used in 5.15 and 6.2, adds an "age bit" that gets incremented each time the cache line is used; the entries with the lowest values get evicted when we experience memory pressure. It is pretty simple, but not very sophisticated.

Multi-Gen LRU is quite complicated, but the idea is to group pages together into "generations", taking into account things like spatial locality (how close a page is to another frequently used page; e.g. 5 pages in a row used for the same thing get grouped together), among other metrics.

Multi-Gen LRU was merged in 6.1, but disabled by default. The Ubuntu 6.2 kernel has it disabled by default. The Ubuntu 6.5 kernel turns it on by default.

$ grep -Rin "LRU_GEN" config-*
config-6.2.0-39-generic:1157:CONFIG_LRU_GEN=y
config-6.2.0-39-generic:1158:# CONFIG_LRU_GEN_ENABLED is not set
config-6.2.0-39-generic:1159:# CONFIG_LRU_GEN_STATS is not set
config-6.5.0-21-generic:1167:CONFIG_LRU_GEN=y
config-6.5.0-21-generic:1168:CONFIG_LRU_GEN_ENABLED=y
config-6.5.0-21-generic:1169:# CONFIG_LRU_GEN_STATS is not set

6.2:

$ cat /sys/kernel/mm/lru_gen/enabled
0x0000

6.5:

$ cat /sys/kernel/mm/lru_gen/enabled
0x0007

0x0000 is off. 0x0007 is fully on, as per the table in:

https://docs.kernel.org/next/admin-guide/mm/multigen_lru.html

Before I go too deep into this rabbit hole, could we please try turning off Multi-Gen LRU, and reverting back to the basic LRU as in 5.15 and 6.2, on the 6.5 kernel?

1) Boot into the 6.5 kernel

2) Confirm Multi-Gen LRU is on:

$ cat /sys/kernel/mm/lru_gen/enabled
0x0007

3) Disable Multi-Gen LRU:

$ echo n | sudo tee /sys/kernel/mm/lru_gen/enabled
n

4) Check it is off:

$ cat /sys/kernel/mm/lru_gen/enabled
0x0000

5) Make a cgroup and try to reproduce the issue with main.go or a real-world workload.

Please let me know if it makes any difference. If it does, then we have our culprit, and we can study Multi-Gen LRU more, and if it doesn't, it's back to the drawing board, and we need to get a 100% reproducer running for further analysis.

Thanks, Matthew

jpalermo commented 3 months ago

We did some testing with Multi-Gen LRU disabled and were unable to reproduce the problem. If anybody still running the 6.5 kernel stemcells is able to disable it and see if that resolves the problem, that would be great data to have.

You could either run it easily via bosh ssh -c against the whole instance group, or do it via a pre-start script and the os-conf release.
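For example, to flip it on every cell in an instance group in one shot (a sketch; "cf" and "diego-cell" are the usual cf-deployment names):

bosh -d cf ssh diego-cell -c 'echo n | sudo tee /sys/kernel/mm/lru_gen/enabled'

Note that this setting does not survive a VM recreate, so the os-conf pre-start route (running the same echo at boot) is the more durable option.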

schindlersebastian commented 3 months ago

Hi *, we did some testing with Multi-Gen LRU disabled on all of our diego cells. I can confirm that afterwards the OOM problem was no longer reproducible!

Thanks for digging into it! Sebastian

EDIT: After in-depth testing in all of our stages, we saw no occurrences of the OOM error with /sys/kernel/mm/lru_gen/enabled set to "n" (0x0000). As soon as we re-enable Multi-Gen LRU, cf pushes start to fail in 50% - 70% of the attempts.

jpalermo commented 3 months ago

Sounds like the ubuntu-jammy 1.404 stemcell has resolved the issue by using the 5.15 kernel. Going to close now, but reopen if people see the issue again.

matthewruffell commented 2 months ago

Hi everyone,

Just wanted to drop by with an update on the current situation.

I was reading the commits to Multi-Gen LRU between 6.5 and 6.8, and came across

commit 669281ee7ef731fb5204df9d948669bf32a5e68d
Author: Kalesh Singh kaleshsingh@google.com
Date: Tue Aug 1 19:56:02 2023 -0700
Subject: Multi-gen LRU: fix per-zone reclaim
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=669281ee7ef731fb5204df9d948669bf32a5e68d

which was particularly interesting. The symptoms are pretty much the same, and there is a github issue [1] which describes the same sort of issues:

[1] https://github.com/raspberrypi/linux/issues/5395

Unfortunately, this commit is already applied to the 6.5 kernel, and is a part of 6.5.0-9-generic:

$ git log --grep "Multi-gen LRU: fix per-zone reclaim" 27c7d0b93445 Multi-gen LRU: fix per-zone reclaim $ git describe --contains 27c7d0b93445eaadfe46bcdb57dab2090e023c19 Ubuntu-hwe-6.5-6.5.0-9.9_22.04.2~128

You were testing at least 6.5.0-15-generic, so the commit is present; we are looking for another fix.

I checked all the recent patch additions to 6.5, and there were actually a lot of Multi-Gen LRU commits added very recently to the Ubuntu 6.5 kernel.

$ git log --grep "mglru" --grep "MGLRU" --grep "Multi-gen LRU" Ubuntu-6.5.0-15.15..origin/master-next

commit c28ac3c7eb945fee6e20f47d576af68fdff1392a
Author: Yu Zhao yuzhao@google.com
Date: Fri Dec 22 21:56:47 2023 -0700
Subject: mm/mglru: skip special VMAs in lru_gen_look_around()
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c28ac3c7eb945fee6e20f47d576af68fdff1392a

commit 4376807bf2d5371c3e00080c972be568c3f8a7d1
Author: Yu Zhao yuzhao@google.com
Date: Thu Dec 7 23:14:07 2023 -0700
Subject: mm/mglru: reclaim offlined memcgs harder
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4376807bf2d5371c3e00080c972be568c3f8a7d1

commit 8aa420617918d12d1f5d55030a503c9418e73c2c
Author: Yu Zhao yuzhao@google.com
Date: Thu Dec 7 23:14:06 2023 -0700
Subject: mm/mglru: respect min_ttl_ms with memcgs
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8aa420617918d12d1f5d55030a503c9418e73c2c

commit 5095a2b23987d3c3c47dd16b3d4080e2733b8bb9
Author: Yu Zhao yuzhao@google.com
Date: Thu Dec 7 23:14:05 2023 -0700
Subject: mm/mglru: try to stop at high watermarks
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5095a2b23987d3c3c47dd16b3d4080e2733b8bb9

commit 081488051d28d32569ebb7c7a23572778b2e7d57
Author: Yu Zhao yuzhao@google.com
Date: Thu Dec 7 23:14:04 2023 -0700
Subject: mm/mglru: fix underprotected page cache
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=081488051d28d32569ebb7c7a23572778b2e7d57

commit bb5e7f234eacf34b65be67ebb3613e3b8cf11b87
Author: Kalesh Singh kaleshsingh@google.com
Date: Tue Aug 1 19:56:03 2023 -0700
Subject: Multi-gen LRU: avoid race in inc_min_seq()
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bb5e7f234eacf34b65be67ebb3613e3b8cf11b87

All but the very last commit are introduced in 6.5.0-27-generic, which was just released to -updates today.

These fixups get us most of the way to all the fixes available in 6.8, and I do wonder if they help things.

Would anyone please be able to install 6.5.0-27-generic on a stemcell and run a real-world workload through it, with Multi-Gen LRU enabled?

I would be very eager to hear the results. These patches should help with cache eviction and with not running out of memory.

Thanks, Matthew

matthewruffell commented 2 months ago

Hi @jpalermo @cunnie,

Have you had a chance to have a look at 6.5.0-27-generic?

Thanks, Matthew

matthewruffell commented 1 month ago

Hi @jpalermo @cunnie,

Is it okay to assume that memory reclaim is good enough on 6.5.0-27-generic or later? Let me know if you are still interested in 6.5. Noble has been released with its 6.8 kernel now; I hope your cgroups v2 transition is going well.

Thanks, Matthew