Open snowzach opened 3 hours ago
No idea if this is related, but googling for "CPU soft lockup" and a few other keywords led me to this thread: https://bbs.archlinux.org/viewtopic.php?id=264127&p=3. The problem there appears to affect kernels up until 5.15, which appears to be what we are running.
Platform I'm building on:
Running a very simple NFS server container on bottlerocket-aws-k8s-1.25-x86_64-v1.26.1-943d9a41
Dockerfile:
Entrypoint script:
Essentially, I run this on an AWS i3en instance with local flash provisioned as ephemeral storage and shared via this NFS server. It's a high-performance cache drive. I'm testing with i3en.2xlarge.
What I expected to happen:
It would be a super fast NFS server sharing this ephemeral storage.
What actually happened:
I can mount this storage from another i3en.2xlarge instance, and it mostly works unless we really push it. If I run the disk testing tool bonnie++ -d /the/nfs/share -u nobody and wait, within a minute or two the machine starts displaying errors in the logs such as
watchdog: BUG: soft lockup - CPU#7 stuck for 22s!
as well as
ena 0000:00:06.0 eth2: TX hasn't completed, qid 5, index 801. 26560 msecs since last interrupt, 41910 msecs since last napi execution, napi scheduled: 1
How to reproduce the problem:
Run the container, then run bonnie++ on the NFS share.
It reproduces very reliably.
Attached is a log: bottlerocket-log.txt