RedPill-TTG / redpill-lkm

Linux kernel module for RedPill
GNU General Public License v3.0

Kernel Panic or Lockup on DS3615xs image #21

Open · WiteWulf opened 2 years ago

WiteWulf commented 2 years ago

A number of users on the forum are reporting kernel panics on baremetal installs, or lockups of guests on virtualisation platforms, when running the DS3615xs image specifically. This is typically triggered by running docker in general, or certain docker images, but in some circumstances appears to have been caused by high IO load. A common factor is the use of databases (notably influxdb, mariadb, mysql and elasticsearch), but nginx and jdownloader2 have also been implicated.

This has been observed on baremetal HP Gen7 and Gen8 servers, and on Proxmox and ESXi guests, with a variety of Xeon CPUs (E3-1265L V2, E3-1270 V2, E3-1241 V3, E3-1220L V2 and E3-1265L V4), as well as Celeron and AMD CPUs.

Most users are on DSM 7.0.1-RC1, but I also observed this behaviour on DSM 6.2.4.

(edit: also confirmed to affect the 7.0 beta and 7.0.1, i.e. not just the release candidate)

Conversely, a number of users with DS918+ images have reported no issues with running docker or known problematic images (in my case influxdb causes a 100% reproducible crash).

On my baremetal HP Gen8 running 6.2.4 I get the following console output before a reboot:

[  191.452302] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3

[  191.487637] CPU: 3 PID: 19775 Comm: containerd-shim Tainted: PF          O 3.10.105 #25556

[  191.528112] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019

[  191.562597]  ffffffff814c904d ffffffff814c8121 0000000000000010 ffff880109ac8d58

[  191.599118]  ffff880109ac8cf0 0000000000000000 0000000000000003 000000000000002c

[  191.634943]  0000000000000003 ffffffff80000001 0000000000000010 ffff880103817c00

[  191.670604] Call Trace:

[  191.682506]  <NMI>  [<ffffffff814c904d>] ? dump_stack+0xc/0x15

[  191.710494]  [<ffffffff814c8121>] ? panic+0xbb/0x1ce

[  191.735108]  [<ffffffff810a0922>] ? watchdog_overflow_callback+0xb2/0xc0

[  191.768203]  [<ffffffff810b152b>] ? __perf_event_overflow+0x8b/0x240

[  191.799789]  [<ffffffff810b02d4>] ? perf_event_update_userpage+0x14/0xf0

[  191.834349]  [<ffffffff81015411>] ? intel_pmu_handle_irq+0x1d1/0x360

[  191.865505]  [<ffffffff81010026>] ? perf_event_nmi_handler+0x26/0x40

[  191.897683]  [<ffffffff81005fa8>] ? do_nmi+0xf8/0x3e0

[  191.922372]  [<ffffffff814cfa53>] ? end_repeat_nmi+0x1e/0x7e

[  191.950899]  <<EOE>> 

[  191.961095] Rebooting in 3 seconds..

This is seen by others on baremetal when using docker. Virtualisation platform users see 100% CPU usage on their xpenology guest and it becomes unresponsive, requiring a restart of the guest. The majority of kernel panics cite containerd-shim as being at fault, but sometimes (rarely) it will list a process being run inside a docker container (notably influxdb in my case).

This is notably similar to an issue logged with RHEL a number of years ago that they note was fixed in a subsequent kernel release: https://access.redhat.com/solutions/1354963

WiteWulf commented 2 years ago

@WiteWulf Did you feed back this bug to the forum? The team doesn't seem to reply to other posts. https://xpenology.com/forum/topic/45795-redpill-the-new-loader-for-624-discussion/

It's been covered in depth on the forum, both in the main RedPill thread and in its own dedicated thread. Both are easy to find with the search function.

ThorGroup tend to reply to the forum threads at most once a week; there was a big update from them overnight in which they addressed this issue repeatedly.

ceozero commented 2 years ago

Well, if there is any solution, please @ me. Thank you.

WiteWulf commented 2 years ago

Sorry, no, that’s not how this works 😀

If you want to keep up to date either “follow” the repo or issue here on GitHub or the thread on the forums. Use the tools available to you…

ttg-public commented 2 years ago

We don't want to jinx it but... we THINK we nailed it down ;)

It looks like we were accidentally touching some registers which weren't preserved by the caller. This is an exception with a few syscalls in Linux (the full stack frame is not preserved, for performance reasons). After numerous articles and plenty of research we concluded it's not worth fighting for literally 1-3 CPU cycles per binary and going ASM-crazy (by comparison, access to MEMORY is one to a few hundred times slower, so it only matters at datacenter scale with millions of executions), and added an unconditional jump to the execve() override. With any modern CPU (we tested as far back as one of the first Core2Duos) the impact is immeasurable due to branch prediction ;)

The fact it worked on the 918+ seems to be sheer luck: the GCC used to compile the 918+ kernel didn't touch these registers (or didn't touch them enough to cause a problem). These "random" soft lockups were caused by the interaction of the swapper process and context switching. That's why docker + a database was a PERFECT triggering scenario, as it stresses memory, uses swap extensively, executes many processes and switches contexts like crazy (which, as a side note, is why a database in a container is really hard to pull off correctly in any performance-oriented application).

There are two tests we were running which consistently crashed the 3615xs instance on Proxmox:

# Test 1 - installing and removing mysql container; usually crashed 512MB instance within the first try/seconds
while true; do
    sudo docker rm -f test-mysql ;
    sudo docker image prune --all -f ;
    sudo docker run --name test-mysql -e MYSQL_ROOT_PASSWORD=test -d mysql:latest &&
    echo "OK"
done

# Test 2 - on instances with more RAM stress-testing the server usually triggered the crash in under a minute
docker run --name test-mysql -e MYSQL_ROOT_PASSWORD=test -d mysql:latest
docker exec -it test-mysql bash
# execute mysql -u root -p => then execute any SQL like SHOW DATABASES;
# for some reason running mysqlslap right away ends up in denied connections?
mysqlslap --user=root -p --auto-generate-sql --number-char-cols=100 --number-int-cols=100 --detach=10 --concurrency=200 --iterations=5000
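
For repeated runs, a non-interactive variant of Test 2 may be convenient. This is only a sketch: the sleep, passing the root password inline, and the assumption that the mysql image ships mysqlslap are mine, not part of the original test.

# hedged sketch: same parameters as above, but without the interactive shell
docker run --name test-mysql -e MYSQL_ROOT_PASSWORD=test -d mysql:latest
sleep 30    # give mysqld time to finish initialising before connecting
docker exec test-mysql mysql -uroot -ptest -e 'SHOW DATABASES;'   # sanity check
docker exec test-mysql mysqlslap --user=root --password=test --auto-generate-sql \
    --number-char-cols=100 --number-int-cols=100 --detach=10 \
    --concurrency=200 --iterations=5000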

That being said, once we debugged the issue the fix wasn't that difficult - proper stack protection and an indirect jump seem to fix the problem. This is the commit which introduced the mechanism: https://github.com/RedPill-TTG/redpill-lkm/commit/122470d611dbe2b95260b82da67a0da12b6fb86b (it's slightly noisy as we moved one file; the important change is the comment on override_syscall() and the macros above that function). The fix for the execve() interception was a matter of adjusting syntax: https://github.com/RedPill-TTG/redpill-lkm/commit/cdc0e2de9273b13b1f7dcadee2d22796be5a0e02

@WiteWulf: your reports were very useful. Can you poke around and see if the new version fixes the problem? We tested it in multiple scenarios and weren't able to crash it, but...

p.s. If anyone wants to get nerdy about performance impacts CloudFlare has a nice blog post about a similar problem: https://blog.cloudflare.com/branch-predictor/

pocopico commented 2 years ago

@ttg-public @WiteWulf

Yes! At least now Universal Search (synoelasticd) doesn't crash the system!

This was the most consistent crash, and I was able to replicate it on both physical and virtual machines, which in my case ranged from AMD to Intel CPUs, old and new.

I will continue testing.

OrpheeGT commented 2 years ago

To be disregarded, did not git pull... my bad.

WiteWulf commented 2 years ago

Well, it's definitely an improvement, but not fixed. This is just a "feeling", but I can get more containers running than I could on 3474d9b before it crashes.

I had all my docker containers running, and was importing a load of data to the influxdb when it crashed:

[  393.963862] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
[  393.998533] CPU: 1 PID: 23812 Comm: python3 Tainted: PF          O 3.10.108 #42218
[  394.035043] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  394.069815]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409a48d60
[  394.106286]  ffff880409a48cf8 0000000000000000 0000000000000001 0000000000000001
[  394.143049]  0000000000000001 ffffffff80000001 0000000000000030 ffff8803f52ddc00
[  394.179034] Call Trace:
[  394.190684]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[  394.219155]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[  394.243880]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[  394.276362]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[  394.307194]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[  394.339880]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[  394.371296]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[  394.402719]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[  394.426839]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[  394.454698]  <<EOE>> 
[  394.465349] Rebooting in 3 seconds..

Notably it wasn't the influxdb or containerd-shim processes that it flagged this time, though. It was the python script I was running to gather the data and import to the influxdb.

Plex Media Server next...

Nope:

[  453.200262] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
[  453.235373] CPU: 3 PID: 23981 Comm: Plex Media Serv Tainted: PF          O 3.10.108 #42218
[  453.275990] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  453.310620]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409ac8d60
[  453.348126]  ffff880409ac8cf8 0000000000000000 0000000000000003 0000000000000001
[  453.384619]  0000000000000003 ffffffff80000001 0000000000000030 ffff8803f5372c00
[  453.420535] Call Trace:
[  453.432447]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[  453.461377]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[  453.486080]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[  453.519279]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[  453.549937]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[  453.582923]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[  453.614385]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[  453.646830]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[  453.671481]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[  453.699295]  <<EOE>> 
[  453.709435] Rebooting in 3 seconds..

After the reboot I just left Plex running in the background and it went again without me even pushing it with any updates/scans.

pocopico commented 2 years ago

@WiteWulf @ttg-public

I left Test #1 running and had no crash until I reached the pull limit! That's definitely an improvement. I still need to test it on a slower CPU, though.

OrpheeGT commented 2 years ago

So I rebuilt my image (I did it wrong the first time). With a git pull done, Synology Moments seems to be OK now when adding around 40 pictures in a row...

Edit: inserted 185 in a row: image

It imported them correctly, but after that it froze a bit... CPU at 100%: image

But no soft lockup

Serial console logs:

[ 28.496812] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0]
[ 28.904537] <redpill/bios_shims_collection.c:43> mfgBIOS: nullify zero-int for VTK_SET_HDD_ACT_LED
[ 28.908402] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.912443] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.916223] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.919735] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.923155] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.926507] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.929672] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.932944] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.937190] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.941205] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.945578] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.949115] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.952771] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.958468] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 28.962119] <redpill/bios_shims_collection.c:35> mfgBIOS: nullify zero-int for VTK_SET_DISK_LED
[ 29.007892] <redpill/pmu_shim.c:310> Got 2 bytes from PMU: reason=1 hex={2d 38} ascii="-8"
[ 29.009316] <redpill/pmu_shim.c:239> Executing cmd OUT_STATUS_LED_ON_GREEN handler cmd_shim_noop+0x0/0x30 [redpill]
[ 29.011055] <redpill/pmu_shim.c:45> vPMU received OUT_STATUS_LED_ON_GREEN using 1 bytes - NOOP
[ 31.316069] Synotify use 16384 event queue size
[ 31.316901] Synotify use 16384 event queue size
[ 31.800159] <redpill/bios_shims_collection.c:44> mfgBIOS: nullify zero-int for VTK_GET_MICROP_ID
[ 33.260153] Synotify use 16384 event queue size
[ 33.260999] Synotify use 16384 event queue size
[ 35.389368] init: synocontentextractd main process ended, respawning
[ 40.617909] Synotify use 16384 event queue size
[ 77.034590] type=1400 audit(1633172764.561:2): apparmor="DENIED" operation="capable" parent=1 profile="/volume/@appstore/SynologyMoments/usr/sbin/synophoto-thumbnail-provider" pid=13817 comm="synophoto-thumb" pid=13817 comm="synophoto-thumb" capability=36 capname="block_suspend"
[ 77.182689] type=1400 audit(1633172764.709:3): apparmor="DENIED" operation="capable" parent=1 profile="/volume/@appstore/SynologyMoments/usr/sbin/synophoto-thumbnail-provider" pid=13817 comm="synophoto-thumb" pid=13817 comm="synophoto-thumb" capability=36 capname="block_suspend"
[ 83.321087] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 83.322463] <redpill/smart_shim.c:388> Generating fake SMART values
[ 143.276691] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 143.278081] <redpill/smart_shim.c:388> Generating fake SMART values
[ 203.228901] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 203.230224] <redpill/smart_shim.c:388> Generating fake SMART values
[ 263.183224] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 263.184594] <redpill/smart_shim.c:388> Generating fake SMART values
[ 278.841375] <redpill/bios_shims_collection.c:44> mfgBIOS: nullify zero-int for VTK_GET_MICROP_ID
[ 323.138300] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 323.139673] <redpill/smart_shim.c:388> Generating fake SMART values
[ 327.134839] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 327.136215] <redpill/smart_shim.c:388> Generating fake SMART values
[ 383.092309] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 383.093719] <redpill/smart_shim.c:388> Generating fake SMART values
[ 443.050291] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 443.051679] <redpill/smart_shim.c:388> Generating fake SMART values
[ 463.522386] <redpill/bios_shims_collection.c:44> mfgBIOS: nullify zero-int for VTK_GET_MICROP_ID
[ 463.790425] <redpill/override_symbol.c:242> Obtaining lock for
[ 463.791748] <redpill/override_symbol.c:242> Writing original code to
[ 463.793135] <redpill/override_symbol.c:242> Released lock for
[ 463.794407] <redpill/override_symbol.c:217> Obtaining lock for
[ 463.795694] <redpill/override_symbol.c:217> Writing trampoline code to
[ 463.797086] <redpill/override_symbol.c:217> Released lock for
[ 463.798337] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0]
[ 502.999601] <redpill/smart_shim.c:644> Got SMART command - looking for feature=0xd0
[ 503.000962] <redpill/smart_shim.c:388> Generating fake SMART values

OrpheeGT commented 2 years ago

While uploading 800 pictures: image

[ 9773.624456] <redpill/bios_shims_collection.c:44> mfgBIOS: nullify zero-int for VTK_GET_MICROP_ID
[ 9774.007617] <redpill/override_symbol.c:242> Obtaining lock for
[ 9774.008964] <redpill/override_symbol.c:242> Writing original code to
[ 9774.010353] <redpill/override_symbol.c:242> Released lock for
[ 9774.011649] <redpill/override_symbol.c:217> Obtaining lock for
[ 9774.012948] <redpill/override_symbol.c:217> Writing trampoline code to
[ 9774.014346] <redpill/override_symbol.c:217> Released lock for
[ 9774.015665] <redpill/bios_hwcap_shim.c:66> proxying GetHwCapability(id=3)->support => real=1 [org_fout=0, ovs_fout=0]

image

WiteWulf commented 2 years ago

@OrpheeGT feeling brave enough to upgrade to 7.0.1? :)

ceozero commented 2 years ago

I just recompiled it. After several hours of testing, it didn't crash. A forced restart or power failure can accelerate the occurrence of a crash; previously, it would crash after being forced to restart 2-3 times. I've now tested it dozens of times and it doesn't seem to crash. @WiteWulf

ttg-public commented 2 years ago

I had all my docker containers running, and was importing a load of data to the influxdb when it crashed:

[  393.963862] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
[  393.998533] CPU: 1 PID: 23812 Comm: python3 Tainted: PF          O 3.10.108 #42218
[  394.035043] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  394.069815]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409a48d60
[  394.106286]  ffff880409a48cf8 0000000000000000 0000000000000001 0000000000000001
[  394.143049]  0000000000000001 ffffffff80000001 0000000000000030 ffff8803f52ddc00
[  394.179034] Call Trace:
[  394.190684]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[  394.219155]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[  394.243880]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[  394.276362]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[  394.307194]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[  394.339880]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[  394.371296]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[  394.402719]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[  394.426839]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[  394.454698]  <<EOE>> 
[  394.465349] Rebooting in 3 seconds..

Notably it wasn't the influxdb or containerd-shim processes that it flagged this time, though. It was the python script I was running to gather the data and import to the influxdb.

@WiteWulf: Hard lockups are way different from soft ones. They happen when interrupts themselves weren't processed for some reason for ~10s. We only touch IRQs in a very limited set of scenarios. We tried to isolate a test case here, as Plex has huge variability, so testing with it may be very hard. Since you reported influx having issues we turned to a synthetic influx stress test, run like so:

docker run --name influx-test -d -p 8086:8086 -v $PWD:/var/lib/influxdb influxdb:1.8
docker exec -it influx-test sh

# inside of the container:
wget https://golang.org/dl/go1.17.1.linux-amd64.tar.gz &&
    tar -C /usr/local -xzf go1.17.1.linux-amd64.tar.gz &&
    rm go1* &&
    export PATH=$PATH:/usr/local/go/bin &&
    go get -v 'github.com/influxdata/influx-stress/cmd/...'
/root/go/bin/influx-stress insert -f --stats

Indeed this causes a lot of CPU usage and plenty of swapping, along with sparse "hrtimer: interrupt took X ns" messages in dmesg (which by itself usually means the system is overloaded, but stable). However, we weren't able to trigger a hard lockup crash. The issue with hard lockups is that the watchdog may actually be wrong and crash the system even if it's not stuck in an infinite loop/lock but just too slow to unlock. That is unusual and shouldn't really happen.

  1. Can you try increasing the threshold (echo 30 > /proc/sys/kernel/watchdog_threshold) and see if it's still crashing?
  2. If that doesn't help, can you try booting with the nmi_watchdog=0 kernel option set? This will disable lockup detection (which is not something you ever want to do normally). Then try to run your workload (see the sketch just after this list for both knobs).

    • In such a state, can you check if the system is really locking up?
    • It may be that once you finish importing data and processing the Plex cache the situation normalizes (i.e. the CPU usage drops to idle and the system works like it should).
    • If the system locks up in such a state (i.e. it's not accessible) it may be a real lockup.
  3. Can you try crashing your system with the stress test above? We weren't able to, but maybe you will? We didn't try this on any of our MicroServer Gen8s, only on a VM.
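
A minimal sketch of both knobs follows (the sysctl name and the exact location of the boot configuration vary between setups, so treat the paths as assumptions):

# 1) raise the hard-lockup threshold at runtime (on some kernels the file is
#    named watchdog_thresh rather than watchdog_threshold):
echo 30 > /proc/sys/kernel/watchdog_threshold

# 2) disable the NMI watchdog at boot by adding nmi_watchdog=0 to the kernel
#    command line in the loader's grub configuration, then verify after reboot:
cat /proc/sys/kernel/nmi_watchdog    # 0 = disabled, 1 = enabled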

We found one more place which can technically cause an infinite loop under very specific circumstances but that shouldn't cause hard lockup crash even if it occurs (see https://github.com/RedPill-TTG/redpill-lkm/commit/0065c89e109542ebb08f124dcac9184665863276)

@OrpheeGT: your results look normal. The CPU usage peaks due to moments processing. The reported locks/unlocks are actually our debug messages to diagnose if something crashes within the lock - if you see lock => unlock sequence it's good, it would be a problem if it was a lock with no corresponding unlock. Also, we see you removed the addresses (ffff.....) while posting a log snippet here - you don't have to. They have no identifying information in them (these are addresses in memory which are semi-random every reboot).

WiteWulf commented 2 years ago

Okay, everyone else has gone to bed so I've got time to tinker :)

I ran part 1 of the test, running the influx-test container with the stress test data, and it crashed within seconds:

[41152.354507] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
[41152.389469] CPU: 1 PID: 5121 Comm: influxd Tainted: PF          O 3.10.108 #42218
[41152.425738] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[41152.460622]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409a48d60
[41152.498165]  ffff880409a48cf8 0000000000000000 0000000000000001 0000000000000001
[41152.535348]  0000000000000001 ffffffff80000001 0000000000000030 ffff8803f52ddc00
[41152.571634] Call Trace:
[41152.583761]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[41152.612922]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[41152.637273]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[41152.670285]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[41152.701594]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[41152.734834]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[41152.766158]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[41152.797370]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[41152.822551]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[41152.849821]  <<EOE>> 
[41152.859964] Rebooting in 3 seconds..

NB: due to this crash, all subsequent tests were carried out with data scrubbing running in the background, imposing additional CPU and storage load on the system.

Moving on, I tried echo 30 > /proc/sys/kernel/watchdog_threshold, but /proc/sys/kernel/watchdog_threshold doesn't exist on my system; /proc/sys/kernel/watchdog_thresh does, however. I don't know if this was a typo or a genuine difference between our systems.
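
A quick way to see which watchdog sysctls a given kernel actually exposes (generic Linux commands, nothing DSM-specific assumed):

ls -1 /proc/sys/kernel/ | grep -i watchdog
# the entries referenced in this thread are nmi_watchdog and watchdog_thresh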

I ran echo 30 > /proc/sys/kernel/watchdog_thresh instead and tried the stress test again...

Watching the output from top on the host while I ran it, I saw influxd running at around 500% CPU until the system panicked again after less than a minute:

[  712.721022] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
[  712.756529] CPU: 0 PID: 18136 Comm: influxd Tainted: PF          O 3.10.108 #42218
[  712.793911] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[  712.828355]  ffffffff814a2759 ffffffff814a16b1 0000000000000010 ffff880409a08d60
[  712.864657]  ffff880409a08cf8 0000000000000000 0000000000000000 0000000000000001
[  712.900318]  0000000000000000 ffffffff80000001 0000000000000030 ffff8803f521d000
[  712.936297] Call Trace:
[  712.948230]  <NMI>  [<ffffffff814a2759>] ? dump_stack+0xc/0x15
[  712.976803]  [<ffffffff814a16b1>] ? panic+0xbb/0x1df
[  713.001207]  [<ffffffff810a9eb8>] ? watchdog_overflow_callback+0xa8/0xb0
[  713.034067]  [<ffffffff810db7d3>] ? __perf_event_overflow+0x93/0x230
[  713.065591]  [<ffffffff810da612>] ? perf_event_update_userpage+0x12/0xf0
[  713.098068]  [<ffffffff810152a4>] ? intel_pmu_handle_irq+0x1b4/0x340
[  713.129805]  [<ffffffff814a9d06>] ? perf_event_nmi_handler+0x26/0x40
[  713.161300]  [<ffffffff814a944e>] ? do_nmi+0xfe/0x440
[  713.185952]  [<ffffffff814a8a53>] ? end_repeat_nmi+0x1e/0x7e
[  713.213759]  <<EOE>> 
[  713.224341] Rebooting in 3 seconds..

Part 2, booting with nmi_watchdog=0. I ran the stress test and experienced no crashes. I left it running for about ten minutes with high CPU load and no crashes:

image

I then fired up my influxdb container that's usually guaranteed to crash. It showed high cpu for a few seconds after launch but didn't crash. I then ran the big import task that crashed it last time and it completed successfully with no crashes.

I then started up my other containers (domoticz, librenms, mysql, Grafana, ubooquity) with no crashes.

Finally I launched Plex Media Server, with no crash, and initiated a number of library updates and media analyses:

image

It did not crash during any of these tests with the NMI watchdog disabled, and the server remained responsive and usable throughout. I was watching the output of htop in an SSH session (which updated regularly) and navigating around the web UI.

Despite the obvious risk of running a system without the NMI watchdog, I'm going to leave it like this overnight and see how it goes.

I think we might be getting somewhere...

OrpheeGT commented 2 years ago

@OrpheeGT: your results look normal. The CPU usage peaks due to moments processing. The reported locks/unlocks are actually our debug messages to diagnose if something crashes within the lock - if you see lock => unlock sequence it's good, it would be a problem if it was a lock with no corresponding unlock. Also, we see you removed the addresses (ffff.....) while posting a log snippet here - you don't have to. They have no identifying information in them (these are addresses in memory which are semi-random every reboot).

I did not hide anything, actually. Maybe they are not shown on the telnet serial console?

Most of the time I watch the live telnet serial console instead of /var/log/messages.

WiteWulf commented 2 years ago

@ttg-public, on the forums you posted: "A different hard lockup seems to exist still but this shouldn't affect most of the people and judging by the last post in the issue it may not even be a bug but an over-reactive kernel protection mechanism."

It's significant that I never had this on 6.2.3 with Jun's bootloader, but have seen it on 6.2.4, 7.0 and 7.0.1 since moving to redpill. I appreciate that it may be an issue introduced by Synology between 6.2.3 and 6.2.4, but imho it's more likely an interaction between redpill, the kernel and nmi watchdog.

One way to test this would be to have redpill load 6.2.3 and see if the problem persists. I don't know how much work would be involved in creating a 6.2.3 build target but it would potentially isolate the problem to either redpill or a change introduced by Synology in 6.2.4.

FWIW, I've been running this system with nmi_watchdog=0 since Saturday night with nothing obviously going wrong.

OrpheeGT commented 2 years ago

May I ask: do you know why it does not happen with Jun's loader (soft and hard lockups)? Were you able to catch it?

WiteWulf commented 2 years ago

This looks interesting: https://github.com/RedPill-TTG/redpill-lkm/commit/19f745a2ad1e892e4865c4d3fbf2fba0bfb2c090

WiteWulf commented 2 years ago

I just built a new DS3615xs 7.0.1-42218 image using the latest code commit (0df1ec8) and booted with nmi_watchdog running. The system survived the docker containers starting, but as soon as Plex started updating stuff it kernel panicked.

I've rebooted with 'nmi_watchdog=0' set and it's stable again.
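
For anyone reproducing this, the rebuild was essentially a pull-and-repack of the loader. The sketch below assumes the standard redpill-load workflow; the script name and arguments are taken from that repo as it was at the time, and the platform/version strings match this test:

cd redpill-load
git pull                                     # pick up the latest loader/redpill-lkm commits
./build-loader.sh 'DS3615xs' '7.0.1-42218'   # rebuild the DS3615xs 7.0.1-42218 image
# then write the resulting image to the boot media and reboot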

WiteWulf commented 2 years ago

Hey folks, big development over on the forums: https://xpenology.com/forum/topic/45795-redpill-the-new-loader-for-624-discussion/?do=findComment&comment=220150

tl;dr: disabling my NC360T PCIe NIC, adding the tg3 extension and using the onboard NIC instead massively improves system stability.

I never thought to ask others suffering with kernel panics if they were also using a PCIe NIC. I'll gather info and report back.
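
To check whether your own box is in the same situation, i.e. which NIC and driver it is actually using, something like the following works on most Linux systems (availability of these tools on DSM may vary):

lspci | grep -i ethernet        # list physical NICs (PCIe add-in card vs. onboard)
ethtool -i eth0                 # driver bound to the interface (tg3 = onboard Broadcom on the Gen8)
dmesg | grep -i tg3             # confirm which driver the kernel actually loaded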

WiteWulf commented 2 years ago

Hmmm, not looking so good in the cold light of day :) Multiple reboots overnight when Plex started doing library updates and media analysis. I've asked Kouill to share their boot image with me so I can test on my hardware. I'll report back.

WiteWulf commented 2 years ago

@ttg-public

I've found a strange behaviour with the NMI watchdog while rebooting and testing.

I currently have nmi_watchdog=0 in my grub.conf to keep the system stable. As demonstrated previously the system does not kernel panic under load with this.

I wanted to test a new bootloader and accidentally left nmi_watchdog=0 in the json file, so when the system booted it wouldn't crash. I spotted this and did 'echo 1 > /proc/sys/kernel/nmi_watchdog'. The kernel confirmed this on the console with 'NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.', but subsequently running the influx-test does not crash it.

This is reproducible: booting with NMI watchdog enabled results in the previously demonstrated kernel panics, but booting with NMI watchdog disabled then enabling it once the system has finished booting leaves it stable.
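
To be explicit, the sequence that stays stable here is (just the steps above restated as commands; the boot-time flag is the nmi_watchdog=0 kernel argument already in grub.conf):

cat /proc/sys/kernel/nmi_watchdog      # -> 0 after booting with nmi_watchdog=0
echo 1 > /proc/sys/kernel/nmi_watchdog # console: "NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter."
cat /proc/sys/kernel/nmi_watchdog      # -> 1, yet the influx-test no longer panics the box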

Lasx commented 2 years ago

@ttg-public

I've found a strange behaviour with the NMI watchdog while rebooting and testing.

I currently have nmi_watchdog=0 in my grub.conf to keep the system stable. As demonstrated previously the system does not kernel panic under load with this.

I wanted to test a new bootloader and accidentally left nmi_watchdog=0 in the json file, so when the system booted it wouldn't crash. I spotted this and did 'echo 1 > /proc/sys/kernel/nmi_watchdog'. The kernel confirmed this on the console with 'NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.', but subsequently running the influx-test does not crash it.

This is reproducible: booting with NMI watchdog enabled results in the previously demonstrated kernel panics, but booting with NMI watchdog disabled then enabling it once the system has finished booting leaves it stable.

Has the problem been solved? Can I upgrade?

WiteWulf commented 2 years ago

I would say this was a workaround rather than a solution. As with all things related to redpill, it comes with a risk of data loss: don't use it with data you haven't backed up elsewhere or don't care about losing.