Seagate / openSeaChest

Cross platform utilities useful for performing various operations on SATA, SAS, NVMe, and USB storage devices.

All Seagate ST20000NM007D drives ignoring PowerChoice (EPC) timers #111

Closed Deltik closed 1 year ago

Deltik commented 1 year ago

I have six Seagate ST20000NM007D (Exos X20, 20TB) hard drives that are configured with Extended Power Conditions (EPC, also known as Seagate PowerChoice) timers, but the drives never move from the active power mode to any of the idle or standby power modes.  There is no I/O on these drives, yet every minute, there is a brief seeking sound from all of them.

It's not host activity that is constantly waking up these drives, because I can manually change the power mode to standby_z with openSeaChest_PowerControl -d /dev/{} --transitionPower standby_z (or to any other power mode), and the drives will stay in that state until an I/O operation wakes them up again.
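(The /dev/{} placeholder above comes from looping over the drives with xargs, roughly like this; the device letters sd[bcdefg] are specific to my system:)

lsblk | awk '/^sd[bcdefg]/ {print $1}' | xargs -I{} /tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl -d /dev/{} --transitionPower standby_z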

My SATA host bus adapter (HBA) is in the passthrough "initiator target" (IT) mode, so it is not doing anything extra like periodic S.M.A.R.T. checks or patrol reads.  Even if it were, that wouldn't explain why manually changing the power mode keeps the drives in the selected mode.

smartd from smartmontools does do periodic S.M.A.R.T. checks, but I have it disabled for testing. I also verified that the cron daemon and systemd timers aren't running anything at the time the seeking sounds happen.

To be certain that there was no I/O of any kind, I even set up SCSI debug logging with sg3-utils:

scsi_logging_level -s -a 7

There was absolutely nothing logged at the time the drives were making the seeking sounds as if they had a mind of their own.
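Comparing two /proc/diskstats snapshots taken across one of the seeking events is another quick cross-check (a rough sketch; sd[b-g] are my device names and will differ on other systems):

grep -E ' sd[b-g] ' /proc/diskstats > /tmp/diskstats.before
sleep 60
grep -E ' sd[b-g] ' /proc/diskstats > /tmp/diskstats.after
diff /tmp/diskstats.before /tmp/diskstats.after   # no output means no I/O was accounted to these disks in the interval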

Here is how the EPC timers are configured on all of the disks (one shown for brevity):

# /tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl -d /dev/sdg --showEPCSettings
==========================================================================================
 openSeaChest_PowerControl - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2023 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_PowerControl Version: 3.3.1-4_1_1 X86_64
 Build Date: Mar 27 2023
 Today: Fri Apr 21 12:54:25 2023        User: root
==========================================================================================

/dev/sg6 - ST20000NM007D-3DJ103 - ZVT5K84J - SN01 - ATA

===EPC Settings===
        * = timer is enabled
        C column = Changeable
        S column = Savable
        All times are in 100 milliseconds

Name       Current Timer Default Timer Saved Timer   Recovery Time C S
Idle A     *1            *1            *1            1             Y Y
Idle B     *1200         *1200         *1200         4             Y Y
Idle C     *1800          6000         *1800         20            Y Y
Standby Z  *3000          9000         *3000         110           Y Y
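(In these 100 millisecond units, the current timers work out to roughly 0.1 s for Idle A, 2 minutes for Idle B, 3 minutes for Idle C, and 5 minutes for Standby Z.)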

Yet the power mode never changes on its own:

# /tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl -d /dev/sdg --checkPowerMode
==========================================================================================
 openSeaChest_PowerControl - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2023 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_PowerControl Version: 3.3.1-4_1_1 X86_64
 Build Date: Mar 27 2023
 Today: Fri Apr 21 12:57:10 2023        User: root
==========================================================================================

/dev/sg6 - ST20000NM007D-3DJ103 - ZVT5K84J - SN01 - ATA
Device is in the PM0: Active state or PM1: Idle State

EPC is enabled, which I confirmed with this command:

# /tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl -d /dev/sdg --deviceInfo | grep -i epc
                EPC [Enabled]

It seems to me that these Seagate Exos X20 drives with firmware SN01 are ignoring the EPC timer and not transitioning themselves to lower power modes.  Why might this be happening, and how can I get the drives to go into idle and standby by themselves?

vonericsen commented 1 year ago

Hi @Deltik,

Can you confirm that if you do openSeaChest_PowerControl -d <handle> --transitionPower standby followed by openSeaChest_PowerControl -d <handle> --checkPowerMode, the output correctly reports that the drive is in standby (or standby_z)? You may need to wait about 15 seconds after the transitionPower before running the power mode check.

Deltik commented 1 year ago

Sure, @vonericsen:

root@ubuntu:~# /tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl -d /dev/sdc --checkPowerMode ; /tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl -d /dev/sdc --transitionPower standby ; /tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl -d /dev/sdc --checkPowerMode ; sleep 15 ; /tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl -d /dev/sdc --checkPowerMode
==========================================================================================
 openSeaChest_PowerControl - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2023 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_PowerControl Version: 3.3.1-4_1_1 X86_64
 Build Date: Mar 27 2023
 Today: Mon Apr 24 19:17:31 2023        User: root
==========================================================================================

/dev/sg2 - ST20000NM007D-3DJ103 - ZVT5YJL5 - SN01 - ATA
Device is in the PM0: Active state or PM1: Idle State

==========================================================================================
 openSeaChest_PowerControl - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2023 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_PowerControl Version: 3.3.1-4_1_1 X86_64
 Build Date: Mar 27 2023
 Today: Mon Apr 24 19:17:31 2023        User: root
==========================================================================================

/dev/sg2 - ST20000NM007D-3DJ103 - ZVT5YJL5 - SN01 - ATA

Power Mode Transition Successful.
Please give device a few seconds to transition.

Hint:Use --checkPowerMode option to check the new Power Mode.

==========================================================================================
 openSeaChest_PowerControl - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2023 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_PowerControl Version: 3.3.1-4_1_1 X86_64
 Build Date: Mar 27 2023
 Today: Mon Apr 24 19:17:32 2023        User: root
==========================================================================================

/dev/sg2 - ST20000NM007D-3DJ103 - ZVT5YJL5 - SN01 - ATA
Device is in the PM2: Standby state and device is in the Standby_z power condition

==========================================================================================
 openSeaChest_PowerControl - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2023 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_PowerControl Version: 3.3.1-4_1_1 X86_64
 Build Date: Mar 27 2023
 Today: Mon Apr 24 19:17:47 2023        User: root
==========================================================================================

/dev/sg2 - ST20000NM007D-3DJ103 - ZVT5YJL5 - SN01 - ATA
Device is in the PM2: Standby state and device is in the Standby_z power condition

The drive goes into standby_z mode and stays there if I manually run openSeaChest_PowerControl --transitionPower or even hdparm -y. It just doesn't move to the lower power states according to the configured EPC timers, as if EPC were disabled:

root@ubuntu:~# /tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl -d /dev/sdc --showEPCSettings
==========================================================================================
 openSeaChest_PowerControl - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2023 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_PowerControl Version: 3.3.1-4_1_1 X86_64
 Build Date: Mar 27 2023
 Today: Mon Apr 24 19:20:29 2023        User: root
==========================================================================================

/dev/sg2 - ST20000NM007D-3DJ103 - ZVT5YJL5 - SN01 - ATA

===EPC Settings===
        * = timer is enabled
        C column = Changeable
        S column = Savable
        All times are in 100 milliseconds

Name       Current Timer Default Timer Saved Timer   Recovery Time C S
Idle A     *1            *1            *1            1             Y Y
Idle B     *1200         *1200         *1200         4             Y Y
Idle C     *1800          6000         *1800         20            Y Y
Standby Z  *3000          9000         *3000         110           Y Y
vonericsen commented 1 year ago

@Deltik,

Thank you! I just wanted to confirm there was not a software bug in checking the power mode; it does not look like there is.

If you wait for the amount of time set for the default timer (10 minutes for idle-c or 15 minutes for standby_z), does the drive enter these other power states?

One last question: what OS and kernel are you running? Running openSeaChest_PowerControl --version should dump this information for you.

We will see if we can reproduce this issue on our end and figure out what is happening, and we want to make sure our configuration is as similar to yours as possible.

Deltik commented 1 year ago

Thanks for taking a look!

The drives never transition the power mode themselves. I've even left them idling overnight, yet they remain active. Unusually, there are periodic seeking noises from the drives, but the periods seem to be random after each time I wake the drives. Sometimes, they happen once a minute, other times twice a second. Despite the noises, there is no disk activity on the host.

Here is the version information you requested:

# /tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl --version
==========================================================================================
 openSeaChest_PowerControl - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2023 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_PowerControl Version: 3.3.1-4_1_1 X86_64
 Build Date: Mar 27 2023
 Today: Mon Apr 24 16:20:40 2023        User: root
==========================================================================================
Version Info for openSeaChest_PowerControl:
        Utility Version: 3.3.1
        opensea-common Version: 1.23.0
        opensea-transport Version: 4.1.1
        opensea-operations Version: 4.5.2
        Build Date: Mar 27 2023
        Compiled Architecture: X86_64
        Detected Endianness: Little Endian
        Compiler Used: GCC
        Compiler Version: 11.2.1
        Operating System Type: Linux
        Operating System Version: 5.19.0-40
        Operating System Name: Ubuntu 22.04.2 LTS
Deltik commented 1 year ago

After some painstaking investigation (hindered by not being an expert at Linux internals), I narrowed down the problem to the host bus adapter (HBA) to which the Seagate Exos X20 hard drives are connected.

This particular HBA is a Dell PERC H310, and it's a piece of 💩, as I found out almost 7 years ago. After 3 more years of putting up with the bad performance, I learned that I could cross-flash it into an LSI SAS 9211-8i in IT ("Initiator Target") mode. This mode passes through the hard drives to the host, and I thought it was great for another 4 years until I got these Seagate Exos X20 hard drives.

Now, I'm getting bizarre behavior that I can't explain…

Symptoms

There are two new problems on my host that are related, but I have not been able to figure out the cause:

Boot failure due to systemd-networkd timing out

When the six Seagate Exos X20 hard drives are plugged in, I cannot achieve a successful boot of my Ubuntu 22.04 system.

The zfs-import-cache.service service runs, which imports the ZFS pool that is on the hard drives. During this time, there is a burst of random I/O for several seconds as the pool is imported.

After this, systemd-networkd.service and systemd-timesyncd.service keep trying to activate in a 90-second loop but never come up.

"A start job is running for Network Configuration" "Failed to start Network Configuration."

Weirdly, if I disable the ZFS pool import either by removing the pool from the zfs-import-cache.service cache (with zpool set cachefile=none …) or unplugging enough hard drives to prevent the import, I am able to boot successfully. After booting, I can then import the pool with no problem. It's almost as if that initial burst of I/O somehow broke the operating system.

I can read everything on the drives just fine. No processes have blocked I/O, and there are no errors in dmesg -T, EDAC, or the IPMI event log, except…

openSeaChest segmentation fault

The only concrete sign that there's something wrong with the operating system is a new segmentation fault that only happens with openSeaChest!

When I run this command:

root@box51 [~]# lsblk | awk '/^sd[bcdefg]/ {print $1}' | xargs -P1 -I{} /tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl -d /dev/{} --checkPowerMode
==========================================================================================
 openSeaChest_PowerControl - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2023 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_PowerControl Version: 3.3.1-4_1_1 X86_64
 Build Date: Mar 27 2023
 Today: Tue Apr 25 20:50:42 2023        User: root
==========================================================================================
xargs: /tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl: terminated by signal 11

The following error lines appear in dmesg -T:

[Tue Apr 25 20:50:42 2023] openSeaChest_Po[357985]: segfault at 8c ip 00007f6d4ee32314 sp 00007ffc97d094c0 error 4 in openSeaChest_PowerControl[7f6d4edd1000+8f000]
[Tue Apr 25 20:50:42 2023] Code: 0f 95 c0 48 83 c4 18 0f b6 c0 5b 41 5c f7 d8 c3 41 57 49 89 cf 41 56 41 55 41 54 45 31 e4 55 48 89 f5 53 48 89 fb 48 83 ec 38 <8b> 81 8c 00 00 00 89 54 24 14 85 c0 78 0b 48 89 cf e8 53 fb ff ff

I've attached the core dump, core.24171.gz, for reference. Here's the backtrace:

root@box51 [~]# gdb /tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl ~/core.24171
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl...
(No debugging symbols found in /tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl)

warning: core file may not match specified executable file.
[New LWP 24171]
Core was generated by `/tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl -d /'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fd8dc525314 in getdelim ()
(gdb) bt
#0  0x00007fd8dc525314 in getdelim ()
#1  0x0000000000000000 in ?? ()

If I boot up without the Seagate Exos X20 hard drives and attach them later, I can use openSeaChest normally (without segfaults). The PowerChoice (EPC) timers I had trouble with at the beginning even work as scheduled, with the drives moving themselves to lower power modes right on time.
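A simple watch loop is enough to see the scheduled transitions happen, something like this (with /dev/sdc as an example device):

while true; do
    date
    /tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl -d /dev/sdc --checkPowerMode | grep 'Device is in'
    sleep 60
done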


Some kind of activity that I haven't identified yet (possibly some mildly intensive I/O) puts the drives into a state where they no longer honor EPC.

I want to blame the LSI HBA because I bought 4 more Seagate Exos X20 hard drives, put them in another server that doesn't have an HBA (direct SATA to motherboard), and saw that their EPC timers were working as advertised with no fuss.

The lack of I/O errors, /lib/systemd/systemd-networkd hanging forever while booting, and the openSeaChest segmentation faults are really puzzling me. I'm not skilled enough at Linux internals to troubleshoot this further and hope that someone can provide some guidance.

Deltik commented 1 year ago

To make the core dump more useful, I compiled openSeaChest in debug mode (meson --buildtype=debug builddir; ninja -C builddir) and got a new core dump with this backtrace:

root@box51 [/tmp/openSeaChest]# gdb ./builddir/openSeaChest_Basics core.110783
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./builddir/openSeaChest_Basics...
[New LWP 110783]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./builddir/openSeaChest_Basics -d /dev/sda --deviceInfo'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f911eead388 in __GI___fgets_unlocked (buf=buf@entry=0x555d0871e338 "", n=n@entry=4096, fp=fp@entry=0x0) at ./libio/iofgets_u.c:50
50      ./libio/iofgets_u.c: No such file or directory.
(gdb) bt
#0  0x00007f911eead388 in __GI___fgets_unlocked (buf=buf@entry=0x555d0871e338 "", n=n@entry=4096, fp=fp@entry=0x0) at ./libio/iofgets_u.c:50
#1  0x00007f911ef3f88e in get_mnt_entry (stream=stream@entry=0x0, mp=mp@entry=0x555d0871e310, buffer=buffer@entry=0x555d0871e338 "", bufsiz=bufsiz@entry=4096) at ./misc/mntent_r.c:126
#2  0x00007f911ef3fd52 in __GI___getmntent_r (stream=0x0, mp=0x555d0871e310, buffer=0x555d0871e338 "", bufsiz=4096) at ./misc/mntent_r.c:191
#3  0x0000555d07efdbb2 in get_Partition_Count (blockDeviceName=0x555d0871bcb2 "/dev/sda") at ../subprojects/opensea-transport/src/sg_helper.c:246
#4  0x0000555d07f004ac in set_Device_Partition_Info (device=0x555d0871bb80) at ../subprojects/opensea-transport/src/sg_helper.c:1070
#5  0x0000555d07f00d95 in get_Device (filename=0x555d087122c0 "/dev/sda", device=0x555d0871bb80) at ../subprojects/opensea-transport/src/sg_helper.c:1295
#6  0x0000555d07ed8038 in main (argc=4, argv=0x7ffffd3f4ef8) at ../utils/C/openSeaChest/openSeaChest_Basics.c:987
Deltik commented 1 year ago

Looking at the code, openSeaChest tries to read /etc/mtab. On this host, I have a peculiar issue where the /etc/mtab symlink can't be dereferenced by the kernel, yet if I manually follow the symlink, I can read the mounts:

root@box51 [~]# head /etc/mtab
head: cannot open '/etc/mtab' for reading: No such file or directory
root@box51 [~]# readlink /etc/mtab
../proc/self/mounts
root@box51 [~]# head /proc/self/mounts
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,nosuid,relatime,size=49359908k,nr_inodes=12339977,mode=755,inode64 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,noexec,relatime,size=9880868k,mode=755,inode64 0 0
rpool/ROOT/os / zfs rw,relatime,xattr,posixacl 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,inode64 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,inode64 0 0
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0

None of my other bare-metal machines, virtual machines, or containers have this issue of an unreadable /etc/mtab. I keep getting more confused…

vonericsen commented 1 year ago

@Deltik,

Thanks for more debugging info on this!

So if I am understanding correctly, the issue is somehow related to the HBA doing… something we don't know about right now. And the drives do seem to follow the EPC timers when moved to a motherboard SATA port on another machine. Is this correct?

As for reading the mntents, this is not "required" by openSeaChest; it is just helpful for certain operations to know whether a drive has a mounted file system so we can clean some things up. It was added to work around an odd bug with the Ubuntu disk manager and EXT4 partitions. (If you erase an EXT4-partitioned drive without unmounting and rescanning it, something somewhere still registers that the drive has a file system, even though it no longer does after the erasure. Even a reboot does not seem to clear it, so we added unmounting and rescanning the drive to stop this issue from happening.)

Looking at the code, I think it is possible to get around the segfault by adding a check that the file pointer opened by setmntent is valid before doing anything else in that function. This won't solve all the other odd behavior, but I think it will at least get you past this segfault.

I have had some experience with odd HBA behavior, but nothing quite like this. I will think about what else may help you debug this situation further while I work on this code change.

vonericsen commented 1 year ago

I have pushed a couple of changes that should stop the segfault and should have been in the code in the first place. If you do git pull --rebase and then build again using meson, you should no longer see the segfault when it runs.

For the HBA issue, I have worked with a lot of HBAs, and they sometimes do weird things. Most often I have found that checking for updated firmware resolves any odd behavior, so if you have not already done this, maybe check whether there is newer firmware for the HBA.

Another thing you might want to do is check if the HDDs are seeing any strange bus activity. There are some things you can check from openSeaChest:

  1. openSeaChest_SMART -d <handle> --smartAttributes analyzed and check if the values in attributes 183 and 199 are going up over time
  2. openSeaChest_SMART -d <handle> --deviceStatistics has some transport statistics that should match these and report number of resets and number of CRC errors

openSeaChest does not currently dump the SATA phy event counters log, but smartctl can do this. This would also be helpful to track over time. It reports similar CRC errors and other statistics, but it's a little more cryptic than the device statistics.

If you see these counters increasing, it is possible that you have a bad cable. Not bad enough to not work at all and stop the drive from showing up, but bad enough to cause some odd behavior. When there are lots of CRC errors and resets, this will cause more retries and might be causing some of these weird things to happen.
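One way to track these over time is to append periodic snapshots to a log and compare them later, for example (a rough sketch; the device letters and the path to the openSeaChest binaries will differ on your system, and smartctl -l sataphy requires smartmontools):

for dev in /dev/sd{b..g}; do
    echo "=== $(date) $dev ==="
    openSeaChest_SMART -d "$dev" --smartAttributes analyzed
    openSeaChest_SMART -d "$dev" --deviceStatistics
    smartctl -l sataphy "$dev"
done >> /root/sata-counters.log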

openSeaChest_GenericTests -d <handle> --bufferTest attempts to detect a bad cable, but it only works sometimes. If it reports any miscompares or CRC errors, then you most likely have a bad cable, but sometimes the test will report that nothing was found 4 times in a row and then find an error on the 5th run. The cable was just as bad during the first few runs; it simply did not generate detectable errors then. I've been trying to figure out a more reliable way to make this test better since it is far from perfect, but you can give it a try as well. It is a data-safe test.
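Given that, running several passes back to back is a reasonable way to use it, for example (with /dev/sdb as an example handle):

for i in 1 2 3 4 5; do
    echo "--- bufferTest pass $i ---"
    openSeaChest_GenericTests -d /dev/sdb --bufferTest
done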

If any of these work, let me know since I have not seen behavior like this before. I'll keep thinking of what else you can try to debug this further as well.

Deltik commented 1 year ago

I've managed to solve both problems, which turned out to be unrelated. Summary:


Hard drives not going to sleep according to EPC timers

I took @vonericsen's advice and looked for a firmware update for my Dell PERC H310-turned-LSI SAS 9211-8i, and there was one. I was using the P16 firmware, and after I installed the latest firmware, P20, the Seagate Exos X20 EPC timers started working. To anyone following my steps, make sure to use the "IT" firmware, not the "IR" firmware, or use this easy-to-follow guide to update to the latest "IT" firmware.

The P17, P18, P19, and P20 release notes don't make it clear which defect was involved with keeping the drives awake, but I have not been able to reproduce the repeated seeking noises since upgrading the firmware.

Warning: Do not enable Power-Up in Standby (PUIS) with openSeaChest_Configure -d /dev/… --puisFeature enable. The LSI SAS 9211-8i will not be able to spin up the disks with PUIS enabled, not even on the latest firmware.

While I was fumbling around, I enabled PUIS and rebooted only to discover that my hard drives disappeared and went cold. It was a lot of effort to recover as I had to take each of the hard drives out and plug them directly into a SATA connection to the motherboard so that I could run openSeaChest_Configure -d /dev/… --puisFeature disable.
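To double-check whether PUIS is currently enabled before rebooting, the same --deviceInfo trick from earlier works; the feature may be listed as "Power-Up In Standby" rather than "PUIS" in the output, hence the broad grep:

/tmp/openSeaChest-v23.03.1-linux-x86_64-portable/openSeaChest_PowerControl -d /dev/sdg --deviceInfo | grep -iE 'puis|power-up'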

Boot failure + openSeaChest segmentation fault

After diving deep into the Linux kernel (with @bacher09) using bpftrace and trace-cmd, we found the root cause of the weird system behavior I've been having.

I have a ZFS root:

root@box51 [~]# zfs get mountpoint rpool/ROOT/os
NAME           PROPERTY    VALUE       SOURCE
rpool/ROOT/os  mountpoint  /           received

But my hard drives contain a second ZFS pool. At some point, I accidentally replicated a dataset to this second ZFS pool with mountpoint=/:

root@box51 [~]# zfs get mountpoint spool/Backups/…/rpool/ROOT/os
NAME                           PROPERTY    VALUE       SOURCE
spool/Backups/…/rpool/ROOT/os  mountpoint  /           received

ZFS doesn't complain about a mount conflict when importing this second pool, even though it leads to a duplicate mountpoint=/. Instead, the kernel's symbolic link traversal quietly ends up in this second pool when / is reached through a relative path like '/etc/mtab' -> '../proc/self/mounts'. The other dataset happens to have an empty /proc, so ../proc/self does not resolve even though the original mount view has /proc/self/mounts:

root@box51 [~]# head /etc/mtab
head: cannot open '/etc/mtab' for reading: No such file or directory
root@box51 [~]# readlink /etc/mtab
../proc/self/mounts
root@box51 [~]# head /proc/self/mounts
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,nosuid,relatime,size=49359908k,nr_inodes=12339977,mode=755,inode64 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,noexec,relatime,size=9880868k,mode=755,inode64 0 0
rpool/ROOT/os / zfs rw,relatime,xattr,posixacl 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,inode64 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,inode64 0 0
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0

@bacher09 used trace-cmd record -p function_graph -g '*openat*' -F head /run/mtab to determine that zpl_get_link() was being used, whereas proc_self_get_link() should have appeared once the traversal entered the proc file system and the self path under /proc. That was not happening, which meant the symbolic link dereferencing was staying within ZFS.

To try to pinpoint why the kernel wasn't reaching the proc file system, @bacher09 tried to attach kprobe/kretprobe probes, but either my kernel or my bpftrace wasn't allowed(?) to hook into the offsets of the mount traversal functions to find out which if statements were leading to the wrong path. @bacher09 could only make some educated guesses, like perhaps mounting over the initramfs environment was causing the problem if the initramfs's /proc was still present somewhere in the kernel.

Eventually, we noticed that after exporting the second ZFS pool, /etc/mtab started dereferencing successfully, and that importing the pool again broke /etc/mtab right away. I then scanned the mountpoint properties with zfs get -r mountpoint and found the dataset with the duplicate /.
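The scan boils down to a one-liner if anyone needs it (a sketch; -H and -o just make the output easy to filter, and you should see the root dataset plus any offender):

zfs get -r -H -o name,value mountpoint | awk '$2 == "/"'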

Although I didn't track it down, presumably /lib/systemd/systemd-networkd was waiting for some other symlinked file to appear, but the duplicate / ZFS mount masked it. Unsetting the mountpoint with zfs inherit mountpoint … immediately fixed the malfunctioning symlinks; I was able to boot successfully, and openSeaChest could access /etc/mtab again.


To conclude, my hard drives' EPC timers being ignored had nothing to do with openSeaChest. The fix for handling an inaccessible /etc/mtab does prevent openSeaChest from crashing, though. The rest of my system weirdness was user error that triggered undefined behavior in the kernel and OpenZFS.

We can close this issue now that both of my problems have been resolved.

Thank you very much for the help and rubber-ducking, @vonericsen and @bacher09!

vonericsen commented 1 year ago

@Deltik,

Thank you for sharing that information and following up on the cause! I would not have expected a duplicate ZFS mountpoint to cause this, but I think what you said about the network waiting for something else to appear is likely correct.

I'm happy to hear the HBA firmware update resolved the EPC problem, and looking through the release notes, I agree that it is not clear what exactly resolved the issue. I have had similar experiences when updating HBA firmware where an issue is resolved even though it is not directly mentioned in the change logs. My guess is that the EPC issue you were seeing is related to one of the issues mentioned in the release notes, but it was not known that EPC was also affected when LSI/Broadcom worked on fixing the problem.

I'm also glad that my fix to the code for opening /etc/mtab stopped the crash from occurring. While it was not the root of the issue, openSeaChest still should not have been crashing in the first place.

Because of the issue you saw when enabling PUIS, I have added a warning to the help output in openSeaChest about enabling that feature with SAS/SATA HBAs (like the one you used). The warning basically says to check the HBA documentation to see whether this SATA feature is supported. I saw this same issue a few years ago and asked a couple of HBA vendors about it. Some said they were looking into adding support but did not provide a timeline; others said it is only supported on specific HBA models and that this is noted in their documentation.