Seagate / openSeaChest

Cross platform utilities useful for performing various operations on SATA, SAS, NVMe, and USB storage devices.

X18 and X24 disks frequently reset with SAS3008 HBAs under heavy write load #162

Open putnam opened 3 days ago

putnam commented 3 days ago

I have a bunch (11 each) of ST24000NM000C and ST16000NM001G drives that cause major issues with my SAS3008-based HBA (the onboard HBA on the Supermicro H12SSL-CT, but also just on a regular 9300-8i). Specifically the HBA hits some failure mode under heavy write loads to these new X24's and the driver triggers a whole HBA reset. Heavy reads seem to not be affected.

The default EPC settings differ between the X18s and the X24s. The X18s seem to have Idle_A set to 1 and Idle_B set to 1200; the X24 firmware only has Idle_A set to 1. The first time I saw this occur, I disabled EPC on the new X24s with --EPCfeature disable, and I thought it was resolved, but the next time I had a pretty sustained write load it happened again.
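For scale: EPC timers are specified in units of 100 ms, so assuming the Idle_A and Idle_B numbers quoted above are raw timer settings, converting them to seconds is simple arithmetic. The helper name below is made up for illustration.

```python
# EPC timer values are expressed in units of 100 ms.
# Assumption: the numbers quoted in the post are raw timer values in that unit.
EPC_UNITS_PER_SECOND = 10


def epc_timer_seconds(raw_value: int) -> float:
    """Convert a raw EPC timer value (100 ms units) to seconds."""
    return raw_value / EPC_UNITS_PER_SECOND


if __name__ == "__main__":
    # Values taken from the post: Idle_A = 1, Idle_B = 1200.
    for name, raw in (("Idle_A", 1), ("Idle_B", 1200)):
        print(f"{name}: raw={raw} -> {epc_timer_seconds(raw):g} s")
```

Under that assumption, Idle_A fires after a tenth of a second of inactivity, while Idle_B takes two minutes.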

I didn't have this issue when it was purely the X18 disks on this adapter. It was only once the X24s were added to the mix that I saw this occur. It also does not occur with HGST/WD disks.

All X18 disks are on SN02, except one RMA refurbed ST16000NM000J on SN04. All X24 disks are on SN02. The SAS3008 HBA is on 16.00.14.00. It is actively cooled and temp is monitored and not overheating. Disks are all attached on a Supermicro 846 SAS3 backplane/LSI expander on 66.16.11.00. Kernel is 6.10.11-amd64, current Debian testing/trixie.

Here's dmesg during a heavy write load triggering the problem:

[Wed Oct 16 01:13:02 2024] mpt3sas_cm0 fault info from func: mpt3sas_base_make_ioc_ready
[Wed Oct 16 01:13:02 2024] mpt3sas_cm0: fault_state(0x5854)!
[Wed Oct 16 01:13:02 2024] mpt3sas_cm0: sending diag reset !!
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: diag reset: SUCCESS
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: In func: _ctl_do_mpt_command
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: Command terminated due to Host Reset
[Wed Oct 16 01:13:03 2024] mf:

[Wed Oct 16 01:13:03 2024] 0000000b
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000018
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000008
[Wed Oct 16 01:13:03 2024]

[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 0000000a
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 02000000
[Wed Oct 16 01:13:03 2024]

[Wed Oct 16 01:13:03 2024] 00000025
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000

[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: _base_display_fwpkg_version: complete
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: overriding NVDATA EEDPTagMode setting
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: LSISAS3008: FWVersion(16.00.14.00), ChipRevision(0x02)
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: sending port enable !!
[Wed Oct 16 01:13:10 2024] mpt3sas_cm0: port enable: SUCCESS
[Wed Oct 16 01:13:10 2024] mpt3sas_cm0: search for end-devices: start
[Wed Oct 16 01:13:10 2024] scsi target0:0:0: handle(0x000a), sas_addr(0x5003048017ab9940)
[Wed Oct 16 01:13:10 2024] scsi target0:0:0: enclosure logical id(0x5003048017ab997f), slot(0)
[Wed Oct 16 01:13:10 2024] scsi target0:0:1: handle(0x000b), sas_addr(0x5003048017ab9941)
[Wed Oct 16 01:13:10 2024] scsi target0:0:1: enclosure logical id(0x5003048017ab997f), slot(1)
[Wed Oct 16 01:13:10 2024] scsi target0:0:2: handle(0x000c), sas_addr(0x5003048017ab9942)
[Wed Oct 16 01:13:10 2024] scsi target0:0:2: enclosure logical id(0x5003048017ab997f), slot(2)
[Wed Oct 16 01:13:10 2024] scsi target0:0:3: handle(0x000d), sas_addr(0x5003048017ab9943)
[Wed Oct 16 01:13:10 2024] scsi target0:0:3: enclosure logical id(0x5003048017ab997f), slot(3)
[Wed Oct 16 01:13:10 2024] scsi target0:0:4: handle(0x000e), sas_addr(0x5003048017ab9944)
[Wed Oct 16 01:13:10 2024] scsi target0:0:4: enclosure logical id(0x5003048017ab997f), slot(4)
[Wed Oct 16 01:13:10 2024] scsi target0:0:5: handle(0x000f), sas_addr(0x5003048017ab9945)
[Wed Oct 16 01:13:10 2024] scsi target0:0:5: enclosure logical id(0x5003048017ab997f), slot(5)
[Wed Oct 16 01:13:10 2024] scsi target0:0:6: handle(0x0010), sas_addr(0x5003048017ab9946)
[Wed Oct 16 01:13:10 2024] scsi target0:0:6: enclosure logical id(0x5003048017ab997f), slot(6)
[Wed Oct 16 01:13:10 2024] scsi target0:0:7: handle(0x0011), sas_addr(0x5003048017ab9947)
[Wed Oct 16 01:13:10 2024] scsi target0:0:7: enclosure logical id(0x5003048017ab997f), slot(7)
[Wed Oct 16 01:13:10 2024] scsi target0:0:8: handle(0x0012), sas_addr(0x5003048017ab9948)
[Wed Oct 16 01:13:10 2024] scsi target0:0:8: enclosure logical id(0x5003048017ab997f), slot(8)
[Wed Oct 16 01:13:10 2024] scsi target0:0:9: handle(0x0013), sas_addr(0x5003048017ab9949)
[Wed Oct 16 01:13:10 2024] scsi target0:0:9: enclosure logical id(0x5003048017ab997f), slot(9)
[Wed Oct 16 01:13:10 2024] scsi target0:0:10: handle(0x0014), sas_addr(0x5003048017ab994a)
[Wed Oct 16 01:13:10 2024] scsi target0:0:10: enclosure logical id(0x5003048017ab997f), slot(10)
[Wed Oct 16 01:13:10 2024] scsi target0:0:11: handle(0x0015), sas_addr(0x5003048017ab994b)
[Wed Oct 16 01:13:10 2024] scsi target0:0:11: enclosure logical id(0x5003048017ab997f), slot(11)
[Wed Oct 16 01:13:10 2024] scsi target0:0:12: handle(0x0016), sas_addr(0x5003048017ab995c)
[Wed Oct 16 01:13:10 2024] scsi target0:0:12: enclosure logical id(0x5003048017ab997f), slot(12)
[Wed Oct 16 01:13:10 2024] scsi target0:0:13: handle(0x0017), sas_addr(0x5003048017ab995d)
[Wed Oct 16 01:13:10 2024] scsi target0:0:13: enclosure logical id(0x5003048017ab997f), slot(13)
[Wed Oct 16 01:13:11 2024] scsi target0:0:14: handle(0x0018), sas_addr(0x5003048017ab995e)
[Wed Oct 16 01:13:11 2024] scsi target0:0:14: enclosure logical id(0x5003048017ab997f), slot(14)
[Wed Oct 16 01:13:11 2024] scsi target0:0:15: handle(0x0019), sas_addr(0x5003048017ab995f)
[Wed Oct 16 01:13:11 2024] scsi target0:0:15: enclosure logical id(0x5003048017ab997f), slot(15)
[Wed Oct 16 01:13:11 2024] scsi target0:0:16: handle(0x001a), sas_addr(0x5003048017ab9960)
[Wed Oct 16 01:13:11 2024] scsi target0:0:16: enclosure logical id(0x5003048017ab997f), slot(16)
[Wed Oct 16 01:13:11 2024] scsi target0:0:17: handle(0x001b), sas_addr(0x5003048017ab9961)
[Wed Oct 16 01:13:11 2024] scsi target0:0:17: enclosure logical id(0x5003048017ab997f), slot(17)
[Wed Oct 16 01:13:11 2024] scsi target0:0:18: handle(0x001c), sas_addr(0x5003048017ab9963)
[Wed Oct 16 01:13:11 2024] scsi target0:0:18: enclosure logical id(0x5003048017ab997f), slot(19)
[Wed Oct 16 01:13:11 2024] scsi target0:0:19: handle(0x001d), sas_addr(0x5003048017ab9964)
[Wed Oct 16 01:13:11 2024] scsi target0:0:19: enclosure logical id(0x5003048017ab997f), slot(20)
[Wed Oct 16 01:13:11 2024] scsi target0:0:20: handle(0x001e), sas_addr(0x5003048017ab9966)
[Wed Oct 16 01:13:11 2024] scsi target0:0:20: enclosure logical id(0x5003048017ab997f), slot(22)
[Wed Oct 16 01:13:11 2024] scsi target0:0:21: handle(0x001f), sas_addr(0x5003048017ab9967)
[Wed Oct 16 01:13:11 2024] scsi target0:0:21: enclosure logical id(0x5003048017ab997f), slot(23)
[Wed Oct 16 01:13:11 2024] scsi target0:0:22: handle(0x0020), sas_addr(0x5003048017ab997d)
[Wed Oct 16 01:13:11 2024] scsi target0:0:22: enclosure logical id(0x5003048017ab997f), slot(24)
[Wed Oct 16 01:13:11 2024] mpt3sas_cm0: search for end-devices: complete
[Wed Oct 16 01:13:11 2024] mpt3sas_cm0: search for end-devices: start
[Wed Oct 16 01:13:11 2024] mpt3sas_cm0: search for PCIe end-devices: complete
[Wed Oct 16 01:13:11 2024] mpt3sas_cm0: search for expanders: start
[Wed Oct 16 01:13:11 2024]      expander present: handle(0x0009), sas_addr(0x5003048017ab997f), port:255
[Wed Oct 16 01:13:11 2024] mpt3sas_cm0: search for expanders: complete
[Wed Oct 16 01:13:11 2024] mpt3sas_cm0: mpt3sas_base_hard_reset_handler: SUCCESS
[Wed Oct 16 01:13:11 2024] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
[Wed Oct 16 01:13:11 2024] sd 0:0:0:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:4:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:9:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:1:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:11:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:3:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:17:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:6:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:7:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:8:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:10:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:12:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:13:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:14:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:15:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:16:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:18:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:19:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:20:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:2:0: device_block, handle(0x000c)
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: removing unresponding devices: start
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: removing unresponding devices: end-devices
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: Removing unresponding devices: pcie end-devices
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: removing unresponding devices: expanders
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: removing unresponding devices: complete
[Wed Oct 16 01:13:12 2024] sd 0:0:2:0: device_unblock and setting to running, handle(0x000c)
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: scan devices: start
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0:         scan devices: expanders start
[Wed Oct 16 01:13:12 2024] sd 0:0:5:0: attempting task abort!scmd(0x00000000bfccca11), outstanding for 2948 ms & timeout 1000 ms
[Wed Oct 16 01:13:12 2024] sd 0:0:5:0: [sde] tag#187 CDB: ATA command pass through(16) 85 08 0e 00 d5 00 01 00 e0 00 4f 00 c2 00 b0 00
[Wed Oct 16 01:13:12 2024] scsi target0:0:5: handle(0x000f), sas_address(0x5003048017ab9945), phy(5)
[Wed Oct 16 01:13:12 2024] scsi target0:0:5: enclosure logical id(0x5003048017ab997f), slot(5)
[Wed Oct 16 01:13:12 2024] scsi target0:0:5: enclosure level(0x0000), connector name(     )
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0:         break from expander scan: ioc_status(0x0022), loginfo(0x310f0400)
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0:         scan devices: expanders complete
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0:         scan devices: end devices start
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0:         break from end device scan: ioc_status(0x0022), loginfo(0x310f0400)
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0:         scan devices: end devices complete
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0:         scan devices: pcie end devices start
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0:         break from pcie end device scan: ioc_status(0x0021), loginfo(0x3003011d)
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0:         pcie devices: pcie end devices complete
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: scan devices: complete
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: device is not present handle(0x000c), flags!!!
[Wed Oct 16 01:13:12 2024] sd 0:0:5:0: task abort: SUCCESS scmd(0x00000000bfccca11)
[Wed Oct 16 01:13:20 2024] sd 0:0:2:0: Power-on or device reset occurred
[Wed Oct 16 01:13:20 2024] sd 0:0:5:0: Power-on or device reset occurred
[Wed Oct 16 01:13:20 2024] sd 0:0:21:0: Power-on or device reset occurred
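To see at a glance which devices came back right after the HBA reset and which lagged, a log like the one above can be grouped by timestamp. This is a minimal sketch, assuming `dmesg -T` style lines in the shape shown here; the function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [Wed Oct 16 01:13:11 2024] sd 0:0:4:0: Power-on or device reset occurred
RESET_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+sd (?P<hctl>\d+:\d+:\d+:\d+): "
    r"Power-on or device reset occurred"
)


def resets_by_timestamp(lines):
    """Group SCSI addresses (H:C:T:L) by the timestamp of their reset message."""
    groups = defaultdict(list)
    for line in lines:
        match = RESET_RE.match(line.strip())
        if match:
            groups[match.group("ts")].append(match.group("hctl"))
    return dict(groups)


if __name__ == "__main__":
    # Three lines copied from the log above; only two are reset messages.
    sample = [
        "[Wed Oct 16 01:13:11 2024] sd 0:0:0:0: Power-on or device reset occurred",
        "[Wed Oct 16 01:13:11 2024] sd 0:0:2:0: device_block, handle(0x000c)",
        "[Wed Oct 16 01:13:20 2024] sd 0:0:2:0: Power-on or device reset occurred",
    ]
    for ts, devices in resets_by_timestamp(sample).items():
        print(f"{ts}: {len(devices)} device(s): {', '.join(devices)}")
```

Applied to the full log, this separates the nineteen devices that report a reset at 01:13:11 from targets 2, 5, and 21, which only report it at 01:13:20.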

I contacted Seagate support and uh, they told me to install some Windows-only software to monitor for firmware updates and didn't know how to respond to anything technical at all. So I hope maybe through you guys this info is useful.

vonericsen commented 2 days ago

Hi @putnam,

Sorry you are having issues in your system. To make sure I am following your issue this is what you have seen happening:

Is this correct?

From the standards, disabling EPC should hold across resets and power cycles. Are you seeing that the EPC feature is being enabled again even when you have not sent the enable?

As for firmware updates, sometimes those can help (both HBA side and drive side). From the Seagate support site there is a Firmware update finder that you can provide a serial number to check for new firmware. You don't need the other Windows only tool (it basically scans and opens that webpage for you with the SN already loaded). If you scroll to the bottom of this page you can provide a serial number of a drive to check for newer firmware. I don't know if that would resolve the issue, but you can try it.

I am asking around to see if any of the customer support engineers have run into this as well, but I have not heard anything yet.

putnam commented 2 days ago

Thanks so much for the response. I edited my original ticket a lot, so I think you're responding to the initial version. I realized, looking at bash history and the state of the disks, that:

  1. --EPCfeature disable did actually persist. You're right.
  2. Disabling EPC didn't resolve the issues with the X24 disks after all, because I didn't truly load them with writes. Once they were loaded with writes again the same behavior came back.

I'm sure this is now outside the scope of this repo, but you guys have been so useful in the past when reporting possible firmware bugs. Maybe it's useful to have shared it here anyway. I'm not an enterprise customer, just an end user, so it's hard to get a line to someone with inside engineering connections.

I can repro more consistently now by just copying a lot of data to the disks. I have found very little info on these particular 20TB models since I understand they're technically binned/refurbed X24 HAMR disks. It may well be an issue with the LSI/Broadcom firmware or even mpt3sas, but again it doesn't repro on my 60+ HGST/WD disks or the X16's on their own.

Since we're almost certainly outside the scope of openSeaChest here, feel free to close. But if it's something you guys are open to pursuing with more debug data and info, I could share it here or privately over email.

Regarding firmware: the end-user portal shows no update available for these yet.

vonericsen commented 1 day ago

@putnam,

I did pass this issue along to some people internally to see if they've seen similar problems before with these drives and hardware, but I have not heard anything yet.

If you dump the SATA phy event counters, are you seeing those increase at all? openSeaChest_Info -d <handle> --showPhyEvents

If these are increasing (not just the reset counter, but others), it can point towards a cabling issue.
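One way to check for increases is to capture the `--showPhyEvents` output before and after a write load and compare the two. The sketch below assumes each counter row has the shape `<ID> <Value> <Description>`, optionally preceded by the V/M flags from the legend; the function names, and treating ID 10 as "the reset counter", are assumptions made for illustration.

```python
import re

# Assumed row shape: optional V/M flags, then ID, value, description.
ROW_RE = re.compile(r"^[\sVM]*(\d+)\s+(\d+)\s+(\S.*)$")

# Assumption: ID 10 ("H2D FISes sent due to COMRESET") is the reset counter.
RESET_COUNTER_ID = 10


def parse_phy_counters(text):
    """Return {counter_id: (value, description)} from --showPhyEvents output."""
    counters = {}
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match:
            counters[int(match.group(1))] = (int(match.group(2)), match.group(3).strip())
    return counters


def increased(before, after):
    """List (id, old, new, description) for counters that grew between snapshots."""
    changes = []
    for cid, (new, desc) in sorted(after.items()):
        old = before.get(cid, (0, desc))[0]
        if new > old:
            changes.append((cid, old, new, desc))
    return changes


if __name__ == "__main__":
    # Hypothetical snapshots, not real measurements.
    before = parse_phy_counters("10 16 H2D FISes sent due to COMRESET\n"
                                " 1  2 Command failed with iCRC error\n")
    after = parse_phy_counters("10 17 H2D FISes sent due to COMRESET\n"
                               " 1  3 Command failed with iCRC error\n")
    for cid, old, new, desc in increased(before, after):
        note = " (reset counter)" if cid == RESET_COUNTER_ID else ""
        print(f"ID {cid}: {old} -> {new}  {desc}{note}")
```

Growth in counters other than ID 10 is the signal to look for.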

I'll see if there is anything else I can think of trying that might also help debug this.

putnam commented 1 day ago

Thanks for the reply! OK, so here are the PHY counters from openSeaChest_Info --showPhyEvents for the different Seagate disks hanging onto this backplane/controller. Do you know whether this is a rolling window or lifetime? Back in September when I first got these disks, I replaced the internal SAS cables due to CRC errors during the initial ZFS resilvering. You know, changing firmwares and cables and the workload is always a bunch of variable juggling and I don't want to get it wrong here. But when I changed the internal cables, the resilver continued without any issue or drops and the CRC errors went away at that time. I also have multiple brand new 3M and Amphenol cables on the shelf here and can swap them in to try to eliminate the cable variable one more time, if you like. It wouldn't be the first, or the second, or the third time that cabling randomly came up. In the last 10+ years of dealing with SAS2/SAS3 I feel like cables are an evergreen issue that everyone faces.

Anyway, the resets I see now are specifically when ZFS is copying a large amount of data to the pool and is lighting up the vdevs made up of Seagate devices for a sustained amount of time. Eventually, you see the same message about the HBA resetting with the same fault code in mpt3sas. I did some digging in the mpt3sas driver hoping to find some bitflags or something to identify the fault code but it looks to be internal/proprietary to Broadcom/LSI.
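The fault_state value stays opaque, but the log_info words in the dmesg output can at least be split into the fields the driver prints. The sketch below assumes a layout of a 4-bit bus type, 4-bit originator, 8-bit code, and 16-bit sub_code; that layout reproduces the `originator(IOP), code(0x03), sub_code(0x011d)` line logged for 0x3003011d. The originator names are the ones I believe mpt3sas uses, and the meaning of individual codes is still vendor-defined.

```python
# Assumed bit layout of an mpt3sas log_info word:
#   bits 31-28 bus type, 27-24 originator, 23-16 code, 15-0 sub_code
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}  # assumption: names as printed by the driver


def decode_log_info(value: int) -> dict:
    """Split a 32-bit log_info word into its fields."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


if __name__ == "__main__":
    # Both values appear in the dmesg output earlier in this thread.
    for word in (0x3003011D, 0x310F0400):
        fields = decode_log_info(word)
        print(
            f"0x{word:08x}: bus_type={fields['bus_type']} "
            f"originator({fields['originator']}) "
            f"code(0x{fields['code']:02x}) sub_code(0x{fields['sub_code']:04x})"
        )
```

Under this layout, the 0x310f0400 value reported when the expander and end-device scans break off has a different originator (PL) from the 0x3003011d value logged during the PCIe scan (IOP).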

20TB X24 Disks (Newer)

```
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160234 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA04RL6 - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10     16 H2D FISes sent due to COMRESET
   1      2 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      2 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS
  11      2 CRC errors withing H2D FIS
  13      0 Non-CRC errors within H2D FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160234 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA09AWD - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10     11 H2D FISes sent due to COMRESET
   1      2 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      2 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS
  11      2 CRC errors withing H2D FIS
  13      0 Non-CRC errors within H2D FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160234 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA09KJB - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10     12 H2D FISes sent due to COMRESET
   1      3 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      3 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS
  11      3 CRC errors withing H2D FIS
  13      0 Non-CRC errors within H2D FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160234 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA09QQH - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10     13 H2D FISes sent due to COMRESET
   1      0 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      0 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS
  11      0 CRC errors withing H2D FIS
  13      0 Non-CRC errors within H2D FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160234 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA09RSQ - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10     16 H2D FISes sent due to COMRESET
   1      3 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      3 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS
  11      3 CRC errors withing H2D FIS
  13      0 Non-CRC errors within H2D FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160234 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA0BKFL - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10     10 H2D FISes sent due to COMRESET
   1      2 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      2 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS
  11      2 CRC errors withing H2D FIS
  13      0 Non-CRC errors within H2D FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160234 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA0C241 - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10     17 H2D FISes sent due to COMRESET
   1      2 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      2 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS
  11      2 CRC errors withing H2D FIS
  13      0 Non-CRC errors within H2D FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160235 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA0CWPX - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10     13 H2D FISes sent due to COMRESET
   1      4 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      4 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS
  11      4 CRC errors withing H2D FIS
  13      0 Non-CRC errors within H2D FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160235 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA0D2EL - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10     12 H2D FISes sent due to COMRESET
   1      0 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      0 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS
  11      0 CRC errors withing H2D FIS
  13      0 Non-CRC errors within H2D FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160235 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA0EWXY - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10     13 H2D FISes sent due to COMRESET
   1      5 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      5 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS
  11      5 CRC errors withing H2D FIS
  13      0 Non-CRC errors within H2D FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160235 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA0GJGN - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10     19 H2D FISes sent due to COMRESET
   1      1 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      1 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      1 R_ERR response for H2D non-data FIS
  11      2 CRC errors withing H2D FIS
  13      0 Non-CRC errors within H2D FIS
```

16TB X18 Disks (Older, pre-existing without resets)

```
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160554 User: root
==========================================================================================
- ST16000NM000J-2TW103 - ZR5ECA55 - SN04 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10      3 H2D FISes sent due to COMRESET
   1      1 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      1 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160554 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL20CAJ9 - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10      4 H2D FISes sent due to COMRESET
   1      0 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      0 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160554 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL20D3TL - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10      7 H2D FISes sent due to COMRESET
   1      1 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      1 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160554 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL20YT4M - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10      5 H2D FISes sent due to COMRESET
   1      1 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      1 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160554 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL213QN0 - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10      4 H2D FISes sent due to COMRESET
   1      0 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      0 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160555 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL21909L - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10      7 H2D FISes sent due to COMRESET
   1      2 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      2 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160555 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL21AHY7 - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10     12 H2D FISes sent due to COMRESET
   1      1 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      1 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      1 R_ERR response for H2D non-data FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160555 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL21L84X - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10     16 H2D FISes sent due to COMRESET
   1      2 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      2 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160555 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL21L97Y - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10      4 H2D FISes sent due to COMRESET
   1      1 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      1 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160555 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL21LGDW - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10      4 H2D FISes sent due to COMRESET
   1      1 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      1 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS

==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160555 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL21LYZC - SN02 - ATA

====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
  ID  Value Description
  10      6 H2D FISes sent due to COMRESET
   1      1 Command failed with iCRC error
   3      0 R_ERR response for D2H data FIS
   4      1 R_ERR response for H2D data FIS
   6      0 R_ERR response for D2H non-data FIS
   7      0 R_ERR response for H2D non-data FIS
```
vonericsen commented 1 day ago

> Do you know whether this is a rolling window or lifetime?

For this log page, the counters keep counting until you reset them. I don't remember whether we have added a reset option to openSeaChest yet. I will have to review the code.

The reason I mentioned the CRC errors is due to some of my own past experience trying to troubleshoot some issues other customers have seen.

I have also had some long conversations with one of the Seagate engineers who works at the phy level, with the goal of figuring out a way to write a test for detecting a bad cable. It's not an easy task 😆 but we did come up with some ideas, including using these logs. I have not had time to implement it yet, but it will be an expanded version of the openSeaChest_GenericTests --bufferTest routine I already have. Sometimes that will detect an error, but it runs for far too short a time to be reliable right now.

One thing I learned from him is that the faster the interface is running (6Gb/s vs 3Gb/s), the sooner you notice signaling issues. The most common symptom is the CRC counters increasing. This is often due to a cabling problem. Not always, but in your case I suspect it is, since it's happening on multiple different drives, even drives that were not previously having an issue. It's possible that these new drives have slightly different phy behavior that managed to bring this out. There are a few different issues that can happen on the bus that HBAs and drives both try to mitigate (such as signal reflections), but that can only go so far before the errors are no longer correctable. There are also limits on how many signal-level issues can be worked around, and maybe some existing problem that was manageable before is no longer manageable with these new drives (just guessing here).
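Since these counters accumulate until they are reset, one way to tell whether CRC errors are still occurring is to save the phy event counter output before and after a heavy write load and compare the two. Below is a minimal Python sketch of that comparison. It assumes the line-oriented layout shown in the output pasted above (a `- model - serial - firmware - ATA` line followed by `ID Value Description` rows); the helper names `parse_phy_counters` and `counter_deltas` are hypothetical, and rows carrying the `V` or `M` flags are not handled.

```python
import re

# Device line as it appears in the pasted output, e.g.
# "- ST16000NM001G-2KK103 - ZL21909L - SN02 - ATA" (a device handle may precede it).
DEVICE_RE = re.compile(r"-\s+(\S+)\s+-\s+(\S+)\s+-\s+(\S+)\s+-\s+ATA\s*$")
# Counter row: "<id> <value> <description>". Assumes no V/M flag columns.
COUNTER_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\S.*?)\s*$")


def parse_phy_counters(text):
    """Return {serial: {counter_id: value}} parsed from saved tool output."""
    drives = {}
    current = None
    for line in text.splitlines():
        dev = DEVICE_RE.search(line)
        if dev:
            # Group 2 is the serial number; start (or reuse) its counter table.
            current = drives.setdefault(dev.group(2), {})
            continue
        row = COUNTER_RE.match(line)
        if row and current is not None:
            current[int(row.group(1))] = int(row.group(2))
    return drives


def counter_deltas(before, after):
    """Return {serial: {counter_id: increase}} for counters that grew."""
    grown = {}
    for serial, counters in after.items():
        old = before.get(serial, {})
        diff = {
            cid: val - old.get(cid, 0)
            for cid, val in counters.items()
            if val > old.get(cid, 0)
        }
        if diff:
            grown[serial] = diff
    return grown
```

To use it, read the two saved outputs into strings, pass each to `parse_phy_counters`, and hand the results to `counter_deltas`. Any drive that shows up in the result had at least one counter increase during the test window.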

Another thing that can happen (and that I have experienced myself) is that similar symptoms appear as the backplane connectors wear out from plugging and unplugging drives. Eventually all connectors will fail, but as you approach the insertion count limit you can start to see these kinds of issues too.

I don't know if any of these will solve the issue, but you can try these things:

  1. Unplug the drives and plug them back in (sometimes this seats the connector better and may mitigate the issue).
  2. If you have backplanes and can replace them easily, give that a try.
  3. Replace the cables in the system.

openSeaChest_Configure also has an option to set the phy speed lower, which you can try, but it may limit your maximum sequential read/write on more modern drives. DO NOT go below 3.0Gb/s though. I found out that some modern SAS/SATA controllers no longer support 1.5Gb/s, and once the drive is set to that speed you will have to track down another HBA that does support it in order to restore a higher speed. I found this in the HBA documentation, so you can check yours to see what it supports first.

One last thing I want to mention: check for updates to the HBA firmware, since that may also help. I have seen fixes in HBA firmware resolve odd behavior, including odd phy issues on some past Broadcom HBAs, but I don't know whether that applies to this specific case.

Let me know if this helps. I'll see if I can talk to that signal engineer I mentioned about this to see if he has any other ideas.

putnam commented 1 day ago

Thanks. Will go over and try. Regarding the HBA, it's a pretty common SAS3008 HBA and on latest firmware (16.00.14.00). The backplane hasn't had a ton of insertion cycles, but reseating can't hurt. I will swap to a new-in-bag Amphenol cable set + reseat disks and see if I can repro again and report back.