EPC timers no longer working after long SMART test

luukrijnbende commented 1 year ago

Hi, I have 4 Seagate Exos X18 18TB drives (ST18000NM000J-2TV103) where the EPC timers are no longer working after a long SMART test. I can transition power states manually, and the drives stay in that state until I access them again.

Things already tried:

Running another long SMART test
Upgrading firmware to SN04
Disabling and enabling EPC
Changing the EPC timers
Power cycling the drives
Connecting the drives to a different system
Ensuring no I/O is done, which is confirmed by the fact that they stay in the manually set power state

Below is the info of one of the drives:

/dev/sg0 - ST18000NM000J-2TV103 - ZR52TKTR - SN04 - ATA
        Model Number: ST18000NM000J-2TV103
        Serial Number: ZR52TKTR
        Firmware Revision: SN04
        World Wide Name: 5000C500DBDFF230
        Date Of Manufacture: Week 27, 2021
        Drive Capacity (TB/TiB): 18.00/16.37
        Native Drive Capacity (TB/TiB): 18.00/16.37
        Temperature Data:
                Current Temperature (C): 34
                Highest Temperature (C): 58
                Lowest Temperature (C): 21
        Power On Time:  1 year 6 days 16 hours
        Power On Hours: 15064.00
        MaxLBA: 35156656127
        Native MaxLBA: 35156656127
        Logical Sector Size (B): 512
        Physical Sector Size (B): 4096
        Sector Alignment: 0
        Rotation Rate (RPM): 7200
        Form Factor: 3.5"
        Last DST information:
                Time since last DST (hours): 17.00
                DST Status/Result: 0x0
                DST Test run: 0x2
        Long Drive Self Test Time:  1 day 1 hour 31 minutes
        Interface speed:
                Max Speed (Gb/s): 6.0
                Negotiated Speed (Gb/s): 6.0
        Annualized Workload Rate (TB/yr): 4163.70
        Total Bytes Read (PB): 7.15
        Total Bytes Written (TB): 13.82
        Encryption Support: Not Supported
        Cache Size (MiB): 256.00
        Read Look-Ahead: Enabled
        Write Cache: Enabled
        Low Current Spinup: Enabled
        SMART Status: Good
        ATA Security Information: Supported
        Firmware Download Support: Full, Segmented, Deferred
        Specifications Supported:
                ACS-4
                ACS-3
                ACS-2
                ATA8-ACS
                ATA/ATAPI-7
                ATA/ATAPI-6
                ATA/ATAPI-5
                SATA 3.3
                SATA 3.2
                SATA 3.1
                SATA 3.0
                SATA 2.6
                SATA 2.5
                SATA II: Extensions
                SATA 1.0a
                ATA8-AST
        Features Supported:
                Sanitize
                SATA NCQ
                SATA Software Settings Preservation [Enabled]
                SATA Device Initiated Power Management
                Power Management
                Security
                SMART [Enabled]
                48bit Address
                PUIS [Enabled]
                GPL
                Streaming
                SMART Self-Test
                SMART Error Logging
                Write-Read-Verify
                DSN
                AMAC
                EPC [Enabled]
                Sense Data Reporting [Enabled]
                SCT Write Same
                SCT Error Recovery Control
                SCT Feature Control
                SCT Data Tables
                Host Logging
                Set Sector Configuration
                Storage Element Depopulation + Restore
                Field Accessible Reliability Metrics (FARM)
                Seagate In Drive Diagnostics (IDD)
        Adapter Information:
                Adapter Type: PCI
                Vendor ID: 1022h
                Product ID: 43EBh
                Revision: 0000h

DebabrataSTX commented 1 year ago

We have an existing thread on a similar issue. Please have a look at it; it might help. https://github.com/Seagate/openSeaChest/issues/117

luukrijnbende commented 1 year ago

Thanks for your response. I've already looked at that thread and was thinking it could maybe be background activity related to the SMART test somehow, but there is no activity from the host so I don't know if there is something still running on the drives? Any way to check?

It seems odd that it magically starts working again after a certain number of hours. However, due to temperature and power constraints, I can't really let them run at full tilt all the time until the problem fixes itself.

vonericsen commented 1 year ago

Hi @luukrijnbende,

I have been asking around about this to try and get an idea what is happening.

There is not enough information to know for sure, but the best guess is that the SMART self-test (long DST) paused some background activity that normally runs based on timing, so once the drive finished the SMART self-test the drive has been trying to catch up and finish that background work when it can.

There are lots of different kinds of background tasks in the drive. Some run periodically in the power states that allow it (active, idle_a), and others get scheduled as needed when the drive is used (reads, writes, etc). Background activity can be scheduled for health monitoring, performance, reliability, or data-integrity reasons, so it is also entirely possible that something else triggered it that may not even be related to the SMART self-test. Background activity can be paused for reads, writes, and even a long self-test, since it is considered lower priority than servicing these host-requested commands/operations. Once the drive has enough idle time between these requests, it will attempt to do these background tasks. Background activity is not allowed to start in the lower power modes that unload the heads (idle_b, idle_c, standby_y, standby_z), and entering one of them also pauses anything the drive has scheduled to run in the background.
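
The power-state rule described above can be written down as a small lookup. This is only an illustrative sketch of the description in this comment, not drive firmware logic; the function and set names are hypothetical.

```python
# States in which, per the explanation above, periodic background tasks may run.
BACKGROUND_ALLOWED = {"active", "idle_a"}

# States that unload the heads; background activity cannot start in these.
HEADS_UNLOADED = {"idle_b", "idle_c", "standby_y", "standby_z"}


def background_may_start(power_state: str) -> bool:
    """Return True if the drive may start background work in this state."""
    state = power_state.lower()
    if state in BACKGROUND_ALLOWED:
        return True
    if state in HEADS_UNLOADED:
        return False
    raise ValueError(f"unknown EPC power state: {power_state!r}")
```

One consequence, if this reading is right: manually forcing a drive into standby_z pauses pending background work rather than letting it finish.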

lbogdan commented 1 year ago

@luukrijnbende I also hit this, after copying a few TB of data to some new Seagate IronWolf drives.

Did the issue eventually fix itself for you? (see my comment here)

smunaut commented 10 months ago

I have a similar situation... I ran an extended self-test that took 12h to complete, and the disks are no longer going to idle. They've been in active mode for > 72h now, with no sign of them resuming normal EPC operation...

vonericsen commented 10 months ago

@smunaut,

Sorry I did not see your comment sooner. Is this still an issue for you after a few more days? This seems like a lot of additional time to get back to "normal" again. Have you tried power cycling the drive (like shut down, then power back up after 30 seconds or so)? Can you also share your MN and FW revision?

And if anyone else has seen this issue, has it occurred after a short DST? Or only after a long DST?

smunaut commented 10 months ago

They're still not going to sleep, but since I posted they have spent 12h a day (overnight) in standby_z, because I transition them manually. For the first 3 days I left them in active mode hoping they would finish whatever they were busy with, but since then I have set up a script to put them into standby_z for the night. (And they do stay in that mode, so the host is definitely not issuing any stray commands.)

    Model Number: ST6000VN001-2BB186
    Firmware Revision: SC60

I did reboot for sure. I think I did a full power off, but I'm not 100% sure anymore. I will retry that when I get the chance. (ATM the machine they are in is being used, so it'll probably have to wait until next weekend.)
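
For anyone wanting to try the same nightly workaround, a minimal sketch of such a script might look like the following. It is not the script used in this thread. The drive handles and the night window are hypothetical, and the `--transitionPower` option is an assumption that should be confirmed against `openSeaChest_PowerControl --help` for the installed version.

```python
import subprocess
from datetime import datetime, time

# Hypothetical values: adjust the handles and the night window to your system.
DRIVES = ["/dev/sg0", "/dev/sg1"]
NIGHT_START = time(22, 0)
NIGHT_END = time(10, 0)


def is_night(now: time) -> bool:
    """True inside the night window, which wraps past midnight."""
    return now >= NIGHT_START or now < NIGHT_END


def standby_command(handle: str) -> list[str]:
    """Build the openSeaChest call that requests standby_z for one drive.

    Assumption: --transitionPower is the PowerControl option for this;
    verify it with the tool's --help output before relying on it.
    """
    return ["openSeaChest_PowerControl", "-d", handle,
            "--transitionPower", "standby_z"]


def main() -> None:
    """Request standby_z on every listed drive, but only at night."""
    if not is_night(datetime.now().time()):
        return
    for handle in DRIVES:
        # check=False: a failure on one drive should not stop the others.
        subprocess.run(standby_command(handle), check=False)
```

Calling `main()` periodically as root (for example from a cron job) would re-request standby_z after any access that woke the drives.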

smunaut commented 10 months ago

I shut down the machine, even unplugged it from the wall, and let it sit for a bit. I then left the drives alone for 24h, and they still don't go to sleep according to their EPC timers.

vonericsen commented 9 months ago

@smunaut, Thank you for this update. We have been trying to repeat it internally but have not been able to so far.

Would you mind telling me more about the system hardware? Anything you can share would be helpful for us to try and figure out what is causing this. Some things that are helpful are:

Which controller/HBA is being used
If that HBA has a firmware, what firmware version (if you can find that).
operating system and operating system version
If you installed a driver for the HBA, which version was installed
Which filesystem the drive is formatted with
Any management software or out of band controllers in the system, if any
Any background services/cronjobs that might be running (We know smartd can sometimes prevent going into low power modes if it is pinging the drive too frequently, but maybe there is another one out there affecting this too)

I know that not all of these things can be shared, however any additional system information that you can share could be helpful to see if it is something we can try repeating the issue on.

smunaut commented 9 months ago

It's a small NAS exposing volumes through NFS and Samba.

All drives are connected directly to the SATA ports of the motherboard. It's an "MSI MPG B560I GAMING EDGE WIFI" running BIOS 7D19v18.
OS is Debian Bookworm stable
Drives are split: some parts run RAID0, some run RAID1, and some run RAID5. They all host encrypted btrfs volumes. There is a write-through bcache layer over them to an SSD to speed up access and also to avoid having to wake up the drives when reading small, often-accessed files.
No out of band management whatsoever.
There is a smartd running. I increased its poll period to make sure the drives can at least go into idle_c between two polls (and once smartd sees a drive is in idle_c, it stops polling). Just to make sure it wasn't the problem, I also tried setting short timeouts for idle_b / idle_c and stopping smartd, but that didn't help.
I also put the drives in standby_z and they remain there (unless I access the exposed share from a client, obviously).
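
The reasoning about the smartd poll period versus the EPC timers can be checked with a few lines. This is a rough sketch under stated assumptions: the timer values below are hypothetical placeholders (the real ones have to be read from the drive), and it assumes the worst case where every poll resets the drive's idle timers.

```python
# Hypothetical EPC timer values in seconds; read the real ones from the drive.
EPC_TIMERS_S = {"idle_a": 2, "idle_b": 120, "idle_c": 600, "standby_z": 3600}


def states_reachable_between_polls(poll_interval_s: float,
                                   timers_s: dict[str, float]) -> list[str]:
    """List the states whose timer expires before the next poll arrives.

    Assumes every poll resets the idle timers, which is the worst case
    for a poller that does not skip drives in low-power states.
    """
    return [state for state, timeout in timers_s.items()
            if timeout < poll_interval_s]
```

With these placeholder timers, a 30-minute poll interval (1800 s) would leave room for idle_a, idle_b, and idle_c, but not for standby_z.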

smunaut commented 9 months ago

Some other info :

On one of the drives I ran an "internal short self test", since someone in the other issue said it might have helped. I didn't notice any change from that.
On another one I ran a short self test to see if it would make it snap out of it ... no change either.

vonericsen commented 9 months ago

Thank you for that info!

The original issue also listed AMD hardware. I will see if I can repeat this on any AMD hardware...maybe it's related, maybe it's not.

Would you mind sharing the output of openSeaChest_Basics -d <handle> --llInfo? This outputs some information like what was detected as the driver, the version of the driver, and if our code detected any filesystem that is mounted on this device, and some other info on how commands get routed through our utility.

smunaut commented 9 months ago

B560 is an Intel chipset 😁 (here running an i3-10105). Yeah, B550 is AMD and B560 is Intel; confusing, I know...

==========================================================================================
 openSeaChest_PowerControl - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2023 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_PowerControl Version: 3.4.0-6_2_0 X86_64
 Build Date: Dec  1 2023
 Today: Mon Jan 29 18:00:14 2024    User: root
==========================================================================================

/dev/sg0 - ST6000VN001-2BB186 - ZR11NESD - SC60 - ATA

---Low Level tDevice information---
    ---Drive Info---
        media type: HDD
        drive type: ATA
        interface type: IDE/ATA
        zoned type: not zoned
        ---adapter info---
            PCI/PCIe:
            VendorID: 8086h
            ProductID: 43D2h
            Revision: 0011h
        ---driver info---
            driver name: ahci
            driver version string: 3.0

                major ver: 3
                minor ver: 0
        ---ata flags---
        SCSI Version: 7
        ---Passthrough Hacks---
            Passthrough type: SAT/system/none
                ---SCSI Hacks---
                ---NVMe Hacks---
                ---ATA Hacks---
    ---OS Info---
        handle name: /dev/sg0
        friendly name: sg0
        minimum memory alignment: 8
        ---Linux Unique info---
            FD is valid
            Second Handle name: /dev/sda
            Second Handle friendly name: sda
            SG Driver Version:
                Major: 3
                Minor: 5
                Revision: 36
        OS read-write recommended: false
        last recorded error: 22
        File system Info:
            No active file systems detected

vonericsen commented 9 months ago

> B560 is an Intel Chipset 😁 (here running an i3-10105). Yeah B550 is AMD, B560 is Intel confusing I know ..

🤦 ...I have an MSI motherboard at home with an almost identical name: MSI MPG B550 Gaming Edge Wifi. Naming these things so similarly makes it difficult to keep track of them.

Anyways, thanks for the additional info. So it does not seem to be a hardware-specific issue, and it looks like the standard AHCI driver is in use, so it is nothing specific to a driver that I have heard of before. We will keep trying to repeat this issue so we can figure out the root cause.

luukrijnbende commented 5 months ago

I have noticed that the EPC timers have started working again. When this first started happening, I wrote a little script to monitor ZFS activity and put the drives into idle states when there is none. After a reboot that script didn't come up again, but EPC did work.

I have no idea what the trigger was, maybe they finished their background tasks? Though I did notice today that they were active for a few hours without activity and just now they returned to idle_c and are happy there.
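
For reference, the activity check behind a monitoring script like the one mentioned above can be sketched as follows. This is not the script used here: it swaps in the generic Linux `/proc/diskstats` counters instead of ZFS-specific statistics, and the function names are hypothetical.

```python
def completed_ios(diskstats_text: str, device: str) -> int:
    """Sum completed reads and writes for one device from /proc/diskstats text."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 8 and fields[2] == device:
            # After major, minor, and name: the 1st counter is reads
            # completed and the 5th counter is writes completed.
            return int(fields[3]) + int(fields[7])
    raise KeyError(f"{device} not found in diskstats")


def was_idle(before: str, after: str, device: str) -> bool:
    """True if no I/O completed on the device between two snapshots."""
    return completed_ios(before, device) == completed_ios(after, device)
```

A caller would read `/proc/diskstats` twice, some minutes apart, and only request a lower power state for a drive when `was_idle` returns True for it.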

Seagate / openSeaChest

EPC timers no longer working after long SMART test #126