Open luukrijnbende opened 1 year ago
We have an existing thread on a similar issue. Please have a look into that, it might help. https://github.com/Seagate/openSeaChest/issues/117
Thanks for your response. I've already looked at that thread and was thinking it could maybe be background activity related to the SMART test somehow, but there is no activity from the host so I don't know if there is something still running on the drives? Any way to check?
It seems odd that it magically starts working again after a certain number of hours. However, because of temperature and power constraints, I can't really let them run full tilt until the problem fixes itself.
Hi @luukrijnbende,
I have been asking around about this to try and get an idea what is happening.
There is not enough information to know for sure, but the best guess is that the SMART self-test (long DST) paused some background activity that normally runs based on timing, so once the drive finished the SMART self-test the drive has been trying to catch up and finish that background work when it can.
There are lots of different kinds of background tasks in the drive, some are run periodically in allowed power states (active, idle_a) and others get scheduled as needed when the drive is used (reads, writes, etc). Background activity can be scheduled if it is for health monitoring, performance, reliability, or data-integrity reasons, so it is also entirely possible that something else triggered it that may not even be related to the SMART self-test. Background activity can be paused for reads, writes, and even a long self-test since the background activity is considered lower-priority than servicing these host requested commands/operations. Once the drive has enough idle time between these requests, it will attempt to do these background tasks. Lower power modes that unload the heads (idle_b, idle_c, standby_y, standby_z) are not allowed to start background activity which will also pause anything the drive has scheduled to run in the background.
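To make the rule above easier to reason about, here is a minimal sketch of it as a lookup table. It only restates the description in this comment and is not actual firmware logic; the names `ALLOWS_BACKGROUND` and `can_run_background` are made up for illustration.

```python
# Illustrative model of the description above; NOT actual drive firmware logic.
# True means the state is listed as one where background activity may run.
ALLOWS_BACKGROUND = {
    "active": True,
    "idle_a": True,
    "idle_b": False,     # heads unloaded from here down
    "idle_c": False,
    "standby_y": False,
    "standby_z": False,
}


def can_run_background(power_state: str, host_busy: bool) -> bool:
    """Return True if background work could start, per the description above.

    host_busy is a hypothetical flag covering host reads, writes, or a
    running self-test, all of which take priority over background tasks.
    """
    if host_busy:
        return False
    return ALLOWS_BACKGROUND[power_state]


print(can_run_background("idle_a", host_busy=False))     # True
print(can_run_background("standby_z", host_busy=False))  # False
```

If this model is right, a drive that is manually held in standby_z for long stretches would make no progress on any pending background work during that time.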
@luukrijnbende I also hit this, after copying a few TB of data to some new Seagate IronWolf drives.
Did the issue eventually fix itself for you? (see my comment here)
I have a similar situation ... I ran an extended self-test that took 12h to complete and the disks are no longer going to idle. They've been in active mode for > 72h now, with no sign of them resuming normal EPC operation ...
@smunaut,
Sorry I did not see your comment sooner. Is this still an issue for you after a few more days? This seems like a lot of additional time to get back to "normal" again. Have you tried power cycling the drive (like shut down, then power back up after 30 seconds or so)? Can you also share your MN and FW revision?
And if anyone else has seen this issue, has it occurred after a short DST? Or only after a long DST?
They're still not going to sleep, but since I posted they have spent 12h a day (overnight) in standby_z (manually transitioning them). For the first 3 days I left them in active mode, hoping they would finish whatever they were busy with, but since then I have set up a script that puts them into standby_z manually for the night. (And they do stay in that mode, so the host is definitely not issuing any stray commands.)
Model Number: ST6000VN001-2BB186
Firmware Revision: SC60
I did reboot for sure. I think I did a full power off, but I'm not 100% sure anymore. I will retry that when I get the chance. (ATM the machine they are in is being used, so it'll probably have to wait until next weekend.)
I shut down the machine, even unplugged it from the wall and let it sit for a bit. Then I left the drives for 24h and they still don't go to sleep according to their EPC timers.
@smunaut, Thank you for this update. We have been trying to repeat it internally but have not been able to so far.
Would you mind telling me more about the system hardware? Anything you can share would be helpful for us to try and figure out what is causing this. Some things that are helpful are:
I know that not all of these things can be shared, however any additional system information that you can share could be helpful to see if it is something we can try repeating the issue on.
It's a small NAS exposing volumes through NFS and Samba.
idle_c between two polls (and then once it sees it's in idle_c, smartd stops polling). Just to make sure it wasn't the problem, I also tried setting a short timeout for idle_b / idle_c and stopping smartd, but that didn't help.
Some other info:
Thank you for that info!
The original issue also listed AMD hardware. I will see if I can repeat this on any AMD hardware...maybe it's related, maybe it's not.
Would you mind sharing the output of openSeaChest_Basics -d <handle> --llInfo?
This outputs some information like what was detected as the driver, the version of the driver, and if our code detected any filesystem that is mounted on this device, and some other info on how commands get routed through our utility.
B560 is an Intel Chipset :grin: (here running an i3-10105). Yeah B550 is AMD, B560 is Intel confusing I know ...
==========================================================================================
openSeaChest_PowerControl - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2023 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_PowerControl Version: 3.4.0-6_2_0 X86_64
Build Date: Dec 1 2023
Today: Mon Jan 29 18:00:14 2024 User: root
==========================================================================================
/dev/sg0 - ST6000VN001-2BB186 - ZR11NESD - SC60 - ATA
---Low Level tDevice information---
---Drive Info---
media type: HDD
drive type: ATA
interface type: IDE/ATA
zoned type: not zoned
---adapter info---
PCI/PCIe:
VendorID: 8086h
ProductID: 43D2h
Revision: 0011h
---driver info---
driver name: ahci
driver version string: 3.0
major ver: 3
minor ver: 0
---ata flags---
SCSI Version: 7
---Passthrough Hacks---
Passthrough type: SAT/system/none
---SCSI Hacks---
---NVMe Hacks---
---ATA Hacks---
---OS Info---
handle name: /dev/sg0
friendly name: sg0
minimum memory alignment: 8
---Linux Unique info---
FD is valid
Second Handle name: /dev/sda
Second Handle friendly name: sda
SG Driver Version:
Major: 3
Minor: 5
Revision: 36
OS read-write recommended: false
last recorded error: 22
File system Info:
No active file systems detected
B560 is an Intel Chipset 😁 (here running an i3-10105). Yeah B550 is AMD, B560 is Intel confusing I know ..
🤦 ... I have an MSI motherboard at home with an almost identical name: MSI MPG B550 Gaming Edge Wifi. Naming these things so similarly makes it difficult to keep track of them.
Anyways, thanks for the additional info. So it does not seem to be a hardware-specific issue, and the standard AHCI driver is in use, so it is not a driver-specific problem that I have heard of before. We will keep trying to repeat this issue and find the root cause.
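For anyone comparing their own --llInfo output, a rough way to pull out the adapter vendor and driver is sketched below. It assumes the plain "key: value" layout seen in the output posted above; `parse_ll_info` and the two-entry vendor table are illustrative helpers, not part of openSeaChest.

```python
def parse_ll_info(text: str) -> dict:
    """Collect 'key: value' pairs from --llInfo style text (first occurrence wins)."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # Skip section headers such as 'PCI/PCIe:' that carry no value.
        if sep and value and key not in info:
            info[key] = value
    return info


# A few lines copied from the output posted earlier in this thread.
sample = """\
PCI/PCIe:
VendorID: 8086h
ProductID: 43D2h
driver name: ahci
driver version string: 3.0
"""

info = parse_ll_info(sample)
# PCI vendor IDs: 8086h is Intel, 1022h is AMD.
vendor = {"8086h": "Intel", "1022h": "AMD"}.get(info["VendorID"], "unknown")
print(vendor, info["driver name"])  # Intel ahci
```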
I have noticed that the EPC timers have started working again. When this first started happening, I wrote a little script to monitor ZFS activity and put the drives into idle states if there is none. After a reboot that script didn't come up again, but EPC did work.
I have no idea what the trigger was, maybe they finished their background tasks? Though I did notice today that they were active for a few hours without activity and just now they returned to idle_c and are happy there.
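In case it helps anyone stuck in the same state, a workaround of this kind could look roughly like the sketch below. It is not the script mentioned above: it watches Linux `/proc/diskstats` counters instead of ZFS activity, and the `--transitionPower idle_c` command line is an assumption that should be checked against `openSeaChest_PowerControl --help` before use. The device and handle names are placeholders.

```python
import subprocess
import time


def io_counters(diskstats_text: str, device: str) -> tuple:
    """Return (reads completed, writes completed) for device from /proc/diskstats text."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major minor name reads_completed ... writes_completed ...
        if len(fields) >= 8 and fields[2] == device:
            return int(fields[3]), int(fields[7])
    raise KeyError(device)


def is_idle(before: str, after: str, device: str) -> bool:
    """True if no reads or writes completed between the two snapshots."""
    return io_counters(before, device) == io_counters(after, device)


def watch(device: str, handle: str, idle_seconds: int = 600) -> None:
    """Poll forever; request idle_c once per quiet period."""
    # ASSUMED command line; verify the option name with --help first.
    cmd = ["openSeaChest_PowerControl", "-d", handle, "--transitionPower", "idle_c"]
    with open("/proc/diskstats") as f:
        before = f.read()
    parked = False
    while True:
        time.sleep(idle_seconds)
        with open("/proc/diskstats") as f:
            after = f.read()
        if not is_idle(before, after, device):
            parked = False            # host I/O seen; re-arm
        elif not parked:
            subprocess.run(cmd, check=False)
            parked = True             # do not re-issue while still quiet
        before = after


# Example (placeholders): watch("sda", "/dev/sg0")
```

The `parked` flag is there so the transition is requested only once per quiet period, which avoids repeatedly sending commands to a drive that someone may have moved to a deeper state by hand.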
Hi, I have 4 Seagate Exos X18 18TB drives (ST18000NM000J-2TV103) where the EPC timers are no longer working after a long SMART test. I am able to transition power states manually and until I access the drives again they stay in that state.
Things already tried:
SN04
Below is the info of one of the drives: