Seagate / openSeaChest

Cross platform utilities useful for performing various operations on SATA, SAS, NVMe, and USB storage devices.

Exos X16 fails to change sector size on a Supermicro server #118

Open danderson opened 1 year ago

danderson commented 1 year ago

I'm doing initial setup on some new ST16000NM003G drives (Exos X16 16TB SATA). openSeaChest_Format -d /dev/sdi --showSupportedFormats says the drives support 4096-byte sectors, and are currently configured with 512-byte sectors. However, attempting to change the sector size fails with Set Sector Configuration Ext returning: ABORTED.
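
For anyone reproducing this on Linux, the kernel's view of the sector size can be cross-checked against what openSeaChest reports by reading sysfs. This is only a minimal sketch: the function name `sector_sizes` and the `sysfs_root` parameter are made up for illustration, and it assumes the standard `queue/logical_block_size` and `queue/physical_block_size` attributes under `/sys/block/<device>`.

```python
from pathlib import Path


def sector_sizes(device, sysfs_root="/sys/block"):
    """Return (logical, physical) sector sizes in bytes as seen by the kernel.

    `device` is the bare block device name, e.g. "sdi" (not "/dev/sdi").
    `sysfs_root` is a hypothetical parameter so the function can be tested
    against a fake directory tree.
    """
    queue = Path(sysfs_root) / device / "queue"
    logical = int((queue / "logical_block_size").read_text().strip())
    physical = int((queue / "physical_block_size").read_text().strip())
    return logical, physical
```

Calling `sector_sizes("sdi")` before and after a sector size change attempt shows whether the kernel sees a different logical size. A drive in 512e mode would typically report 512 logical and 4096 physical; after a successful switch to 4096-byte sectors both values would be expected to read 4096, possibly only after the device has been rescanned or re-plugged.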

Hardware-wise, the drive is connected to a Supermicro SSG-5028R-E1CR12LA-CE010 server. The device chain from CPU to drive is:

Searching the issue tracker, I believe I'm seeing exactly the same symptoms as https://github.com/Seagate/openSeaChest/issues/79 , although possibly with slightly different hardware (X10 motherboard instead of X11, but also an LSI/BCM 3008 HBA, and also a Supermicro server, so likely a similar backplane SAS expander).

I've attached the output of openSeaChest_Info -d /dev/sdi -i, openSeaChest_Format -d /dev/sdi --showSupportedFormats, and openSeaChest_Format -d /dev/sdi --setSectorSize=4096 --confirm this-will-erase-data-and-may-render-the-drive-inoperable.

sdi-info.txt sdi-supportedformats.txt sdi-format.txt

The linked issue has a workaround (execute the sector reconfig from a different system without all the LSI, Supermicro and SAS<>SATA stuff in the chain), so really I'm filing this issue to ask: is there any more data I could provide to give you more insight into this issue? Given that I can apparently reproduce it, and I'm going to be doing destructive burn-in on these drives for a few days, I can run debug commands and make invasive drive changes without harming data.

danderson commented 1 year ago

Reproducing relevant info from #79, so people don't have to go digging: in that bug the reporter had a Supermicro X11DPH-T motherboard, and the same Supermicro AOC-S3008L-L8e HBA as me. No info on the backplane in that bug, but given Supermicro's product lineup, it seems likely that it's the same expander backplane as my system, since those boards don't change much even between different server models.

danderson commented 1 year ago

One more datapoint: I moved one of the drives to an older Supermicro server with a SAS2 storage chain, and I was able to change the sector size there successfully. Listing the hardware in that server too, just in case the A/B datapoints help:

This server is a franken-machine assembled from a used chassis+backplane, motherboard and HBA. This is not a configuration sold by Supermicro directly (whereas the one in my original report, afaik, is).

vonericsen commented 12 months ago

Hi @danderson, Thanks for the logs, I will take a look and see if I find something else that might help track this down. While debugging #79, I asked Seagate's engineer who works with Supermicro to test the Supermicro hardware we have and he could not repeat it. Seagate's engineer asked Supermicro's lab to also see if they could repeat this issue, but we never got it to repeat with the same hardware that was reported in that issue...so we really do not know what the issue is.

danderson commented 11 months ago

Thanks for taking a look! I don't envy having to track this through all the layers to find where things are going wrong.

I filed this purely in case it provides additional clues, or if I can provide further data about the configuration that wasn't working. If that's not the case, then I'm happy to close this bug as there's only so much digging that's possible across multiple vendors like this.

vonericsen commented 11 months ago

I reviewed the logs and I cannot figure out what would be wrong right now. Everything is being populated in the command correctly according to the specifications.

I've asked to see if someone in Seagate's firmware group can help me understand the spec's abort reason "the device is unable to complete processing of the command" to see if that can help me track it back to a feature interaction or something else in the firmware that I may be able to control. The other cases for the command abort from the spec are not the issue since the fields are all being filled in properly (unless for some reason the HBA firmware is filtering them out on the bus, but you would need a bus trace to see this).
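
As background for reading the logs: an ATA command abort is normally reported as the ERR bit in the Status register together with the ABRT bit in the Error register. The sketch below decodes those two bytes. The bit masks follow the ATA Command Set register layout as I understand it, the function names are purely illustrative and not part of openSeaChest, and the example values are not taken from the attached logs.

```python
# Commonly defined bits of the ATA Status and Error registers
# (assumed from the ATA Command Set layout; not exhaustive).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode(value, names):
    """Return the names of all known bits that are set in `value`."""
    return [name for mask, name in names.items() if value & mask]


def is_command_aborted(status, error):
    """True when the drive reports an error and the abort bit is set."""
    return bool(status & 0x01) and bool(error & 0x04)
```

For example, `decode(0x51, STATUS_BITS)` and `decode(0x04, ERROR_BITS)` name the set bits for an illustrative status/error pair, and `is_command_aborted(0x51, 0x04)` reports whether that pair is an abort. This only identifies that the drive aborted the command; it says nothing about which of the spec's abort conditions applied, which is the open question here.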

The only other thing I can think of while I dig backwards: have you tried updating the HBA firmware? I'm not sure it will fix it, but updating HBA firmware sometimes resolves odd things like this. In #111, updating the HBA firmware resolved a strange bug where the drive was not going into the idle or standby modes like it should. Maybe something similar is going on here, causing the drive to think it cannot do the fast format right now because of some other bus activity from the HBA.