Seagate / ToolBin

All the great tools we have for the field.

seagateQuickFormat expected run time #23

Closed Martegnan closed 1 year ago

Martegnan commented 2 years ago

Hi,

We have two new 18TB drives that dropped out of setSectorSize, so I've been running seagateQuickFormat with the force flag to try to bring them back.

How long would you expect this format to take? From all descriptions I suspect relatively quickly and orders of magnitude less time than the week we've given them?

Thanks!

vonericsen commented 2 years ago

Hi @Martegnan, The quick format should only take a couple of minutes.

Can you confirm the tool and version you are using when they dropped out of the setSectorSize operation? Also, is this running in Linux or Windows?

Martegnan commented 2 years ago

Hi,

Thanks for confirming the expected run time.

The setSectorSize was run using SeaChest_Lite 1.5.0-2_2_3 X86_64 on Windows with the disks connected through an HBA. The command returned a successful response (hence running it on two disks before checking the result).

seagateQuickFormat was run using SeaChest_Format on Linux, booted from your USB boot maker, on separate machines with the disks connected directly to the chipsets. With that setup, though, I had little way of checking progress on the quick format beyond the disks seeming to be spinning but idle while the command runs. After cancelling, the disks are in the same state and subsequent runs appear to stall the same way.

With the power of hindsight I suspect the HBA the disks were connected to caused the drops when the command ran (helped by my naive approach of running the command with this setup), rather than the setSectorSize command itself. But I am naturally wary of running the command on the other six disks waiting to be processed.

Any further troubleshooting steps you can recommend?

vonericsen commented 2 years ago

Hi @Martegnan,

> The command returned a successful response (hence running it on two disks before checking the result).

The seagateQuickFormat should only be used if there was a failure from the setSectorSize option. Otherwise it is not needed. It should not appear to hang indefinitely, though, so that is odd.

If you run SeaChest_GenericTests -d <handle> --shortGeneric, does this complete without errors? If a sector size change operation is interrupted, this will likely return a failure at some point as it runs.

One of the ways I've noticed that can help with checking when to run the seagateQuickFormat is running -i to see if the feature Set Sector Configuration is listed or not. When I've seen problems with the OS or HBAs interrupting the sector size change, this disappears from the list of supported features, then running the seagateQuickFormat can correct it. The version you are running can automatically detect this condition and start the seagateQuickFormat to recover the drive to a working condition automatically.
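As a minimal sketch of the check described above, the saved text output of `-i` can be scanned for the Set Sector Configuration feature. The function name and the sample excerpts below are hypothetical; the real `-i` output is longer and laid out differently, and only the feature name itself comes from this thread.

```python
def set_sector_config_listed(info_output: str) -> bool:
    """Return True if the -i output mentions the Set Sector Configuration feature."""
    needle = "set sector configuration"
    return any(needle in line.lower() for line in info_output.splitlines())


# Hypothetical excerpts for illustration only; not real tool output.
healthy = "Features Supported:\n\tSMART\n\tSet Sector Configuration\n"
interrupted = "Features Supported:\n\tSMART\n"

print(set_sector_config_listed(healthy))      # True
print(set_sector_config_listed(interrupted))  # False
```

A result of False on a drive that previously listed the feature would match the interrupted state described above; it is a hint to investigate, not a diagnosis on its own.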

The other thing that can be helpful for debugging this from my side is to get the verbose output from the tool while it runs, but I understand not wanting to run a setSectorConfiguration command again. Could you do this for the seagateQuickFormat though? Example:

SeaChest_Lite -d <handle> --seagateQuickFormat -v 3 > quickFormatVerbose.txt

If this waits for longer than an hour, you can cancel it with ctrl+c, and that should still give me enough information to review.
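One way to automate the "capture the output, cancel after an hour" approach is a small wrapper that echoes the command's output to the console, writes it to a file, and sends an interrupt once a time limit passes. This is only a sketch under stated assumptions: `run_and_tee` is a hypothetical helper, it assumes a POSIX system (sending SIGINT this way is not supported on Windows), and the device handle in the usage comment is a placeholder.

```python
import signal
import subprocess
import sys
import threading


def run_and_tee(cmd, log_path, timeout_s):
    """Run cmd, copying its output to the console and to log_path.

    After timeout_s seconds the process is sent SIGINT (the equivalent of
    ctrl+c). Returns the process exit code.
    """
    with open(log_path, "w") as log:
        with subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        ) as proc:
            # The timer fires only if the command is still running at the deadline.
            timer = threading.Timer(timeout_s, proc.send_signal, args=(signal.SIGINT,))
            timer.start()
            try:
                for line in proc.stdout:
                    sys.stdout.write(line)
                    log.write(line)
                    log.flush()  # keep the file current in case of a hard stop
            finally:
                timer.cancel()
            return proc.wait()


# Usage (placeholder handle; command taken from the example above):
# run_and_tee(
#     ["SeaChest_Lite", "-d", "/dev/sg2", "--seagateQuickFormat", "-v", "3"],
#     "quickFormatVerbose.txt",
#     timeout_s=3600,
# )
```

Flushing after every line means the log still holds everything printed so far if the run has to be interrupted.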

Unfortunately, the way the quick format and set sector size commands are defined, there is no progress indication for them. They are complete when the disk returns a completion status and busy until that happens.

Since I have not been able to recreate this kind of problem on the hardware I have been testing with, can you share which HBA you are using and which motherboard or the motherboard chipset?

We have been working towards a new version with a couple extra enhancements to make sure all partitions on the drive are unmounted first. This has helped with erasing data, but it should also help prevent these kinds of issues with setSectorSize as well. This version is still being tested to ensure it works well before we publish it, but these changes are already available in the develop branch of openSeaChest today.

Martegnan commented 2 years ago

Hi @vonericsen,

Thanks for your time with this.

> The seagateQuickFormat should only be used if there was a failure from the setSectorSize option [...] running -i to see if the feature Set Sector Configuration is listed or not.

Yes, before first running seagateQuickFormat I confirmed that the Set Sector Configuration feature was missing in the -i output (in line with this the output from showSupportedFormats only listed 512 for these two drives). Indeed interesting that setting sector size returned a success message when it is meant to detect an error of this kind – if I find an old drive that supports changing sector sizes I can run it through the command under the same conditions and send you the output. May take some time.

shortGeneric does error out:

Starting short generic test.
Sequential Read Test at OD for 351566561 LBAs

Reading LBA: 0                   
Reading LBA: 0                   
Reading LBA: 0                   
Read failed within OD sequential read
Short generic test failed!
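For anyone checking several drives, a log like the one above can be summarized with a few string matches. This is a rough sketch: `parse_short_generic` is a hypothetical helper, and the patterns are taken only from the output quoted here, so other tool versions may word these lines differently.

```python
import re


def parse_short_generic(output: str) -> dict:
    """Summarize a saved --shortGeneric log using the strings quoted above."""
    planned = re.search(r"Sequential Read Test at OD for (\d+) LBAs", output)
    return {
        "od_lbas": int(planned.group(1)) if planned else None,
        "read_failed": "Read failed" in output,
        "test_failed": "Short generic test failed!" in output,
    }


log = """Starting short generic test.
Sequential Read Test at OD for 351566561 LBAs
Reading LBA: 0
Read failed within OD sequential read
Short generic test failed!
"""

print(parse_short_generic(log))
# {'od_lbas': 351566561, 'read_failed': True, 'test_failed': True}
```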

Interestingly, seagateQuickFormat returned successfully for one of the drives while collecting the verbose output. It did not, however, succeed with a subsequent overwrite command as per output instructions, now lists 512 for physical sector size as well, and shortGeneric results are the same. I attempted another seagateQuickFormat which then returned as failed.

For the other drive seagateQuickFormat continued to run until manually interrupted. I used tee instead of > so could see that the output stopped for a good while with the disk seemingly spinning but idle.

Here is the output of both commands:

Disk 1 - seagateQuickFormat - successish.txt
Disk 2 - seagateQuickFormat - stalled.txt

The HBA is an Adaptec 1100 8i, and the older and repurposed system has an Intel C600 chipset. There's an IBM backplane with a modest expander in the mix as well that also deserves some suspicion. A bit of a jumble.

> We have been working towards a new version with a couple extra enhancements to make sure all partitions on the drive are unmounted first.

In this case the disks were both still raw, so I expect interference of the kind these enhancements are addressing could not have occurred.

vonericsen commented 2 years ago

Hi @Martegnan, Thanks for the additional information. I'm reaching out to some people internally that may be better able to help me debug this since I am not sure why this is happening at this point and have not been able to recreate the problem myself.

I am investigating if the "stalled" quick format is a software bug as it appears to have returned a final status from the command, so it was not waiting on the drive at the point that I see in the file. This bug would not likely affect getting the quick format to complete. The "unaligned write command" status that is shown also does not make sense in this case. Assuming this was Linux's libata layer on the C600 chipset, I have looked into it before and it seems to be a default error condition when nothing else could be determined, so I think it's safe to say that the quick format was not returning success here.

> The HBA is an Adaptec 1100 8i, and the older and repurposed system has an Intel C600 chipset. There's an IBM backplane with a modest expander in the mix as well that also deserves some suspicion. A bit of a jumble.

I am not aware of issues with these HBAs or hardware combinations, but I will check what hardware we have around that is as close as possible to test with. I have heard of interposers occasionally causing problems, but have not had any issues related to their use in a few years now. When these issues were present, the problem was being unable to issue any ATA passthrough commands, but that does not appear to be the issue in your case, so I do not think this would be related.

> In this case the disks were both still raw, so I expect interference of the kind these enhancements are addressing could not have occurred.

I agree. I do not think these additional changes would affect this case.

Do not hesitate to reach out to Seagate support while I continue investigating this issue to see what support may be able to do for your drives that have not recovered with the seagateQuickFormat. If you decide to contact Seagate support, please update this issue to indicate that you have contacted support so that I know how to keep this issue updated. I will keep this open until we can determine what the underlying cause of the problem is.

Martegnan commented 2 years ago

Hi @vonericsen,

Apologies, I was unclear in my last message about what hardware was used for which of the two issues.

I just identified the source of the original drops: The management module on the server (IBM IMM2) was interfering with the backplane, dropping disks independently of the HBA. Not sure why IBM configured the management module to relate to the disks at all when the server is set up for a RAID or HBA card – seems like too many cooks. I cannot see what specifically the module reacted to, only that they were considered faulty and were then disabled. Probably just some periodic or continuous health check and an aggressive reaction while the drive was busy as a protective reaction. I assume that such a reaction could cut data, power, or both to the drive, if that has bearing on the state of the drive after the drop and what that state could mean for the subsequent seagateQuickFormat.

In any case, with this confirmed, I have changed the sector size for the other six disks using the C226 machine without issues.

Thanks for your help looking into this. I will replace the two drives and initiate contact with Seagate support in case they can be brought back to life as spares. In the meantime, just let me know if there's anything you'd like me to try to figure out the seagateQuickFormat stalling.

vonericsen commented 2 years ago

Hi @Martegnan,

Thank you for the update! I'm glad you were able to change the sector size on your other drives without an issue.

That is very useful information. I think you are correct that the management module was trying to do a periodic check of some kind and could not complete it, then it interrupted the drive either by removing power or sending a bus reset of some kind.

Do you think adding a warning or more text to the tool saying something like this would help?

WARNING: Please ensure any management modules are disabled/removed as they may interrupt the set sector size process.

If this can be reworded better, please let me know, just looking for something that may help other customers if they are running a similar hardware configuration.

I am working with some teams internally to figure out what went wrong with the quick format or if there is something else that must be done so that the drives can be more easily recovered without returning to Seagate when this kind of issue happens.

Martegnan commented 2 years ago

For the warning, I think 'management module' is more an IBM-specific term and part of their IMM/IMM2 system name (now belonging to Lenovo after IBM sold their server division some years back). A more general term that should include systems like iDRAC and iLO could be 'out-of-band management system', though there may be better ones. I am not familiar enough with these systems to know whether others are prone to overriding controllers and their disk connections (nor how many sysadmins will be as naive as me and are in need of a warning), but a general term is probably more suitable?

I decided to test out an additional last-ditch step with these drives this afternoon:

And the drives now seem to be happily overwriting using SeaChest_Erase --overwrite 0 at what, from what I can tell, is the expected full speed. We'll see when they're done if they pass what will need to be quite rigorous testing, but this suddenly looks promising.

vonericsen commented 2 years ago

I think out of band management system may be a good generic term. I do not know of a better one, but have heard of this plenty of times and it is used in some specifications to refer to these kinds of systems.

I think the number of people who know when these issues could happen is extremely low. Outside of someone directly involved in storage hardware, I do not expect many people to know that this could potentially cause a problem, so a good warning is a first step I would like to take to add to the software.

Since different out-of-band management systems could work differently, maybe a warning like this is more appropriate:

WARNING: Some out-of-band management systems may interrupt a fast format/set sector size operation. If possible, disable or remove them before performing this operation.

Thanks for the information on the firmware download and retrying setting the sector size. We will do some testing to see if this is a good, repeatable process to keep since the quick format does not seem to be resolving the issue in your case.

vonericsen commented 1 year ago

We have added more warnings about setSectorSize/fast format, and I refactored the code for the automatic quickFormat that runs if an error does occur, to try to recover from any issues that can happen. This is all part of the new versions I have uploaded today, and the changes are also present in openSeaChest v23.03.

I have not been able to find any other possible solutions to the OS, HBA, driver, or management system issuing resets while the fast format is running, so the warnings in the tool and the new confirmation of this risk are the best we can do for the moment.

Please feel free to reopen this issue if there is something else we can revisit about this issue.