Seagate / openSeaChest

Cross platform utilities useful for performing various operations on SATA, SAS, NVMe, and USB storage devices.

I have got dual ST8000NM000A-2KE101 - they have 0 bad sectors and errors but Raid 1 keep getting degraded - Intel® Optane™ Memory #123

Open FurkanGozukara opened 1 year ago

FurkanGozukara commented 1 year ago

I can't solve this error

The disks are genuine. I checked the QR code on them

The RAID 1 keeps getting degraded. Then I reset the disks to non-RAID and ran a full check with chkdsk /f /r /x

0 Errors found

Then I did a byte-by-byte comparison of every file on each disk, and they are exactly identical

But the RAID 1 keeps getting degraded

How can I debug this issue?

Here are my disks and drivers

I am using Windows 10

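[image] https://user-images.githubusercontent.com/19240467/270498861-69eaf9bc-644b-44ec-83ff-7949780cc3f3.png

[image] https://user-images.githubusercontent.com/19240467/270498610-1b41c8ba-da47-4ff1-9456-dcfec77ffda9.png

[image] https://user-images.githubusercontent.com/19240467/270498637-e3b09929-444c-4211-bb9d-1e682549f88f.png

[image] https://user-images.githubusercontent.com/19240467/270498667-d2c08980-8532-4687-8b46-cab5545c8417.png

[image] https://user-images.githubusercontent.com/19240467/270498698-57268a7d-62e6-466f-b730-a8710088934d.png

[image] https://user-images.githubusercontent.com/19240467/270498734-9f7c2667-2225-47fb-a820-cb7c94a3c199.png

[image] https://user-images.githubusercontent.com/19240467/270498800-ee8d0389-9339-44f4-b697-5c96c177b850.png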

Swiss3003 commented 1 year ago

Furkan, I looked at the pics and saw that all the SMART data looks good and is showing no errors. I did see that the drive was running at 6 Gb/s; does the controller do 12 Gb/s? The other types of errors can be "end-to-end CRC error" or phy layer errors. Maybe look for these types of errors. Also try a new data cable to the drive, and make sure it's not close to something that can add noise into the cable.

Try using SeaChest_SMART to pull the device statistics log. There should be more information in that log. I think smartctl has that option as well.

SeaChest_SMART -d /dev/sg<#> --deviceStatistics

Maybe also run DST on the drive to make sure the drive is healthy.

SeaChest_SMART -d /dev/sg<#> --shortDST --captive

SeaChest_SMART -d /dev/sg<#> --showDSTLog

Also look in the comprehensive error log and see if you see any errors.

SeaChest_SMART -d /dev/sg<#> --showSMARTErrorLog comprehensive

SeaChest_SMART -d /dev/sg<#> --showSMARTErrorLog summary --smartErrorLogFormat raw

I think those are all the ideas I can think of today.

Tim Gilmer Staff Engineer Field Diags Office: (720)-684-2624 Seagate Technology

FurkanGozukara commented 1 year ago

Thanks for the answers @Swiss3003

Here are some results. Any ideas?

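[image] https://user-images.githubusercontent.com/19240467/270714230-b0a69811-22e3-4a32-9dbd-fb7b1b055f80.png

[image] https://user-images.githubusercontent.com/19240467/270714318-761fd2f4-884e-416e-b228-6bb35d19d4c8.png

[image] https://user-images.githubusercontent.com/19240467/270714476-4b48a3ed-624c-4fbb-a8ed-470dda334e94.png

[image] https://user-images.githubusercontent.com/19240467/270714527-6ba2e383-5ac3-4bd4-9225-6bd17076bdfe.png

[image] https://user-images.githubusercontent.com/19240467/270714741-e351dc56-a439-423c-bdb4-f4fc27de6118.png

[image] https://user-images.githubusercontent.com/19240467/270714796-5224feff-8ce1-4f27-82d0-48157c02ff54.png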

Swiss3003 commented 1 year ago

So, if DST passes then the drives are fine. The only thing I see is a very high number of resets on the interface side. It also looks like you are seeing CRC errors from the interface. I would still change out the data cable and start looking at the controller card for any issues.

FurkanGozukara commented 1 year ago

@Swiss3003 thanks

I changed both ports and cables (brand new). Testing again right now

Only DST left

What do CRC errors mean?

Also, which controller card would you suggest for RAID 1?

Swiss3003 commented 1 year ago

Sounds good. DST shouldn't take long. Then pull the log, and that will tell you whether you have a bad drive or not.

CRC - a good way to think of it is as a checksum created by the host. The data is then pushed to the device (through the card and the data cable; there is more to the stack than just those two). The device gets the data and creates its own checksum, and the two checksums are compared to make sure they are the same. If they are not the same, you get a CRC error (that is the simple version). The CRC check was created a long time ago for IDE because of the noise issues seen in the data cable and in the chipsets of cheap controllers. As the signal got faster and sharper, the CRC check became more important. CRC also does more today than it used to: it is used in error recovery for the data, helps with the phy signals and words, and helps with booting drives for the OS, just to name a few.
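As a rough illustration of the checksum comparison described above, here is a minimal Python sketch. It is not how the SATA hardware computes its CRC: `zlib.crc32` is only a stand-in, and the function names `send`, `flip_bit`, and `receive` are hypothetical. The point is just that both sides compute a checksum over the same bytes, and a single flipped bit in transit makes the two values disagree. A mismatch only detects the corruption; the transfer then typically has to be retried.

```python
import zlib


def send(payload):
    """Host side: return the payload together with its checksum."""
    return payload, zlib.crc32(payload)


def flip_bit(payload, bit_index):
    """Simulate cable noise by flipping one bit of the payload."""
    data = bytearray(payload)
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(data)


def receive(payload, sender_crc):
    """Device side: recompute the checksum and compare it with the sender's."""
    return zlib.crc32(payload) == sender_crc


payload, crc = send(b"example sector data")
print("clean transfer ok:", receive(payload, crc))
print("noisy transfer ok:", receive(flip_bit(payload, 3), crc))
```

A rising CRC error count therefore points at the path between host and drive (cable, connectors, controller) rather than at the data stored on the platters.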

Cards - hmmm. I don't see a lot of different ones anymore. It seems like I always have an LSI or Broadcom around to test with. I do have a few Broadcom 95** MegaRAID cards that seem to work nicely. Hope that helps.

FurkanGozukara commented 1 year ago

@Swiss3003 thank you so much for the answers

I did the DST tests and no errors were shown. So what do you think? By the way, I am currently on a new cable and a new port. I haven't tested RAID 1 yet

Results are below

disk 1

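[image] https://user-images.githubusercontent.com/19240467/271118137-e7faa9b7-c490-4d4b-8a2d-cd3969dd7e3b.png

disk 2

[image] https://user-images.githubusercontent.com/19240467/271118628-76f0c7e3-981c-4145-ac1f-250a05620825.png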

Swiss3003 commented 1 year ago

I like the new cable and new port. Give it a try; the drives passed.

Watch the resets and the CRC errors.
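One way to "watch" these counters is to note their values before and after a test run and look at what moved. The sketch below assumes you have copied the numbers by hand from the SMART output into two dictionaries; the counter names and the values are made up for illustration and are not taken from the screenshots in this thread.

```python
def counter_increases(before, after):
    """Return only the counters that went up between two snapshots."""
    increases = {}
    for name, new_value in after.items():
        delta = new_value - before.get(name, 0)
        if delta > 0:
            increases[name] = delta
    return increases


# Hypothetical snapshots, typed in by hand from two SMART readouts.
before_test = {"crc_errors": 12, "hard_resets": 40, "command_timeouts": 3}
after_test = {"crc_errors": 12, "hard_resets": 43, "command_timeouts": 3}

print(counter_increases(before_test, after_test))
```

If nothing increases across a long copy or benchmark with the new cable and port, the interface is probably no longer the problem.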

FurkanGozukara commented 1 year ago

Thank you @Swiss3003, but the disks are still not working properly. Sometimes they just freeze for too long

I plan to purchase this RAID card and test it. What do you think?

Marvell® 88SE9230

Look at this image; it's not normal at all. The disk is just 100% active with no reads or writes

image

And here is another of my drives, a regular SATA 3 HDD

The other HDD immediately starts reading and writing, as it is supposed to

image

FurkanGozukara commented 1 year ago

Yes, it has been about 10 minutes and it is still like this

how can I debug this issue?

image

FurkanGozukara commented 1 year ago

These are the 2 disks

I have 3 NVMe disks slotted on my motherboard. Could that be the reason?

It takes more than 20 minutes for the benchmark to start

image

FurkanGozukara commented 1 year ago

Hello again

Any ideas how to debug this issue?

Swiss3003 commented 1 year ago

Well, if you have a slow-running drive, you have all kinds of options. These are some of the top ones:

1. New data cable
2. New card
3. Defrag the drive
4. Make sure you don't have a virus
5. Check and repair the partition
6. Fix the format

For the last four, there are a lot of tools to perform these tasks. But sometimes on older or used drives you might want to low-level the drive to clean it up. You can do this through an erase of the drive (ATA security erase). It writes a pattern to every LBA on the drive, cleaning it up. After that, give the drive time to perform "background" tasks. Then format the drive again and write the data to the drive again.

Now, I'm still thinking you are seeing something in the interface on U:. I bet the drive has dropped down to SATA300 again. Check the CRC error count and see if that number has increased. I also think that the command timeout count has increased. Check those numbers. I think you saw the drop in the phy speed in your trace. If you swap in a different drive and the problem goes away, replace the drive.

Now drive D:. I really don't see anything standing out on the drive, but it's running slow. No CRC errors, no big number of timeouts, almost no ECC errors. It seems like it's still running at SATA600 in the SMART pic you sent in. If it's busy all the time, then it might need to perform some background tasks. You can put the drive into idle for a few hours and see if it helps. Otherwise, I would low-level the drive (ATA security erase). If it's still slow no matter what, try replacing the drive.

The last thing to look into would be vibration within the system. A fan or the power system is sometimes the source of vibration. I didn't see anything in the SMART data that would point to vibration. That's why I suggested looking at the data cable for noise, because of the CRC errors and because U: was running at SATA300 vs SATA600. Both drives did pass DST, so I really don't think it's a vibration issue and I don't think the drives are bad. I think your clues are with the U: drive. If you did replace the cable and the card, it could be the drive. The best thing I could offer is pulling logs and doing a first-level FA on the drive to see. Is that something you would want to try?

DebabrataSTX commented 1 year ago

@FurkanGozukara, you seem to have a software RAID 1, so it might be a good idea to look into this issue from the RAID initiator's perspective. After all, the RAID logic is making the decision to degrade the volume. There must be a trigger for that. If we can find that trigger, we should be able to narrow down our search. I would start by looking at "Computer Management" -> "Event Viewer". [image: Capture1] We need to look for the event for the volume degradation. Once you find it, look for the drive events just before it. In both cases, read the event description to get a better understanding.
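The "find the degradation event, then look at what came just before it" step can be sketched in Python, assuming the events have been exported from Event Viewer and loaded into a list of records. Everything here is hypothetical: the record layout (`time`, `source`, `message`), the helper name `events_leading_up_to`, the five-minute window, and the sample entries, which are made up and do not reproduce real Windows or Intel driver messages.

```python
from datetime import datetime, timedelta


def events_leading_up_to(events, trigger_text, window=timedelta(minutes=5)):
    """Return events logged within `window` before the first event whose
    message contains `trigger_text` (case-insensitive)."""
    ordered = sorted(events, key=lambda e: e["time"])
    for event in ordered:
        if trigger_text.lower() in event["message"].lower():
            start = event["time"] - window
            return [e for e in ordered if start <= e["time"] < event["time"]]
    return []  # no matching trigger event found


# Made-up sample records standing in for exported Event Viewer entries.
sample_events = [
    {"time": datetime(2023, 9, 25, 9, 0), "source": "disk", "message": "example: unrelated drive event"},
    {"time": datetime(2023, 9, 25, 10, 0), "source": "disk", "message": "example: link reset"},
    {"time": datetime(2023, 9, 25, 10, 3), "source": "disk", "message": "example: link reset"},
    {"time": datetime(2023, 9, 25, 10, 4), "source": "raid", "message": "example: volume degraded"},
]

for entry in events_leading_up_to(sample_events, "degraded"):
    print(entry["time"], entry["source"], entry["message"])
```

The drive events that show up in that window are the candidates for the trigger that made the RAID logic degrade the volume.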

FurkanGozukara commented 1 year ago

@DebabrataSTX thank you so much

Currently the disks are non-RAID, but the speed is terrible for some reason that I can't find. Normally I use Intel RAID manager to make my RAID 1

I changed cables as @Swiss3003 said

I have another HDD in the same area, in their previous slot, working perfectly fine. This also rules out many of the other items @Swiss3003 listed

Interestingly, I did another test today and the results are different than before

Still very slow compared to what it should be, but it didn't wait forever (tens of minutes) to start reading and writing like before

I made this change recently

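[image] https://user-images.githubusercontent.com/19240467/272215195-98c13551-fb88-48a7-9f4b-85dbd8fab8d4.png

Disk K is ST4000NM0024-1HT178. Disks U and D are ST8000NM000A-2KE101

[image] https://user-images.githubusercontent.com/19240467/272215660-749834b6-0aaf-484a-8f5b-458020973f4a.png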

Swiss3003 commented 1 year ago

Furkan, as long as you are not still seeing those hard resets and CRC errors tick up, the only things I can come up with for slow drives are to erase the drive, or to let the drive sit with only power for two days and then try it again. You could let the drive sit for two days with no data cable in it and see if the drive runs faster.

See, the drives have been doing error recovery for some time. We know that from the CRC errors, hard resets, and timeout counts in the SMART data. Also, the drives have spent little time in "idle mode". Idle mode is when a drive tries to do self-repair and DOS. These are called background tasks. These tasks take time, and with the error recovery that the drive was doing, they could be backed up and running behind. So, if you let the drive go into idle for a day or so, it could finish all the background tasks and error recoveries. This would help with the speed of coming out of idle, and could speed up the drive and keep the phy running at SATA600 and not dropping down to SATA300. Like I said, it "could" help.

The erase would also help with this. An erase on the drive would self-clean all the LBAs, set the drive back to factory state, and clean all the glist and plist on the drive, returning the drive to a healthy state and clearing all the cache on the drive. But the SMART logs showed you had no glist or plist entries and DST passed. So the only thing the erase would really do is clear all the background tasks and reallocate any bad sectors on the drive.

The only other thing you can do is look at the settings on the drive. Just use the -i option in the tools and it will print out all the features on the drive.

To me, the key is that the OS is dropping the drive down to SATA300. The OS is seeing errors, and it is slowing the phy down to keep the signals clean, and therefore it is not seeing the errors anymore. Otherwise it would have slowed it down even more, to SATA150.
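For reference, the SATA150/SATA300/SATA600 names used above correspond to the 1.5, 3, and 6 Gb/s link speeds that tools usually display. A small worked calculation, assuming the 8b/10b line encoding that SATA uses (every 10 bits on the wire carry 8 bits of data); the function name is hypothetical:

```python
# Nominal SATA line rates in gigabits per second.
LINE_RATE_GBPS = {"SATA150": 1.5, "SATA300": 3.0, "SATA600": 6.0}


def payload_mb_per_s(line_rate_gbps):
    """Usable bandwidth in MB/s after 8b/10b encoding overhead."""
    data_bits_per_s = line_rate_gbps * 1e9 * 8 / 10  # 10 line bits carry 8 data bits
    return data_bits_per_s / 8 / 1e6  # bits -> bytes -> megabytes


for name, rate in LINE_RATE_GBPS.items():
    print(name, rate, "Gb/s ->", round(payload_mb_per_s(rate)), "MB/s")
```

So a link reported as 6 Gb/s is running at SATA600, and a drop to 3 Gb/s halves the ceiling on the interface, although a hard drive's sustained transfer rate is normally well below either limit.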

So, you might want to talk to the call center for more help. I'm sure they know more than me. Good luck. Slow drives are hard to figure out.

FurkanGozukara commented 1 year ago

@Swiss3003 thank you so much for the reply

They have been idle the whole time since I opened this thread

But I didn't remove the cable

I just did a test and it is still slow, but at least it starts working right away

image

Also, Intel RAID manager is displaying 6 Gb/s, so I think that's SATA 3?

image

Also, what technique do you suggest to erase the disks? Self-clean?

The disks are very new, but I want to give it a try

DebabrataSTX commented 12 months ago

@FurkanGozukara , Just to add, you can download SeaChest_Erase from https://www.seagate.com/content/dam/seagate/migrated-assets/old-support-files/seachest/SeaChestUtilities.zip

FurkanGozukara commented 12 months ago

@FurkanGozukara , Just to add, you can download SeaChest_Erase from https://www.seagate.com/content/dam/seagate/migrated-assets/old-support-files/seachest/SeaChestUtilities.zip

I have it. Which command?

DebabrataSTX commented 12 months ago

Can you summarize exactly what operations/functions you want to perform on the drive? That way it will be easy for me to send the right command(s).

FurkanGozukara commented 12 months ago

An erase on the drive would self-clean all the LBAs, set the drive back to factory state, and clean all the glist and plist on the drive, returning the drive to a healthy state.

I want to do what @Swiss3003 said

DebabrataSTX commented 12 months ago

Try SeaChest_Erase -d /dev/sg<#> --sanitize overwrite --poll

FurkanGozukara commented 11 months ago

Try SeaChest_Erase -d /dev/sg<#> --sanitize overwrite --poll

openSeaChest_Erase.exe -d PD3 --sanitize overwrite --poll --confirm this-will-erase-data

Thanks, I started it. Let's see what happens

image

After this, which command do you suggest to check the full health of the drive? Please give me the command, thank you

FurkanGozukara commented 11 months ago

After sanitize, the disk is empty

I will also post a screenshot after I have copied all the files to the disk, hopefully

image

And which command should I use to check disk health?

FurkanGozukara commented 11 months ago

OK, after copying the data, here is the new speed

The right disk is sanitized + data copied

The left disk is not sanitized yet

image

FurkanGozukara commented 11 months ago

Sanitized the second disk too. Now I will copy the data and test again

image

FurkanGozukara commented 11 months ago

I built the RAID 1

Here are the latest results. I hope I don't get any errors

image