baruch / diskscan

Scan a disk for bad or near-failure sectors and perform disk diagnostics
GNU General Public License v3.0

Fatal error occurred, bailing out #49

Closed: micah closed this issue 9 years ago

micah commented 9 years ago

Hi,

I have two disks, on two different systems, that are reporting a Current_Pending_Sector count of 1 in SMART, so I decided to try your tool to fix it, but it seems to have failed :(

root@raggiana-pn:~# diskscan -vf /dev/sdd
diskscan version HEAD

V: Verbosity set
I: Validating path /dev/sdd
I: Opened disk /dev/sdd sector size 512 num bytes 1000204885504
I: Scanning disk /dev/sdd in 65536 byte steps
I: Scan started at: Mon Nov  2 09:01:27 2015

V: Scanning stride starting at 0 done 0%
V: Scanning stride starting at 14288641536 done 1%
V: Scanning stride starting at 28577283072 done 2%
V: Scanning stride starting at 42865924608 done 4%
V: Scanning stride starting at 57154566144 done 5%
V: Scanning stride starting at 71443207680 done 7%
V: Scanning stride starting at 85731849216 done 8%
V: Scanning stride starting at 100020490752 done 10%
V: Scanning stride starting at 114309132288 done 11%
V: Scanning stride starting at 128597773824 done 12%
V: Scanning stride starting at 142886415360 done 14%
V: Scanning stride starting at 157175056896 done 15%
V: Scanning stride starting at 171463698432 done 17%
V: Scanning stride starting at 185752339968 done 18%
V: Scanning stride starting at 200040981504 done 20%
V: Scanning stride starting at 214329623040 done 21%
V: Scanning stride starting at 228618264576 done 22%
V: Scanning stride starting at 242906906112 done 24%
V: Scanning stride starting at 257195547648 done 25%
V: Scanning stride starting at 271484189184 done 27%
V: Scanning stride starting at 285772830720 done 28%
V: Scanning stride starting at 300061472256 done 30%
V: Scanning stride starting at 314350113792 done 31%
E: IO failed with no sense: status=2 mask=1 driver=8 msg=0 host=0
E: Error when reading at offset 318513026048 size 65536 read 65536: Success
E: Details: error=fatal data=full 00/00/00
E: Fatal error occurred, bailing out.
Access time histogram:
       1: 4620142
      10: 239740
     100: 187
     500: 68
    1000: 0
    2000: 0
    3000: 0
    4000: 0
    5000: 0
    6000: 0
    7000: 0
    8000: 0
    9000: 0
   10000: 0
   15000: 0
   20000: 0
   25000: 0
   30000: 0
above that: 0
  300 |
      |   ^     ^     ^  ^ ^^
      |      ^     ^
      |
      |
  250 |
      |
      |
      |          ^
      |                 ^
  200 |
      |
      |
      |
      |
  150 |
      |
      |
      |
      |
  100 |
      |
      |
      |
      |
   50 |           ^
      |                       ^
      | ^^ ^^ ^^    ^^ ^  ^  ^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      | **********************************************************************
      | ______________________________________________________________________
      +-----------------------------------------------------------------------
Conclusion: failed
I: Scan ended at: Mon Nov  2 09:40:16 2015

I: Scan took 2329 second
I: Closed disk /dev/sdd
root@raggiana-pn:~# 
baruch commented 9 years ago

Can you try to run it with -r and see if there are more details on the error?

Also, what do you have in dmesg after the test? Are there any OS-logged errors?

micah commented 9 years ago

Thanks for the quick response Baruch!

A couple of possibilities here:

  1. I was running 0.14.1-6 from jessie... so I've just built and am now running the new 0.17 release, and will report back when it has finished... it's already found the uncorrected block and successfully remapped it, clearing the pending sector count... that is further than it got before.
  2. Because diskscan failed, I decided to try running a badblocks destructive write test to see if it would map out the sector, but when I ran it, it complained:

/dev/sdd is apparently in use by the system; it's not safe to run badblocks!

and would not run... I had failed that device out of the RAID array and confirmed it wasn't in swap, mounted anywhere, part of any LV, or held by an open device-mapper mapping... but then I figured out that I had to do more than just fail it in the RAID array; I also needed to remove it (mdadm --manage /dev/md4 --remove /dev/sdd1), and then badblocks would run.

So, it's possible that diskscan failed because the device was open? If so, maybe a check should be added to diskscan to verify the device isn't in use before allowing you to continue with the fix option? The other possibility is that it failed simply because it was an older version of diskscan!

Since we are talking about this: when you pass the -f option to have diskscan fix the problems, it effectively writes to that block to force the drive to reallocate it, right? I'm just trying to get an idea of how destructive this option is (compared to a badblocks destructive write test); it seems like the damage would be isolated to that specific spot and shouldn't go beyond what the disk failure has already done?

baruch commented 9 years ago

There was such a bug in older versions; it should work with 0.17.

The recovery works for correctable errors by reading and rewriting the data; for uncorrectable errors it will just write zeros over them. Currently it works on a 64k block, but I want to make it more granular for uncorrectable errors so we don't zero data that is still readable.
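
To make the destructiveness question above concrete, here is a minimal sketch of that strategy, with stated assumptions: fix_block and the command-line wrapper are hypothetical names, plain pread/pwrite stand in for whatever I/O path diskscan actually uses, and alignment/O_DIRECT concerns are ignored.

/* Hypothetical sketch (not diskscan's actual code) of the fix strategy
 * described above, on one 64 KiB block:
 *   - if the block reads back fine (correctable case), rewrite the same
 *     data in place so the drive can refresh or reallocate the sectors;
 *   - if the read fails (uncorrectable case), the data is already lost,
 *     so overwrite the block with zeros to force reallocation. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define FIX_BLOCK_SIZE (64 * 1024)

static int fix_block(int fd, off_t offset)
{
    uint8_t buf[FIX_BLOCK_SIZE];

    if (pread(fd, buf, sizeof(buf), offset) != (ssize_t)sizeof(buf))
        memset(buf, 0, sizeof(buf)); /* unreadable: zero the whole block */

    /* The write is what triggers the drive's sector reallocation. */
    if (pwrite(fd, buf, sizeof(buf), offset) != (ssize_t)sizeof(buf))
        return -1;
    return fsync(fd);
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <device> <byte-offset>\n", argv[0]);
        return 2;
    }
    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror(argv[1]);
        return 1;
    }
    int rc = fix_block(fd, (off_t)strtoll(argv[2], NULL, 0));
    close(fd);
    return rc == 0 ? 0 : 1;
}

Note that this coarse version zeros the entire 64k block whenever the read fails; the finer granularity mentioned above would retry sector by sector so only the genuinely unreadable sectors get zeroed.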

I also need to add a check that verifies the partition is not in use. It's another ticket I have logged for myself but haven't implemented yet.
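
For that in-use check, here is a minimal sketch of one way to do it on Linux, assuming the kernel's exclusive-open convention for block devices: open(2) with O_EXCL fails with EBUSY while something holds a claim on the device (a mounted filesystem, an md/RAID member, a device-mapper target, active swap). The name device_in_use is made up for illustration; as far as I know, a similar test is what produces the badblocks "apparently in use" message quoted above.

/* Hypothetical sketch (not diskscan code): refuse a destructive fix run
 * while the kernel holds a claim on the block device. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if the device is busy, 0 if it is free, -1 on other errors. */
static int device_in_use(const char *path)
{
    int fd = open(path, O_RDONLY | O_EXCL);
    if (fd < 0)
        return errno == EBUSY ? 1 : -1;
    close(fd);
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]);
        return 2;
    }
    switch (device_in_use(argv[1])) {
    case 1:
        fprintf(stderr, "%s is in use by the system; refusing to fix\n", argv[1]);
        return 1;
    case 0:
        printf("%s appears free\n", argv[1]);
        return 0;
    default:
        perror(argv[1]);
        return 2;
    }
}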

micah commented 9 years ago

It does seem to work fine in 0.17.

I got a failure message after it ran ("Conclusion: failed due to IO errors"), but I think that is because it remapped a bad sector. Running it a second time didn't give me that message.

baruch commented 9 years ago

In that case it's fine now. Note that the issue can sometimes return later; if that happens, I normally wouldn't wait for it to happen a third time: I'd back up all the data and replace the drive. If it happens only once, or only after a very long time (> 6 months), you can consider it random behavior; if it happens faster than that, you should assume (IMNSHO) that the disk is going to die at some point in the future, and you most likely don't want to wait for it.