baruch / diskscan

Scan disk for bad or near failure sectors, performs disk diagnostics
GNU General Public License v3.0

Speed up the 'scan then fix' use case #33

Closed: g2p closed this issue 9 years ago

g2p commented 9 years ago

One of my diskscans (without --fix) stopped at the first error:

E: IO failed with no sense: status=2 mask=1 driver=8 msg=0 host=0
E: Error when reading at offset 123456789 size 65536 read 65536: Success
E: Details: error=fatal data=full 00/00/00
E: Fatal error occurred, bailing out.

The kernel log has more information (an uncorrectable sector). I don't know how much of it is passed to userland; apparently not through the "sense" data.

[13645.452988] ata6.00: exception Emask 0x0 SAct 0x800000 SErr 0x0 action 0x0
[13645.452992] ata6.00: irq_stat 0x40000008
[13645.452995] ata6.00: failed command: READ FPDMA QUEUED
[13645.453000] ata6.00: cmd 60/80:b8:a9:7e:3d/00:00:82:00:00/40 tag 23 ncq 65536 in
[13645.453003] ata6.00: status: { DRDY ERR }
[13645.453005] ata6.00: error: { UNC }
[13645.465194] ata6.00: configured for UDMA/133
[13645.465214] ata6: EH complete
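For readers who want to pull the failing sector out of a kernel log like this one, here is a minimal sketch in C that decodes the `cmd` line. It assumes the field order that, to my understanding, the libata error handler prints (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and the NCQ convention that the sector count travels in the feature registers and the tag in the upper bits of nsect. The struct and function names are hypothetical and are not part of diskscan.

```c
#include <stdint.h>
#include <stdio.h>

/* Taskfile fields of a libata "cmd" log line, in printed order (assumed). */
struct ata_cmd_fields {
    unsigned command, feature, nsect;
    unsigned lbal, lbam, lbah;
    unsigned hob_feature, hob_nsect;
    unsigned hob_lbal, hob_lbam, hob_lbah;
    unsigned device;
};

/* Parse text starting at "cmd ..."; returns 0 on success, -1 otherwise. */
int parse_ata_cmd(const char *line, struct ata_cmd_fields *f)
{
    int n = sscanf(line, "cmd %x/%x:%x:%x:%x:%x/%x:%x:%x:%x:%x/%x",
                   &f->command, &f->feature, &f->nsect,
                   &f->lbal, &f->lbam, &f->lbah,
                   &f->hob_feature, &f->hob_nsect,
                   &f->hob_lbal, &f->hob_lbam, &f->hob_lbah,
                   &f->device);
    return n == 12 ? 0 : -1;
}

/* Assemble the 48-bit LBA from the six LBA bytes. */
uint64_t ata_cmd_lba48(const struct ata_cmd_fields *f)
{
    return ((uint64_t)f->hob_lbah << 40) | ((uint64_t)f->hob_lbam << 32) |
           ((uint64_t)f->hob_lbal << 24) | ((uint64_t)f->lbah << 16) |
           ((uint64_t)f->lbam << 8)      |  (uint64_t)f->lbal;
}

/* For NCQ commands the sector count is carried in the feature registers. */
unsigned ata_ncq_sectors(const struct ata_cmd_fields *f)
{
    return (f->hob_feature << 8) | f->feature;
}

/* For NCQ commands the tag sits in bits 7:3 of the sector count register. */
unsigned ata_ncq_tag(const struct ata_cmd_fields *f)
{
    return f->nsect >> 3;
}
```

Applied to the `cmd` line quoted above, this decoding yields tag 23 and 128 sectors (65536 bytes), which agrees with the `tag 23 ncq 65536` text the kernel printed on the same line. The start LBA comes from `ata_cmd_lba48`; the failing sector lies somewhere inside that 128-sector request.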

I would prefer if the scan kept running in this case; this is a large disk and scanning good sectors again takes time.

Also, it would be extra helpful if --fix mode could reuse the log to start with error events and high-latency ranges. This way the total runtime for scanning then fixing could be halved.
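To make the request concrete, here is a rough sketch in C of how a fix pass could be ordered from a previous scan's results: ranges that returned an error come first, then the slowest ranges, and everything that was fast and clean is skipped. The `scan_range` record, the function names, and the latency threshold are all hypothetical; this does not reflect diskscan's actual log format or code.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one scanned range (not diskscan's log format). */
struct scan_range {
    uint64_t offset;      /* byte offset of the range */
    uint32_t length;      /* size of the range in bytes */
    uint32_t latency_ms;  /* latency observed during the scan */
    int had_error;        /* non-zero if the read failed */
};

/* Order: errored ranges first, then by latency (slowest first), then by offset. */
static int fix_priority_cmp(const void *a, const void *b)
{
    const struct scan_range *x = a, *y = b;
    int xe = !!x->had_error, ye = !!y->had_error;

    if (xe != ye)
        return ye - xe;
    if (x->latency_ms != y->latency_ms)
        return x->latency_ms < y->latency_ms ? 1 : -1;
    if (x->offset != y->offset)
        return x->offset < y->offset ? -1 : 1;
    return 0;
}

/*
 * Sort the ranges in place and return how many of them, counted from the
 * front, deserve a visit in the fix pass: every errored range plus every
 * range at or above the latency threshold.
 */
size_t plan_fix_pass(struct scan_range *r, size_t n, uint32_t threshold_ms)
{
    size_t keep = 0;

    qsort(r, n, sizeof *r, fix_priority_cmp);
    while (keep < n &&
           (r[keep].had_error || r[keep].latency_ms >= threshold_ms))
        keep++;
    return keep;
}
```

With this ordering the fix pass only touches the first `keep` entries, so the good sectors are not read a second time.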

baruch commented 9 years ago

What platform are you on, Linux or a BSD? What is the symlink arch/arch.c pointing to? On Linux, diskscan issues SCSI commands and should get the full sense buffer, so it's strange that it got no info.

I'll also take a look at why it stopped on the error. I agree it should continue to test and report on the entire disk.

baruch commented 9 years ago

Looks like you are on Linux, but for some reason the SATA driver didn't translate the error into SCSI sense data and only returned it as a driver error. Can you provide the output of lspci? That will tell me which SATA controller you have.

Currently any unknown error is treated as fatal by default, and a fatal error stops the scan because it is taken to mean the disk is fully dead and there is no reason to continue. I'll think about this and see what I can do.

baruch commented 9 years ago

I made unknown errors non-fatal. Can you please test it and let me know if it works for you?
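As an illustration of the policy discussed in this thread (not the actual diskscan change), here is a small C sketch in which a failed I/O that arrives without sense data is classified as unknown rather than fatal, so a scan keeps going past it. The type and function names are hypothetical, and treating the HARDWARE ERROR sense key as the only fatal case is an assumption made for the example. The constants are standard SCSI values: status 0x02 is CHECK CONDITION, sense key 0x3 is MEDIUM ERROR, and sense key 0x4 is HARDWARE ERROR.

```c
#include <stddef.h>

enum io_severity { IO_OK, IO_BAD_SECTOR, IO_UNKNOWN, IO_FATAL };

/* Hypothetical summary of one completed I/O. */
struct io_result {
    unsigned char scsi_status; /* 0x00 = GOOD, 0x02 = CHECK CONDITION */
    unsigned char sense_len;   /* bytes of sense data returned, 0 if none */
    unsigned char sense_key;   /* meaningful only when sense_len > 0 */
};

/* Classify one I/O; a failure without sense data is unknown, not fatal. */
enum io_severity classify_io(const struct io_result *r)
{
    if (r->scsi_status == 0x00)
        return IO_OK;
    if (r->sense_len == 0)
        return IO_UNKNOWN;
    switch (r->sense_key & 0x0f) {
    case 0x3: /* MEDIUM ERROR: record the bad range, keep scanning */
        return IO_BAD_SECTOR;
    case 0x4: /* HARDWARE ERROR: assumed fatal in this sketch */
        return IO_FATAL;
    default:
        return IO_UNKNOWN;
    }
}

/* Number of chunks a scan attempts before a fatal result stops it. */
size_t chunks_scanned(const struct io_result *results, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (classify_io(&results[i]) == IO_FATAL)
            return i + 1; /* the fatal chunk was attempted, then stop */
    return n;
}
```

Under this policy, a result like the one at the top of the thread (status 2 with no usable sense data) counts as `IO_UNKNOWN`, and the remaining chunks are still scanned.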

g2p commented 9 years ago

Testing, will report back in a few hours.

Relevant lspci bits:

00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family SATA AHCI Controller (rev 05) (prog-if 01 [AHCI 1.0])
    Subsystem: ASUSTeK Computer Inc. P8 series motherboard
    Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 46
    I/O ports at f070 [size=8]
    I/O ports at f060 [size=4]
    I/O ports at f050 [size=8]
    I/O ports at f040 [size=4]
    I/O ports at f020 [size=32]
    Memory at f6306000 (32-bit, non-prefetchable) [size=2K]
    Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
    Capabilities: [70] Power Management version 3
    Capabilities: [a8] SATA HBA v1.0
    Capabilities: [b0] PCI Advanced Features
    Kernel driver in use: ahci

baruch commented 9 years ago

The SATA controller is the same one I have, but my own disk has no uncorrectable errors, so I can't observe this behaviour myself. Testing it would be non-trivial: I would need to delve into READ LONG and WRITE LONG and hope there is a SATA equivalent for them.

g2p commented 9 years ago

With this change I was able to complete the scan. Thank you.

baruch commented 9 years ago

Great!