baruch / diskscan

Scan disk for bad or near failure sectors, performs disk diagnostics
GNU General Public License v3.0

monitor commands causing POR #70

Open ziegi opened 8 months ago

ziegi commented 8 months ago

I am running diskscan 0.19 (I also tried master and 0.20) on Debian 10 with kernel 5.8 and on Debian 12 with kernel 6.5, accessing SATA disks (6-16 TB; Seagate, WD, Toshiba) attached to an LSI SAS adapter through the Linux mpt3sas driver.

Each time one of the following functions in lib/diskscan.c (and possibly others) is executed, a command times out and the drive does a power-on reset (POR):

static void disk_ata_monitor_start(disk_t *disk)
static void disk_ata_monitor(disk_t *disk)

The kernel log shows:

kernel: sd 0:0:1:0: attempting task abort!scmd(0x00000000bfee609e), outstanding for 62048 ms & timeout 60000 ms
kernel: sd 0:0:1:0: [sdb] tag#3615 CDB: ATA command pass through(12)/Blank a1 0c 0e d0 01 00 4f c2 00 b0 00 00
kernel: scsi target0:0:1: handle(0x001a), sas_address(0x300605b012dd2901), phy(1)
kernel: scsi target0:0:1: enclosure logical id(0x300605b012112900), slot(8) 
kernel: scsi target0:0:1: enclosure level(0x0000), connector name( C2  )
kernel: sd 0:0:1:0: task abort: SUCCESS scmd(0x00000000bfee609e)
kernel: mpt3sas_cm0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)
kernel: sd 0:0:1:0: Power-on or device reset occurred
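
To make the timed-out command easier to read, the CDB from the log can be split into its fields. The following is a minimal sketch, assuming the standard 12-byte ATA PASS-THROUGH layout from the SAT specification. The struct and function names (`ata_pt12`, `decode_ata_pt12`, `logged_cdb`) are hypothetical and are not part of diskscan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Selected fields of a 12-byte ATA PASS-THROUGH CDB (SAT layout). */
struct ata_pt12 {
    uint8_t protocol;  /* byte 1, bits 4..1: 4 = PIO data-in, 6 = DMA */
    uint8_t t_dir;     /* byte 2, bit 3: 1 = data flows from the device */
    uint8_t features;  /* byte 3 */
    uint8_t count;     /* byte 4 */
    uint8_t lba_low;   /* byte 5 */
    uint8_t lba_mid;   /* byte 6 */
    uint8_t lba_high;  /* byte 7 */
    uint8_t command;   /* byte 9 */
};

/* Returns 0 on success, -1 if buf is not an ATA PASS-THROUGH(12) CDB. */
static int decode_ata_pt12(const uint8_t *cdb, size_t len, struct ata_pt12 *out)
{
    assert(cdb != NULL && out != NULL);
    if (len != 12 || cdb[0] != 0xa1)
        return -1;
    out->protocol = (cdb[1] >> 1) & 0x0f;
    out->t_dir    = (cdb[2] >> 3) & 0x01;
    out->features = cdb[3];
    out->count    = cdb[4];
    out->lba_low  = cdb[5];
    out->lba_mid  = cdb[6];
    out->lba_high = cdb[7];
    out->command  = cdb[9];
    return 0;
}

/* The CDB reported in the kernel log above. */
static const uint8_t logged_cdb[12] = {
    0xa1, 0x0c, 0x0e, 0xd0, 0x01, 0x00, 0x4f, 0xc2, 0x00, 0xb0, 0x00, 0x00
};
```

Decoding `logged_cdb` gives protocol 6 (DMA), features 0xd0 and command 0xb0 with LBA mid/high 0x4f/0xc2, which is SMART READ DATA requested with the DMA protocol.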

I tried increasing the timeouts, but without success. So I am using the following workaround to exclude a drive POR from the errors:

--- diskscan-0.20/lib/diskscan.c    2017-08-25 21:24:14.000000000 +0200
+++ ../diskscan-0.20/lib/diskscan.c 2024-01-10 11:30:21.933342563 +0100
@@ -498,7 +498,8 @@
    data_log(&disk->data_log, offset/disk->sector_size, data_size/disk->sector_size, &io_res, t);

    // Handle error or incomplete data
-   if (io_res.data != DATA_FULL || io_res.error != ERROR_NONE) {
+   if ((io_res.data != DATA_FULL || io_res.error != ERROR_NONE) 
+       && !(errno == 0 && io_res.info.sense_key == 0x06 && io_res.info.asc == 0x29 && io_res.info.ascq == 0x00) /* ignore POR */) {
        int s_errno = errno;
        ERROR("Error when reading at offset %" PRIu64 " size %d read %zd, errno=%d: %s", offset, data_size, ret, errno, strerror(errno));
        ERROR("Details: error=%s data=%s %02X/%02X/%02X", error_to_str(io_res.error), data_to_str(io_res.data),
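
The sense triple tested in the patch (06h/29h/00h) is UNIT ATTENTION with "power on, reset, or bus device reset occurred". As a sketch, the check could be pulled into a named predicate to keep the condition readable. The helper and macro names below are hypothetical and do not exist in diskscan. The helper only identifies the condition; it does not decide whether a POR should be ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* SCSI sense values for "power on, reset, or bus device reset occurred". */
#define SENSE_KEY_UNIT_ATTENTION 0x06
#define ASC_POWER_ON_RESET       0x29
#define ASCQ_POWER_ON_RESET      0x00

/* Hypothetical helper: true only for the exact 06h/29h/00h triple. */
static bool sense_is_power_on_reset(uint8_t sense_key, uint8_t asc, uint8_t ascq)
{
    return sense_key == SENSE_KEY_UNIT_ATTENTION &&
           asc == ASC_POWER_ON_RESET &&
           ascq == ASCQ_POWER_ON_RESET;
}
```

Other ASCQ values under ASC 29h (for example 29h/01h, power on occurred) are deliberately not matched, which mirrors the exact comparison in the patch.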

I guess a better solution would be to change the ata_monitor commands, but unfortunately I do not know how.
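
One candidate change, offered as an untested assumption: the CDB in the log has byte 1 = 0x0c, which selects the DMA protocol, while ATA defines SMART READ DATA as a PIO data-in command. The sketch below builds the same command with the PIO data-in protocol (byte 1 = 0x08). The function name is hypothetical and this is not how diskscan builds its CDBs; whether the protocol field is what triggers the timeout on this HBA has not been verified.

```c
#include <stdint.h>
#include <string.h>

enum { SAT_PROTO_PIO_DATA_IN = 4 };  /* SAT protocol field value */

/* Hypothetical builder: SMART READ DATA as ATA PASS-THROUGH(12), PIO data-in. */
static void build_smart_read_data_cdb12(uint8_t cdb[12])
{
    memset(cdb, 0, 12);
    cdb[0] = 0xa1;                        /* ATA PASS-THROUGH(12) opcode */
    cdb[1] = SAT_PROTO_PIO_DATA_IN << 1;  /* protocol in bits 4..1 -> 0x08 */
    cdb[2] = 0x0e;  /* t_dir = from device, byte_block = 1, t_length = count field */
    cdb[3] = 0xd0;  /* features: SMART READ DATA */
    cdb[4] = 0x01;  /* count: one 512-byte block */
    cdb[6] = 0x4f;  /* LBA mid: SMART signature */
    cdb[7] = 0xc2;  /* LBA high: SMART signature */
    cdb[9] = 0xb0;  /* command: SMART */
}
```

The result differs from the logged CDB only in byte 1 (0x08 instead of 0x0c), so the two can be compared directly on an affected system.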

baruch commented 6 months ago

If the drives hit a timeout and do a reset, that is not something that should be skipped or ignored.