AnalogJ / scrutiny

Hard Drive S.M.A.R.T Monitoring, Historical Trends & Real World Failure Thresholds
MIT License

[FEAT] Allow a delay or different schedules between multiple disks checks #706

Closed (pabsi closed 4 days ago)

pabsi commented 3 weeks ago

Is your feature request related to a problem? Please describe. The issue arises when running SMART checks over multiple disks connected through USB-to-SATA bridges. In my case, I have the Quad SATA HAT for the Pi 4, so 4 SATA disks are connected via 2 USB 3.0 ports. Sometimes, when the SMART checks run against all 4 drives at once, the USB connection gets reset. That makes the mdadm RAID array mark the devices as failed and remove them from the array. It's not a real issue, since I can --re-add them later, but it's very inconvenient, especially when the SMART checks run daily. See this example of dmesg logs:

[Wed Oct 30 04:00:05 2024] usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
[Wed Oct 30 04:00:06 2024] usb 2-1: reset SuperSpeed USB device number 3 using xhci_hcd
[Wed Oct 30 04:00:06 2024] usb 2-1: reset SuperSpeed USB device number 3 using xhci_hcd
[Wed Oct 30 04:00:06 2024] usb 2-1: reset SuperSpeed USB device number 3 using xhci_hcd
[Wed Oct 30 04:00:06 2024] usb 2-1: reset SuperSpeed USB device number 3 using xhci_hcd
[Wed Oct 30 04:00:06 2024] usb 2-1: reset SuperSpeed USB device number 3 using xhci_hcd
[Thu Oct 31 04:00:06 2024] usb 2-2: reset SuperSpeed USB device number 2 using xhci_hcd
[Thu Oct 31 04:00:06 2024] sd 0:0:0:1: [sdb] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=DRIVER_OK cmd_age=0s
[Thu Oct 31 04:00:06 2024] sd 0:0:0:1: [sdb] tag#0 CDB: opcode=0x85 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[Thu Oct 31 04:00:07 2024] usb 2-2: USB disconnect, device number 2
[Thu Oct 31 04:00:07 2024] md: super_written gets error=-5
[Thu Oct 31 04:00:07 2024] md/raid10:md0: Disk failure on sdb1, disabling device.
                           md/raid10:md0: Operation continuing on 3 devices.
[Thu Oct 31 04:00:07 2024] md: super_written gets error=-5
[Thu Oct 31 04:00:07 2024] md/raid10:md0: Disk failure on sda1, disabling device.
                           md/raid10:md0: Operation continuing on 2 devices.
[Thu Oct 31 04:00:07 2024] usb 2-1: reset SuperSpeed USB device number 3 using xhci_hcd
[Thu Oct 31 04:00:12 2024] usb 2-2: new SuperSpeed USB device number 4 using xhci_hcd
[Thu Oct 31 04:00:12 2024] usb 2-2: New USB device found, idVendor=1058, idProduct=0a10, bcdDevice=81.36
[Thu Oct 31 04:00:12 2024] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=5
[Thu Oct 31 04:00:12 2024] usb 2-2: Product: JMS56x Series
[Thu Oct 31 04:00:12 2024] usb 2-2: Manufacturer: JMicron
[Thu Oct 31 04:00:12 2024] usb 2-2: SerialNumber: 1234567890123
[Thu Oct 31 04:00:12 2024] usb 2-2: UAS is ignored for this device, using usb-storage instead
[Thu Oct 31 04:00:12 2024] usb 2-2: UAS is ignored for this device, using usb-storage instead
[Thu Oct 31 04:00:12 2024] usb-storage 2-2:1.0: USB Mass Storage device detected
[Thu Oct 31 04:00:12 2024] usb-storage 2-2:1.0: Quirks match for vid 1058 pid 0a10: 800000
[Thu Oct 31 04:00:12 2024] scsi host0: usb-storage 2-2:1.0
[Thu Oct 31 04:00:13 2024] scsi 0:0:0:0: Direct-Access     Samsung  SSD 850 EVO 2TB  8136 PQ: 0 ANSI: 6
[Thu Oct 31 04:00:13 2024] sd 0:0:0:0: Attached scsi generic sg0 type 0
[Thu Oct 31 04:00:13 2024] sd 0:0:0:0: [sda] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[Thu Oct 31 04:00:13 2024] sd 0:0:0:0: [sda] Write Protect is off
[Thu Oct 31 04:00:13 2024] sd 0:0:0:0: [sda] Mode Sense: 47 00 10 08
[Thu Oct 31 04:00:13 2024] scsi 0:0:0:1: Direct-Access     CT2000BX 500SSD1          8136 PQ: 0 ANSI: 6
[Thu Oct 31 04:00:13 2024] sd 0:0:0:0: [sda] No Caching mode page found
[Thu Oct 31 04:00:13 2024] sd 0:0:0:0: [sda] Assuming drive cache: write through
[Thu Oct 31 04:00:13 2024] sd 0:0:0:1: Attached scsi generic sg1 type 0
[Thu Oct 31 04:00:13 2024] sd 0:0:0:1: [sdb] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[Thu Oct 31 04:00:13 2024] sd 0:0:0:1: [sdb] Write Protect is off
[Thu Oct 31 04:00:13 2024] sd 0:0:0:1: [sdb] Mode Sense: 47 00 10 08
[Thu Oct 31 04:00:13 2024] sd 0:0:0:1: [sdb] No Caching mode page found
[Thu Oct 31 04:00:13 2024] sd 0:0:0:1: [sdb] Assuming drive cache: write through
[Thu Oct 31 04:00:13 2024]  sda: sda1
[Thu Oct 31 04:00:13 2024] sd 0:0:0:0: [sda] Attached SCSI disk
[Thu Oct 31 04:00:13 2024]  sdb: sdb1
[Thu Oct 31 04:00:13 2024] sd 0:0:0:1: [sdb] Attached SCSI disk

I say "sometimes" because there are times when the drives don't disconnect even though all 4 checks run at once. But I have seen more stability when running the SMART checks one by one, disk by disk, with a delay in between (a few seconds normally does the job).

Describe the solution you'd like One option would be an environment variable (e.g. DELAY_BETWEEN_DISK_CHECKS or whatever, naming is hard) that sets a delay between disk checks. Another option would be a per-drive schedule, but I think that would be far more engineering for a very specific problem that not everyone has.

I would do it myself, but unfortunately I am not savvy enough on Go :(

Additional context N/A

Other notes Thank you so much for your work. Really appreciate it :1st_place_medal:

pabsi commented 3 weeks ago

It could probably just be a matter of adding a sleep of some sort, based on the env var I suggested, in this for loop: https://github.com/AnalogJ/scrutiny/blob/master/collector/pkg/collector/metrics.go#L87

pabsi commented 3 weeks ago

Revisiting the code, I just realised there's a TODO about this very same topic :sweat_smile: : https://github.com/AnalogJ/scrutiny/blob/master/collector/pkg/collector/metrics.go#L93-L94

AnalogJ commented 2 weeks ago

Hey @pabsi, I'd be happy to consider a change like this, if it's optional and configurable via the collector config YAML file.

Can you open a PR?

pabsi commented 2 weeks ago

I can try :)

As I said in the original post:

I would do it myself, but unfortunately I am not savvy enough on Go :(

But I'll give it a go ;)

pabsi commented 2 weeks ago

@AnalogJ I can't raise a PR. GitHub threw me an error about not being a contributor.

You can see what I did here: https://github.com/AnalogJ/scrutiny/compare/AnalogJ:master...pabsi:706-add-wait-time-between-checks?expand=1

The test for the collector (go run collector/cmd/collector-metrics/collector-metrics.go run --debug) worked fine.

Regards.

pabsi commented 6 hours ago

Really sorry to bug you @AnalogJ but I had to submit a small fix for this PR: https://github.com/AnalogJ/scrutiny/pull/725

Thank you :pray: