canonical / hotsos

Software analysis toolkit. Define checks in high-level language and leverage library to perform analysis of common Cloud applications.
Apache License 2.0

Add a scenario for reporting "critical medium error" disk failures #883

Closed by xmkg 5 months ago

xmkg commented 6 months ago

The kernel logs an error when it has trouble reading from or writing to a disk sector, so we can use these messages to report possible disk failures.

Example kern.log:

2023-11-11T06:30:30.786134+05:30 host-name kernel: [1490644.477294] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
2023-11-11T06:30:30.786156+05:30 host-name kernel: [1490644.477304] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
2023-11-11T06:30:30.786159+05:30 host-name kernel: [1490644.477308] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error
2023-11-11T06:30:30.786161+05:30 host-name kernel: [1490644.477313] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 00 85 4f 28 00 00 08 00
2023-11-11T06:30:30.786163+05:30 host-name kernel: [1490644.477316] print_req_error: critical medium error, dev sdb, sector 8736552
2023-11-11T06:30:30.845875+05:30 host-name kernel: [1490644.537055] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
2023-11-11T06:30:30.845893+05:30 host-name kernel: [1490644.537058] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
2023-11-11T06:30:30.845896+05:30 host-name kernel: [1490644.537060] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error
2023-11-11T06:30:30.845900+05:30 host-name kernel: [1490644.537062] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 00 85 1e 38 00 00 08 00
2023-11-11T06:30:30.845904+05:30 host-name kernel: [1490644.537064] print_req_error: critical medium error, dev sdb, sector 8724024
2023-11-11T06:30:30.971339+05:30 host-name kernel: [1490644.662495] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
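
As a rough sketch of how these messages could be turned into a report, the snippet below counts "critical medium error" lines per device. The pattern and the function name `count_medium_errors` are assumptions made for illustration; they are not the expression or the API that hotsos uses.

```python
import re
from collections import Counter

# ASSUMPTION: illustrative pattern, not the one used by hotsos.
# It looks only at the message body and ignores the timestamp.
MEDIUM_ERROR = re.compile(r"critical medium error, dev ([^,\s]+), sector (\d+)")


def count_medium_errors(lines):
    """Return a Counter mapping device name to the number of medium errors."""
    counts = Counter()
    for line in lines:
        match = MEDIUM_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Lines taken from the kern.log excerpt above.
sample = [
    "2023-11-11T06:30:30.786159+05:30 host-name kernel: [1490644.477308] "
    "sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error",
    "2023-11-11T06:30:30.786163+05:30 host-name kernel: [1490644.477316] "
    "print_req_error: critical medium error, dev sdb, sector 8736552",
    "2023-11-11T06:30:30.845904+05:30 host-name kernel: [1490644.537064] "
    "print_req_error: critical medium error, dev sdb, sector 8724024",
]

print(dict(count_medium_errors(sample)))
```

A scenario could then raise an issue for any device whose count is non-zero.
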

pponnuvel commented 6 months ago

This is already covered by: https://github.com/canonical/hotsos/blob/main/hotsos/defs/scenarios/kernel/disk_failure.yaml

xmkg commented 6 months ago

@pponnuvel

Ah, I see. Oddly, the scenario did not report it even though the error is present in /var/log/kern.log. I believe this is because the kern.log lines differ slightly from what the pattern is designed to match:

<6>2024-04-17T00:26:33.540821+05:30 host-name kernel: [   29.751657] i40e 0000:5d:00.0 eth5: Changing Rx descriptor count from 512 to 4096
<6>2024-04-17T00:26:33.540822+05:30 host-name kernel: [   29.782098] sd 1:0:0:0: [sdb] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
<6>2024-04-17T00:26:33.540823+05:30 host-name kernel: [   29.782101] sd 1:0:0:0: [sdb] tag#6 Sense Key : Medium Error [current] 
<6>2024-04-17T00:26:33.540829+05:30 host-name kernel: [   29.782104] sd 1:0:0:0: [sdb] tag#6 Add. Sense: Unrecovered read error
<6>2024-04-17T00:26:33.540831+05:30 host-name kernel: [   29.782107] sd 1:0:0:0: [sdb] tag#6 CDB: Read(10) 28 00 00 85 1f 80 00 00 80 00
<3>2024-04-17T00:26:33.540832+05:30 host-name kernel: [   29.782109] print_req_error: critical medium error, dev sdb, sector 8724376

The regex pattern does not match any of the lines above.
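
To illustrate the kind of mismatch I mean, here is a minimal sketch. `SYSLOG_ANCHORED` is a hypothetical pattern that assumes a traditional syslog timestamp at the start of the line; it is not the actual expression from disk_failure.yaml, and the `old_style` line is a made-up variant of the real one with only the timestamp changed.

```python
import re

# ASSUMPTION: hypothetical pattern anchored on a "Mon DD HH:MM:SS" timestamp.
# This is NOT the real hotsos expression, only an example of the failure mode.
SYSLOG_ANCHORED = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+kernel:\s+\[\s*\d+\.\d+\]\s+"
    r".*critical medium error, dev ([^,\s]+),"
)

# Made-up line using a traditional syslog timestamp.
old_style = (
    "Jun  8 10:48:13 host-name kernel: [   29.782109] "
    "print_req_error: critical medium error, dev sdb, sector 8724376"
)
# Real line from the excerpt above: priority prefix plus ISO 8601 timestamp.
new_style = (
    "<3>2024-04-17T00:26:33.540832+05:30 host-name kernel: [   29.782109] "
    "print_req_error: critical medium error, dev sdb, sector 8724376"
)

print(bool(SYSLOG_ANCHORED.match(old_style)))  # matches
print(bool(SYSLOG_ANCHORED.match(new_style)))  # does not match
```

Both the `<3>` priority prefix and the ISO 8601 timestamp are enough on their own to defeat a pattern anchored this way.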

pponnuvel commented 6 months ago

It seems the kern.log timestamp format isn't fixed. The following are all valid:

Jun  8 10:48:13 
Jun 08 10:48:13
2024-04-17T00:26:33.540832+05:30

Perhaps we could drop the part of the regex that matches the timestamp.
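
A minimal sketch of what that might look like, assuming we anchor on the `kernel:` tag and the uptime bracket instead of the timestamp. The pattern and the helper `parse_medium_error` are hypothetical and only meant to show the idea; the real change would go in the scenario's YAML definition.

```python
import re

# ASSUMPTION: hypothetical timestamp-agnostic pattern. It uses search()
# rather than match(), so whatever precedes "kernel:" is ignored.
PATTERN = re.compile(
    r"kernel:\s+\[\s*\d+\.\d+\]\s+.*critical medium error, "
    r"dev ([^,\s]+), sector (\d+)"
)

MESSAGE = (
    "host-name kernel: [   29.782109] "
    "print_req_error: critical medium error, dev sdb, sector 8724376"
)

# The three timestamp formats listed above, plus the priority-prefixed variant.
PREFIXES = [
    "Jun  8 10:48:13",
    "Jun 08 10:48:13",
    "2024-04-17T00:26:33.540832+05:30",
    "<3>2024-04-17T00:26:33.540832+05:30",
]


def parse_medium_error(line):
    """Return (device, sector) if the line reports a medium error, else None."""
    match = PATTERN.search(line)
    return (match.group(1), int(match.group(2))) if match else None


for prefix in PREFIXES:
    print(prefix, "->", parse_medium_error(f"{prefix} {MESSAGE}"))
```

The trade-off is that the pattern no longer validates the start of the line, so it relies on the `kernel:` tag and message text being specific enough to avoid false positives.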