xmkg closed this 5 months ago
This is already covered by: https://github.com/canonical/hotsos/blob/main/hotsos/defs/scenarios/kernel/disk_failure.yaml
@pponnuvel
Ah, I see. Oddly, the scenario did not report it even though the error is present in /var/log/kern.log. I believe this is because the kern.log lines differ slightly from what the pattern is designed to match:
<6>2024-04-17T00:26:33.540821+05:30 host-name kernel: [ 29.751657] i40e 0000:5d:00.0 eth5: Changing Rx descriptor count from 512 to 4096
<6>2024-04-17T00:26:33.540822+05:30 host-name kernel: [ 29.782098] sd 1:0:0:0: [sdb] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
<6>2024-04-17T00:26:33.540823+05:30 host-name kernel: [ 29.782101] sd 1:0:0:0: [sdb] tag#6 Sense Key : Medium Error [current]
<6>2024-04-17T00:26:33.540829+05:30 host-name kernel: [ 29.782104] sd 1:0:0:0: [sdb] tag#6 Add. Sense: Unrecovered read error
<6>2024-04-17T00:26:33.540831+05:30 host-name kernel: [ 29.782107] sd 1:0:0:0: [sdb] tag#6 CDB: Read(10) 28 00 00 85 1f 80 00 00 80 00
<3>2024-04-17T00:26:33.540832+05:30 host-name kernel: [ 29.782109] print_req_error: critical medium error, dev sdb, sector 8724376
The regex pattern does not match any of the lines above, as can be seen here
It seems the kern.log timestamp format isn't fixed. The following are all valid:
Jun 8 10:48:13
Jun 08 10:48:13
2024-04-17T00:26:33.540832+05:30
Perhaps we could drop the part of the regex that matches the timestamp.
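To illustrate the idea, here is a rough sketch of a pattern that anchors on the `kernel:` tag and ignores whatever timestamp precedes it. This is not the pattern used in disk_failure.yaml; the names `MEDIUM_ERROR` and `find_medium_errors` are made up for this example, and the two syslog-style sample lines are synthetic variants of the line posted above with only the timestamp changed.

```python
import re

# Hypothetical pattern: no timestamp matching at all. It starts at the
# "kernel:" tag, skips the "[ uptime ]" field and the reporting function
# name, then captures the device and sector.
MEDIUM_ERROR = re.compile(
    r"kernel:\s+\[\s*\d+\.\d+\]\s+"
    r"\S+: critical medium error, dev (?P<dev>\S+), sector (?P<sector>\d+)"
)


def find_medium_errors(lines):
    """Return (device, sector) tuples for every medium error line."""
    results = []
    for line in lines:
        match = MEDIUM_ERROR.search(line)
        if match:
            results.append((match.group("dev"), int(match.group("sector"))))
    return results


# First line is from this thread; the next two are synthetic variants that
# only swap in the other timestamp styles. The last line must not match.
sample = [
    "<3>2024-04-17T00:26:33.540832+05:30 host-name kernel: [ 29.782109] "
    "print_req_error: critical medium error, dev sdb, sector 8724376",
    "Jun 8 10:48:13 host-name kernel: [ 29.782109] "
    "print_req_error: critical medium error, dev sdb, sector 8724376",
    "Jun 08 10:48:13 host-name kernel: [ 29.782109] "
    "print_req_error: critical medium error, dev sdb, sector 8724376",
    "<6>2024-04-17T00:26:33.540821+05:30 host-name kernel: [ 29.751657] "
    "i40e 0000:5d:00.0 eth5: Changing Rx descriptor count from 512 to 4096",
]

errors = find_medium_errors(sample)
```

Using `search` instead of `match` is what lets the timestamp (and any `<N>` priority prefix) vary freely, at the cost of no longer extracting the time of the event from the line.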
The kernel reports when it encounters trouble reading from or writing to a disk sector, so we can leverage these messages to report possible disk failures.
Example kern.log: