canonical / hotsos

Software analysis toolkit. Define checks in high-level language and leverage library to perform analysis of common Cloud applications.
Apache License 2.0

Detect scsi errors in kernel or journal logs #631

Open nkshirsagar opened 1 year ago

nkshirsagar commented 1 year ago

The sosreports on SF00361947 had kernel logs full of SCSI I/O errors and controller errors like the ones below, and hotsos should detect them. The BRCM errors indicate an underrun, i.e., the device driver expected 0x2000 bytes but received 0x0 back from the controller.

Jun 06 10:18:14 sf-sby001-hc0120-rack013 kernel: sd 0:0:11:0: [sdh] tag#2713 BRCM Debug mfi stat 0x2d, data len requested/completed 0x2000/0x0
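
To make the underrun reading concrete, here is a minimal Python sketch that pulls the requested and completed lengths out of such a line and flags a short transfer. The regex is an assumption derived only from the sample line above, the names `UNDERRUN_RE` and `parse_underrun` are made up for illustration, and none of this is hotsos code.

```python
import re

# Assumed pattern, derived only from the single sample line in this issue.
UNDERRUN_RE = re.compile(
    r"\[(?P<dev>sd[a-z]+)\] tag#(?P<tag>\d+) BRCM Debug mfi stat "
    r"(?P<stat>0x[0-9a-fA-F]+), data len requested/completed "
    r"(?P<requested>0x[0-9a-fA-F]+)/(?P<completed>0x[0-9a-fA-F]+)"
)


def parse_underrun(line):
    """Return details of a short transfer, or None if the line is not one."""
    m = UNDERRUN_RE.search(line)
    if m is None:
        return None
    requested = int(m.group("requested"), 16)
    completed = int(m.group("completed"), 16)
    if completed >= requested:
        # Everything requested was transferred, so this is not an underrun.
        return None
    return {
        "dev": m.group("dev"),
        "tag": int(m.group("tag")),
        "requested": requested,
        "completed": completed,
    }


SAMPLE = (
    "Jun 06 10:18:14 sf-sby001-hc0120-rack013 kernel: sd 0:0:11:0: [sdh] "
    "tag#2713 BRCM Debug mfi stat 0x2d, data len requested/completed 0x2000/0x0"
)
print(parse_underrun(SAMPLE))
```

For the sample line this reports device sdh with 8192 bytes requested and 0 completed.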

The blk_update_request error is an indication of an I/O issue and should be flagged.

Jun 06 10:18:14 sf-sby001-hc0120-rack013 kernel: blk_update_request: I/O error, dev sdh, sector 2738471384 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
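
As a rough sketch of how such lines could be flagged, the Python below counts blk_update_request I/O errors per device and operation. The pattern is an assumption based on the sample line; the `op ...` part is treated as optional in case some kernels omit it, and `BLK_ERR_RE` and `count_blk_errors` are hypothetical names, not part of hotsos.

```python
import re
from collections import Counter

# Assumed pattern based on the sample line; the "op 0x1:(WRITE)" part is
# optional here in case a kernel prints only the device and sector.
BLK_ERR_RE = re.compile(
    r"blk_update_request: I/O error, dev (?P<dev>[^,\s]+), sector (?P<sector>\d+)"
    r"(?: op (?P<op>0x[0-9a-fA-F]+):\((?P<opname>\w+)\))?"
)


def count_blk_errors(lines):
    """Count block I/O errors keyed by (device, operation name or None)."""
    counts = Counter()
    for line in lines:
        m = BLK_ERR_RE.search(line)
        if m:
            counts[(m.group("dev"), m.group("opname"))] += 1
    return counts


SAMPLE = [
    "Jun 06 10:18:14 sf-sby001-hc0120-rack013 kernel: blk_update_request: "
    "I/O error, dev sdh, sector 2738471384 op 0x1:(WRITE) flags 0x8800 "
    "phys_seg 1 prio class 0",
]
print(count_blk_errors(SAMPLE))
```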

And these are SCSI errors, which hotsos should also report:

Jun 06 10:18:14 sf-sby001-hc0120-rack013 kernel: sd 0:0:11:0: [sdh] tag#2712 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 06 10:18:14 sf-sby001-hc0120-rack013 kernel: sd 0:0:11:0: [sdh] tag#2713 Sense Key : Aborted Command [current]
Jun 06 10:18:14 sf-sby001-hc0120-rack013 kernel: sd 0:0:11:0: [sdh] tag#2713 Add. Sense: No additional sense information
Jun 06 10:18:14 sf-sby001-hc0120-rack013 kernel: sd 0:0:11:0: [sdh] tag#2713 CDB: Write(10) 2a 00 a3 34 a5 20 00 00 10 00
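
A minimal sketch of summarising these lines per device is shown below. It only looks at the `FAILED Result:` and `Sense Key :` lines, the regexes are assumptions taken from the excerpt above, and `scan_scsi_failures` is a hypothetical helper rather than anything in hotsos.

```python
import re

# Both patterns are assumptions derived from the log excerpt in this issue.
RESULT_RE = re.compile(
    r"\[(?P<dev>sd[a-z]+)\] tag#\d+ FAILED Result: "
    r"hostbyte=(?P<hostbyte>\w+) driverbyte=(?P<driverbyte>\w+)"
)
SENSE_RE = re.compile(
    r"\[(?P<dev>sd[a-z]+)\] tag#\d+ Sense Key : (?P<key>[^\[]+?)\s*\["
)


def scan_scsi_failures(lines):
    """Summarise failed SCSI commands and sense keys per device."""
    report = {}

    def entry(dev):
        return report.setdefault(
            dev, {"failed": 0, "results": set(), "sense_keys": set()}
        )

    for line in lines:
        m = RESULT_RE.search(line)
        if m:
            e = entry(m.group("dev"))
            e["failed"] += 1
            e["results"].add((m.group("hostbyte"), m.group("driverbyte")))
            continue
        m = SENSE_RE.search(line)
        if m:
            entry(m.group("dev"))["sense_keys"].add(m.group("key"))
    return report


PREFIX = "Jun 06 10:18:14 sf-sby001-hc0120-rack013 kernel: sd 0:0:11:0: [sdh] "
SAMPLE = [
    PREFIX + "tag#2712 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE",
    PREFIX + "tag#2713 Sense Key : Aborted Command [current]",
    PREFIX + "tag#2713 Add. Sense: No additional sense information",
]
print(scan_scsi_failures(SAMPLE))
```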

The lines that print the CDB of the failed command are also I/O errors and should be caught:

Jun 06 10:18:14 sf-sby001-hc0120-rack013 kernel: sd 0:0:11:0: [sdh] tag#2712 CDB: Write(10) 2a 00 a3 39 c1 d8 00 00 08 00
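
For context on what the CDB carries, here is a small Python sketch that decodes a 10-byte READ(10)/WRITE(10) CDB into its opcode, logical block address, and transfer length. The field offsets follow the standard 10-byte CDB layout; `decode_rw10` is a hypothetical helper name. For the line above, the decoded LBA equals the sector in the blk_update_request message, which is consistent with a device using 512-byte logical blocks (an assumption, since the issue does not state the block size).

```python
# Opcodes for the two 10-byte read/write commands handled by this sketch.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_rw10(cdb_hex):
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes.

    Returns (command name, logical block address, transfer length in blocks).
    """
    cdb = bytes.fromhex(cdb_hex)  # fromhex() skips the spaces between bytes
    if len(cdb) != 10 or cdb[0] not in OPCODES:
        raise ValueError("not a READ(10)/WRITE(10) CDB: %r" % cdb_hex)
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return OPCODES[cdb[0]], lba, blocks


print(decode_rw10("2a 00 a3 39 c1 d8 00 00 08 00"))
# ('WRITE(10)', 2738471384, 8)
```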

pponnuvel commented 7 months ago

Note that https://github.com/canonical/hotsos/pull/797 will catch all block I/O errors including the ones that result from SCSI errors.

I checked SF00361947 and it should be caught by #797 too. So I think capturing SCSI errors isn't really necessary anymore unless I am missing any situations where SCSI errors could occur (and we want to know that) but not trigger a "blk_update_request" error.

mfoliveira commented 7 months ago

Hi @pponnuvel Checking this per your request in MM.

#797 should catch only the blk_update_request lines, which should be sufficient for most cases (but I'm afraid the heading regex might not match in some cases if customers change the language/locale, so month names might differ?).
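
One way to sidestep the locale concern, sketched in Python under the assumption that the line always contains a `kernel: ` tag, is to ignore the timestamp and hostname altogether and match only on the message that follows the tag. `kernel_message` is a hypothetical helper, and the non-English and ISO-style sample lines are made up for illustration; this is not how #797 is implemented.

```python
import re

# Anchor on the "kernel: " tag instead of parsing the timestamp prefix.
KERNEL_MSG_RE = re.compile(r"\skernel: (?P<msg>.+)$")


def kernel_message(line):
    """Return the kernel message part of a log line, or None if absent."""
    m = KERNEL_MSG_RE.search(line)
    return m.group("msg") if m else None


MSG = "blk_update_request: I/O error, dev sdh, sector 2738471384"
LINES = [
    "Jun 06 10:18:14 sf-sby001-hc0120-rack013 kernel: " + MSG,
    # Hypothetical variants: localised month name and ISO-style timestamp.
    "Mär 06 10:18:14 sf-sby001-hc0120-rack013 kernel: " + MSG,
    "2023-06-06T10:18:14+0000 sf-sby001-hc0120-rack013 kernel: " + MSG,
]
for line in LINES:
    print(kernel_message(line))
```

The trade-off is that the timestamp is no longer validated or extracted, so a check that needs event times would still have to parse the prefix separately.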

However, it would not catch SCSI errors that are generated for reasons other than that (e.g., from SCSI controller software that sends passthrough SCSI commands independently of the block layer), or cases where, for some reason, the SCSI commands do not directly generate errors in the block layer / blk_update_request().
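
To illustrate the gap, the Python sketch below lists devices that logged a SCSI `FAILED Result:` line but no blk_update_request error. Both regexes are assumptions based on the log lines in the issue description, `scsi_only_devices` is a hypothetical helper, and the `sdg` line is invented sample data, not taken from a real sosreport.

```python
import re

# Assumed patterns based on the log lines quoted in the issue description.
SCSI_FAILED_RE = re.compile(r"\[(?P<dev>sd[a-z]+)\] tag#\d+ FAILED Result:")
BLK_ERR_RE = re.compile(r"blk_update_request: I/O error, dev (?P<dev>[^,\s]+),")


def scsi_only_devices(lines):
    """Devices with SCSI command failures but no block-layer I/O error."""
    scsi_devs, blk_devs = set(), set()
    for line in lines:
        m = SCSI_FAILED_RE.search(line)
        if m:
            scsi_devs.add(m.group("dev"))
        m = BLK_ERR_RE.search(line)
        if m:
            blk_devs.add(m.group("dev"))
    return scsi_devs - blk_devs


LINES = [
    "kernel: sd 0:0:11:0: [sdh] tag#2712 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE",
    "kernel: blk_update_request: I/O error, dev sdh, sector 2738471384 op 0x1:(WRITE)",
    # Hypothetical device that only shows a SCSI-level failure.
    "kernel: sd 0:0:10:0: [sdg] tag#100 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE",
]
print(scsi_only_devices(LINES))
```

Correlating by device name alone is coarse; it would miss a device that has both kinds of error from unrelated commands.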

The driver-specific message might be worth checking as well, but that should be rarer than the generic kernel messages for SCSI/block I/O errors.

pponnuvel commented 7 months ago

Thanks @mfoliveira for looking into this!

I am re-opening this to look into non-blk_update_request SCSI errors.

We'd need some real-world examples though because:

  1. We'd need examples to write test cases
  2. hotsos isn't designed to catch every possible error that might happen at one point or other.

I can make up something for (1). But it's probably best not to over-scenario-fy hotsos with very rare errors, as I said in (2).

mfoliveira commented 7 months ago

Hey @pponnuvel

Right, I'm aware. I meant the SCSI errors in the issue description, which the PR does not check for. You suggested the PR would address this issue, which shows both SCSI errors and the resulting blk_update_request errors, but the PR only checks for the blk_update_request errors and not the SCSI errors.

As a real world example, I'd take the line in the description which reads FAILED Result: hostbyte=* driverbyte=* from scsi_print_result() in drivers/scsi/scsi_logging.c