Open nkshirsagar opened 1 year ago
Note that https://github.com/canonical/hotsos/pull/797 will catch all block I/O errors including the ones that result from SCSI errors.
I checked SF00361947 and it should be caught by #797 too. So I think capturing SCSI errors isn't really necessary anymore unless I am missing any situations where SCSI errors could occur (and we want to know that) but not trigger a "blk_update_request" error.
Hi @pponnuvel Checking this per your request in MM.
The PR checks for `blk_update_request` lines, which should be sufficient for most cases (but I'm afraid the heading regex might not match in some cases if customers change the language/locale, so month names might differ?). However, it would not catch SCSI errors that are generated for other reasons (e.g., from SCSI controller software that sends passthrough SCSI commands independently of the block layer), or cases where for some reason the SCSI commands do not directly generate errors in the block layer / `blk_update_request()`.
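To illustrate the locale concern, here is a minimal sketch (not the regex used in #797) that anchors on the kernel message itself rather than on the syslog timestamp, so a translated month name would not matter. The pattern, the helper name `find_blk_errors`, and the sample lines are all assumptions made up for illustration; the exact message text also varies between kernel versions.

```python
import re

# Assumed message shape: "blk_update_request: <kind>, dev <name>, sector <n>".
# Nothing before "blk_update_request" is matched, so the timestamp format
# (and its locale) is irrelevant.
BLK_ERROR = re.compile(
    r"blk_update_request: (?P<kind>[^,]+), dev (?P<dev>[^,\s]+), sector (?P<sector>\d+)"
)


def find_blk_errors(lines):
    """Return (dev, kind, sector) for each line carrying a block I/O error."""
    hits = []
    for line in lines:
        # search(), not match(): skip whatever prefix precedes the message.
        m = BLK_ERROR.search(line)
        if m:
            hits.append((m["dev"], m["kind"], int(m["sector"])))
    return hits


# Made-up sample lines, not taken from any sosreport.
SAMPLE_LINES = [
    "Mar  3 10:15:01 host kernel: [12345.678901] blk_update_request: I/O error, dev sdb, sector 2048",
    "mars  3 10:15:02 host kernel: blk_update_request: critical medium error, dev sdc, sector 4096 op 0x0:(READ)",
    "Mar  3 10:15:03 host kernel: EXT4-fs (sda1): mounted filesystem",
]
```

Calling `find_blk_errors(SAMPLE_LINES)` picks up the first two lines, including the one with a non-English month, and ignores the third.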
The driver-specific message might be worth checking as well, but that should be rarer than the generic kernel messages for SCSI/block I/O errors.
Thanks @mfoliveira for looking into this!
I am re-opening this to look into non-blk_update_request SCSI errors.
We'd need some real-world examples though because:
Can make up something for (1). But it's probably best to not over-scenario-fy hotsos with very rare errors as I said in (2).
Hey @pponnuvel
Right, I'm aware.
I meant the SCSI errors in the issue description that were not checked for in the PR, as you suggested the PR would address this issue (which shows SCSI errors, and the resulting `blk_update_request` errors), but it only checked for the `blk_update_request` errors, and not the SCSI errors.
As a real-world example, I'd take the line in the description which reads `FAILED Result: hostbyte=* driverbyte=*` from `scsi_print_result()` in `drivers/scsi/scsi_logging.c`.
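A check for that line could look roughly like the sketch below. This is only an illustration of the idea, not existing hotsos code: the pattern, the `parse_scsi_result` helper, and the sample line are assumptions, and the optional `[dev]` / `tag#N` prefix is a guess at what typically precedes the message.

```python
import re

# Hypothetical pattern for the "FAILED Result" line. The device name in
# brackets and the tag are treated as optional.
SCSI_RESULT = re.compile(
    r"(?:\[(?P<dev>[^\]]+)\] )?(?:tag#\d+ )?"
    r"FAILED Result: hostbyte=(?P<hostbyte>\S+) driverbyte=(?P<driverbyte>\S+)"
)


def parse_scsi_result(line):
    """Return a dict with dev/hostbyte/driverbyte, or None if the line does not match."""
    m = SCSI_RESULT.search(line)
    return m.groupdict() if m else None


# Made-up sample line, not taken from any sosreport.
SAMPLE = "sd 2:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE"
```

`parse_scsi_result(SAMPLE)` yields the device plus both status fields, which would be enough to report which disk and which kind of failure was seen.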
Sosreports on SF00361947 had kernel logs that were full of SCSI I/O errors and controller errors like these. hotsos should detect these. The BRCM errors indicate an underrun, i.e., 0x2000 bytes were expected by the device driver but 0x0 were received back from the controller.
The blk_update_request error is an indication of an I/O issue and should be flagged.
And these are SCSI errors, which should be reported by hotsos.
These I/O errors, where the kernel prints the CDB, should also be caught.
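For the CDB lines, a rough sketch of how the opcode name and the raw command bytes might be pulled out is shown here. The pattern, the `parse_cdb` helper, and the sample line are assumptions for illustration only; the optional colon after the opcode name is there because the separator may differ between kernel versions.

```python
import re

# Hypothetical pattern: "CDB: <opcode name>[:] <hex bytes...>" at end of line.
CDB_LINE = re.compile(
    r"CDB: (?P<op>.+?):? (?P<raw>[0-9a-f]{2}(?: [0-9a-f]{2})*)\s*$"
)


def parse_cdb(line):
    """Return (opcode_name, cdb_bytes), or None if the line has no CDB dump."""
    m = CDB_LINE.search(line)
    if m is None:
        return None
    # bytes.fromhex() ignores the spaces between the hex pairs.
    return m["op"], bytes.fromhex(m["raw"])


# Made-up sample line, not taken from any sosreport.
SAMPLE_CDB = "sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 00 00 08 00 00 00 08 00"
```

The first byte of the returned CDB is the SCSI opcode (0x28 for READ(10) in the sample), so a check could group errors by operation type if that turns out to be useful.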