CTSRD-CHERI / cheribsd

FreeBSD adapted for CHERI-RISC-V and Arm Morello.
http://cheribsd.org
Other
169 stars 60 forks source link

More Morello SATA woes #1391

Open nwf-msr opened 2 years ago

nwf-msr commented 2 years ago

One of our boards just died mid-way through a pkg64 install and exclaimed, on its way down,

ahcich0: Timeout on slot 30 port 0
ahcich0: is 00000000 cs c0000000 ss 00000000 rs c0000000 tfd 50 serr 00000000 cmd 00719e17
(ada0:ahcich0:0:0:0): WRITE_DMA48. ACB: 35 00 28 a1 39 40 15 00 00 00 40 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
ahcich0: Timeout on slot 0 port 0
ahcich0: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd 150 serr 00000000 cmd 00718017
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Retrying command, 0 more tries remain
ahcich0: Timeout on slot 1 port 0
ahcich0: is 00000000 cs 00000002 ss 00000000 rs 00000002 tfd 150 serr 00000000 cmd 00718017
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: Timeout on slot 2 port 0
ahcich0: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd 150 serr 00000000 cmd 00718017
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retry was blocked
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WDS240G2G0A-00JH30 UF400400> s/n 214607A00DB1 detached
GEOM_ELI: g_eli_write_done() failed (error=6) ada0.eli[WRITE(offset=182322188288, length=16384)]
[GEOM_ELI sadness continues]

This is running with 2d16d483e9efb94c81dcd91dd50fb0ad1b926735 (just a few days stale) and so includes the NONCQ quirk from 2b294a1e5bbde22e5ce6628c3a86db97205c1aa7. I don't know if that's informative or just evidence of a cosmic ray striking at the wrong moment. (I seem to have left my SATA magic decoder ring in my other robe.)

The good news, I suppose, is that it seems to have cleared up on reboot?

rwatson commented 2 years ago

Tagging @bsdimp, @zxombie, @jrtc27.

rwatson commented 2 years ago

Probably relates to #1343.

bsdimp commented 2 years ago

This looks to be a hardware problem and/or subtle incompatibility with the ahci driver. Commands are timing out, including the most basic command identify. Assuming the hard disk's firmware isn't crashing (possible, but extremely unlikely), this looks to be an issue with command completions either not being recognized by the ahci controller, or interrupts aren't being delivered. It appears, however, that the timeout code for ahci will complete commands that didn't interrupt for some reason (though my reading could be wrong, I'm not super familiar with the low-level details of ahci).

nwf-msr commented 2 years ago

For whatever it's worth, after a long while of this not happening, one of our boards just spat out

ahcich0: Timeout on slot 15 port 0
ahcich0: is 00000000 cs 00018000 ss 00000000 rs 00018000 tfd 50 serr 00000000 cmd 00718f17
(ada0:ahcich0:0:0:0): WRITE_DMA48. ACB: 35 00 e8 5f f2 40 1a 00 00 00 30 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
ahcich0: Timeout on slot 17 port 0
ahcich0: is 00000000 cs 00020000 ss 00000000 rs 00020000 tfd 150 serr 00000000 cmd 00718017
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Retrying command, 0 more tries remain
ahcich0: Timeout on slot 18 port 0
ahcich0: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd 150 serr 00000000 cmd 00718017
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted

which looks similar to the above, but again, I'm not sure if the differences are significant or this is just another "something went wrong on the SATA bus, perhaps the device crashed".

This is with the 2022.05 release hybrid kernel.

arichardson commented 2 years ago

I also just saw this again for the first time in months when rsync'ing a large number of files to /opt (and it happened again when I retried the rsync after rebooting).

ahcich0: Timeout on slot 16 port 0
ahcich0: is 00000000 cs 00030000 ss 00000000 rs 00030000 tfd 50 serr 00000000 cmd 00719017
(ada0:ahcich0:0:0:0): WRITE_DMA48. ACB: 35 00 68 8a 3c 40 08 00 00 00 00 08
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
ahcich0: Timeout on slot 18 port 0
ahcich0: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd 150 serr 00000000 cmd 00718017
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Retrying command, 0 more tries remain
ahcich0: Timeout on slot 19 port 0
ahcich0: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 150 serr 00000000 cmd 00718017
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: Timeout on slot 20 port 0
ahcich0: is 00000000 cs 00100000 ss 00000000 rs 00100000 tfd 150 serr 00000000 cmd 00718017
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retry was blocked
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WDS240G2G0A-00JH30 UF510000> s/n 21374Y448403 detached
g_vfs_done(): ada0p2 converting all errors to ENXIO
g_vfs_done():ada0p2[WRITE(offset=70479282176, length=1048576)]error = 6 supressing further ENXIO
UFS: forcibly unmounting /dev/ada0p2 from /
ahcich0: Timeout on slot 21 port 0
ahcich0: is 00000000 cs 00200000 ss 00000000 rs 00200000 tfd 150 serr 00000000 cmd 00718017
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Retrying command, 0 more tries remain
ahcich0: Timeout on slot 22 port 0
ahcich0: is 00000000 cs 00400000 ss 00000000 rs 00400000 tfd 150 serr 00000000 cmd 00718017
(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: Poll timeout on slot 23 port 0
ahcich0: is 00000000 cs 00800000 ss 00000000 rs 00800000 tfd 150 serr 00000000 cmd 00718017
(aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: Timeout on slot 24 port 0
ahcich0: is 00000000 cs 03000000 ss 00000000 rs 03000000 tfd 150 serr 00000000 cmd 00718017
ahcich0:  ... waiting for slots 02000000
ahcich0: Timeout on slot 25 port 0
ahcich0: is 00000000 cs 03000000 ss 00000000 rs 03000000 tfd 150 serr 00000000 cmd 00718017
(ada0:ahcich0:0:0:0): WRITE_DMA48. ACB: 35 00 68 8a 3c 40 08 00 00 00 00 08
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 6, Periph was invalidated
(ada0:ahcich0:0:0:0): WRITE_DMA. ACB: ca 00 50 c9 08 40 00 00 00 00 12 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 6, Periph was invalidated
ahcich0: Poll timeout on slot 26 port 0
ahcich0: is 00000000 cs 04000000 ss 00000000 rs 04000000 tfd 150 serr 00000000 cmd 00718017
(aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
pid 841 (sshd), jid 0, uid 0: exited on signal 4
pid 843 (rsync), jid 0, uid 0: exited on signal 4
pid 797 (cron), jid 0, uid 0: exited on signal 4
pid 790 (sshd), jid 0, uid 0: exited on signal 4
pid 459 (dhclient), jid 0, uid 65: exited on signal 4
pid 381 (dhclient), uid (0): Failed to open coredump file 'dhclient.core', error=20
pid 381 (dhclient), jid 0, uid 0: exited on signal 4
pid 378 (dhclient), jid 0, uid 0: exited on signal 4
ahcich0: Timeout on slot 27 port 0
ahcich0: is 00000000 cs 08000000 ss 00000000 rs 08000000 tfd 150 serr 00000000 cmd 00718017
(ada0:ahcich0:0:0:0): SETFEATURES ENABLE RCACHE. ACB: ef aa 00 00 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 6, Periph was invalidated
ahcich0: Poll timeout on slot 28 port 0
ahcich0: is 00000000 cs 10000000 ss 00000000 rs 10000000 tfd 150 serr 00000000 cmd 00718017
(aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
ahcich0: Timeout on slot 29 port 0
ahcich0: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd 150 serr 00000000 cmd 00718017
(ada0:ahcich0:0:0:0): SETFEATURES ENABLE WCACHE. ACB: ef 02 00 00 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 6, Periph was invalidated
(ada0:ahcich0:0:0:0): Periph destroyed
ahcich0: Poll timeout on slot 30 port 0
ahcich0: is 00000000 cs 40000000 ss 00000000 rs 40000000 tfd 150 serr 00000000 cmd 00718017
(aprobe0:ahcich0:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
pid 827 (getty), uid (0): Failed to open coredump file 'getty.core', error=20
pid 827 (getty), jid 0, uid 0: exited on signal 4