Open rwatson opened 2 years ago
Original device-driver attachment information:
ahci0: <Marvell 88SE9170 AHCI SATA controller> port 0x1018-0x101f,0x1024-0x1027,0x1010-0x1017,0x1020-0x1023,0x1000-0x100f mem 0x60200000-0x602001ff at device 0.0 on pci6
ahci0: AHCI v1.00 with 2 6Gbps ports, Port Multiplier supported with FBS
ahci0: quirks=0x1000000<IOMMU_BUSWIDE>
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
...
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WDS240G2G0A-00JH30 UF400400> ACS-3 ATA SATA 3.x device
ada0: Serial Number 214607A017B6
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada0: Command Queueing enabled
ada0: 228936MB (468862128 512 byte sectors)
ada0: quirks=0x3<4K,NCQ_TRIM_BROKEN>
(This is, presumably, two different problems: One involving AHCI failure issues, and the second involving an undesirable panic. I am interested primarily in the former currently. It's not obvious if this is a hardware or device-driver issue ..)
@bsdimp: @jrtc27 tells me you may recently have pursued some problems along these lines ..?
In a nutshell, what's happening...
We get a timeout for the drive, we decide (for reasons I still don't know) that the drive is AFU and we should drop it... a short (very short?) time later it's fine and we re-add it. At this point, nothing good can happen. All the filesystems that were on it are unmounted (or at least disconnected) and we sometimes, like this panic shows, wind up doing the reattach of the device from a thread that can't do a sleep (normally we queue this).
I've long wanted to add a small debounce to AHCI's handling of drives that timeout to give them up to a minute to return before we actually invalidate the drive and reattach. I've not had a chance to make progress on this because it's a low priority for me and my employer.
Full dmesg:
---<<BOOT>>---
GDB: no debug ports present
KDB: debugger backends: ddb
KDB: current backend: ddb
Copyright 2011-2022 University of Cambridge.
Copyright 2012-2022 SRI International.
Copyright (c) 1992-2022 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 14.0-CURRENT #0 bsdinstall_simplification-0e6bc1d1ecb: Sun Apr 10 09:41:43 BST 2022
rnw24@zeno.sec.cl.cam.ac.uk:/home/rnw24/cheri/build/cheribsd-morello-purecap-build/home/rnw24/cheri/cheribsd/arm64.aarch64c/sys/GENERIC-MORELLO arm64
clang version 13.0.0 (https://git.morello-project.org/morello/llvm-project.git 4215d7472c49b03bbf2b3fdcd89c954451caa223)
WARNING: WITNESS option enabled, expect reduced performance.
WARNING: Per-thread SSP is enabled but the compiler is too old to support it
CHERI hybrid kernel.
VT: init without driver.
module firmware already present!
real memory = 17160749056 (16365 MB)
avail memory = 16706228224 (15932 MB)
Starting CPU 1 (100)
Starting CPU 2 (10000)
Starting CPU 3 (10100)
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
random: unblocking device.
random: entropy device external interface
MAP f9560000 mode 2 pages 80
MAP f95b0000 mode 2 pages 80
MAP f9600000 mode 2 pages 80
MAP f9650000 mode 2 pages 80
MAP f96a0000 mode 2 pages 80
MAP f96f0000 mode 2 pages 80
MAP f9750000 mode 2 pages 80
MAP f97a0000 mode 2 pages 96
MAP f9800000 mode 2 pages 80
MAP f9850000 mode 2 pages 80
MAP f98a0000 mode 2 pages 80
MAP f98f0000 mode 2 pages 80
MAP fe960000 mode 2 pages 80
MAP 1c100000 mode 0 pages 1
kbd0 at kbdmux0
acpi0: <ARMLTD ARMLFACP>
acpi0: Power Button (fixed)
acpi0: Sleep Button (fixed)
acpi0: Could not update all GPEs: AE_NOT_CONFIGURED
psci0: <ARM Power State Co-ordination Interface Driver> on acpi0
gic0: <ARM Generic Interrupt Controller v3.0> iomem 0x30000000-0x3001ffff,0x300c0000-0x301bffff on acpi0
its0: <ARM GIC Interrupt Translation Service> on gic0
its1: <ARM GIC Interrupt Translation Service> on gic0
its2: <ARM GIC Interrupt Translation Service> on gic0
its3: <ARM GIC Interrupt Translation Service> on gic0
generic_timer0: <ARM Generic Timer> irq 2,3,4 on acpi0
Timecounter "ARM MPCore Timecounter" frequency 50000000 Hz quality 1000
Event timer "ARM MPCore Eventtimer" frequency 50000000 Hz quality 1000
efirtc0: <EFI Realtime Clock>
efirtc0: registered as a time-of-day clock, resolution 1.000000s
pmu0: <Performance Monitoring Unit> on acpi0
cpu0: <ACPI CPU> on acpi0
uart0: <PrimeCell UART (PL011)> iomem 0x2a400000-0x2a400fff irq 0 on acpi0
uart0: console (115200,n,8,1)
uart1: <PrimeCell UART (PL011)> iomem 0x1c0a0000-0x1c0a0fff irq 1 on acpi0
pcib0: <Generic PCI host controller> on acpi0
pci0: <PCI bus> on pcib0
pcib1: <PCI-PCI bridge> at device 0.0 on pci0
pci1: <PCI bus> on pcib1
pcib2: <PCI-PCI bridge> mem 0x60300000-0x6033ffff at device 0.0 on pci1
pci2: <PCI bus> on pcib2
pcib3: <PCI-PCI bridge> at device 8.0 on pci2
pci3: <PCI bus> on pcib3
pcib4: <PCI-PCI bridge> at device 16.0 on pci2
pci4: <PCI bus> on pcib4
pcib5: <PCI-PCI bridge> at device 17.0 on pci2
pci5: <PCI bus> on pcib5
pcib6: <PCI-PCI bridge> at device 18.0 on pci2
pci6: <PCI bus> on pcib6
ahci0: <Marvell 88SE9170 AHCI SATA controller> port 0x1018-0x101f,0x1024-0x1027,0x1010-0x1017,0x1020-0x1023,0x1000-0x100f mem 0x60200000-0x602001ff at device 0.0 on pci6
ahci0: AHCI v1.00 with 2 6Gbps ports, Port Multiplier supported with FBS
ahci0: quirks=0x1000000<IOMMU_BUSWIDE>
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
pcib7: <PCI-PCI bridge> at device 19.0 on pci2
pci7: <PCI bus> on pcib7
re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> mem 0x60100000-0x60100fff,0x900000000-0x900003fff at device 0.0 on pci7
re0: Using 1 MSI-X message
re0: turning off MSI enable bit.
re0: Chip rev. 0x4c000000
re0: MAC rev. 0x00000000
miibus0: <MII bus> on re0
rgephy0: <RTL8251/8153 1000BASE-T media interface> PHY 1 on miibus0
rgephy0: none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Using defaults for TSO: 65518/35/2048
re0: Ethernet address: 00:02:f7:00:97:15
pcib8: <PCI-PCI bridge> at device 20.0 on pci2
pci8: <PCI bus> on pcib8
xhci0: <XHCI (generic) USB 3.0 controller> mem 0x60000000-0x6000ffff,0x60010000-0x60011fff at device 0.0 on pci8
xhci0: 64 bytes context size, 64-bit DMA
usbus0 on xhci0
pcib9: <Generic PCI host controller> on acpi0
pci9: <PCI bus> on pcib9
armv8crypto0: <AES-CBC,AES-XTS,AES-GCM>
Timecounters tick every 1.000 msec
CPU 0: Research Morello SoC r0p0 affinity: 0 0 0
Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG,IDC,DIC>
Instruction Set Attributes 0 = <DP,RDM,Atomic,CRC32,SHA2,SHA1,AES+PMULL>
Instruction Set Attributes 1 = <RCPC-8.3,DCPoP>
Processor Features 0 = <CSV3,CSV2,RAS,GIC,AdvSIMD+HP,FP+HP,EL3,EL2,EL1,EL0>
Processor Features 1 = <PSTATE.SSBS MSR,Morello>
Memory Model Features 0 = <TGran4,TGran64,TGran16,SNSMem,16bit ASID,256TB PA>
Memory Model Features 1 = <XNX,PAN+ATS1E1,LO,HPD+TTPBHA,VH,16bit VMID,HAF+DS>
Memory Model Features 2 = <EVT-8.2,32bit CCIDX,48bit VA,IESB,UAO,CnP>
Debug Features 0 = <DoubleLock,SPE,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3 v8.1,Debugv8.2>
Debug Features 1 = <>
Auxiliary Features 0 = <>
Auxiliary Features 1 = <>
CPU 1: Research Morello SoC r0p0 affinity: 0 1 0
CPU 2: Research Morello SoC r0p0 affinity: 1 0 0
CPU 3: Research Morello SoC r0p0 affinity: 1 1 0
Release APs...usbus0: 5.0Gbps Super Speed USB v3.0
done
WARNING: WITNESS option enabled, expect reduced performance.
Trying to mount root from ufs:/dev/ada0p2 [rw]...
WARNING: Per-thread SSP is enabled but the compiler is too old to support it
ugen0.1: <(0x104c) XHCI root HUB> at usbus0
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WDS240G2G0A-00JH30 UF400400> ACS-3 ATA SATA 3.x device
ada0: Serial Number 214607A017B6
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada0: Command Queueing enabled
ada0: 228936MB (468862128 512 byte sectors)
ada0: quirks=0x3<4K,NCQ_TRIM_BROKEN>
uhub0 on usbus0
uhub0: <(0x104c) XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
uhub0: 8 ports with 8 removable, self powered
lo0: link state changed to UP
re0: link state changed to DOWN
re0: link state changed to UP
Security policy loaded: MAC/ntpd (mac_ntpd)
So, threading through the code, we're invalidating the periph for some reason:
We only schedule a restart of the probe when we're scanning a lun and the periph is marked as invalid as far as I can tell.
So I remain confused as to how we come to generating the AC_LOST_DEVICE event that I think is kicking this sequence off.
Tagging @markjdb, @jrtc27, @bukinr who have been opining on this issue also.
I can give patches to the kernel to provide more data if someone can recreate it.
The malloc from a bad place thing is about 4 things downstream of the cause, btw. Even if we fixed it, we'd still have the periph departing and arriving (from CAM's point of view) quickly in a non-hot-plug situation (the only path I can find to get us here goes through code that was put in for hotplug to defer the async call, which is the wrong context to do this stuff in...).
So, the command that's timing out is a NCQ command, and a bunch of the Marvell AHCI controllers (not including this one) have a AHCI_Q_NONCQ quirk set. Should we try setting that quirk for this controller as well, so as to narrow things down a bit?
It might help (it wouldn't hurt to try it). If that's it, it would fix the root cause...
But the problem is we're generating a AC_LOST_DEVICE event (where I'm still trying to find) and then getting an AC_FOUND_DEVICE event and the two aren't playing well together (we shouldn't generate the AC_LOST_DEVICE event in the first place until we're done some kind of recovery phase, and that's a problem I've hit at work and could work on if there's an easy way to trigger it)
I can give patches to the kernel to provide more data if someone can recreate it.
This panic is very reproducible for me -- it happens (pretty much) whenever I have bsdinstall install ports.txz from the USB installer image. So I’m happy to try patches.
This panic is very reproducible for me
Maybe you should just send me the system it happens on (wink wink wink).
So I’m happy to try patches.
Well, patches likely would be quicker since I have a backlog of hardware installations as it is right now. I'll be in touch tomorrow with them I hope.
If you look at the patch, much could be done with dtrace if that's easier. xpt_async(), cam_periph_invalidate() and cam_periph_alloc() could be traces, with their arguments as well. I don't know if it's available in this target environment (maybe via the emergency holographic shell?). I think the following dtrace script might be helpful, but my system is failing to initialize it, so clearly I have some kind of pilot error...
#!/usr/sbin/dtrace -s
#pragma D option quiet
dtrace:::BEGIN
{
printf("Tracing... Hit Ctrl-C to end.\n");
}
fbt::xpt_async:entry
{
printf("\nxpt_async(%x)\n", arg0);
stack();
}
fbt::cam_periph_invalidate:entry
{
printf("\ncam_periph_invalidate(periph=%p)\n", arg0);
stack();
}
fbt::cam_periph_alloc:entry
{
printf("\ncam_periph_alloc(%s, path=%p)\n", stringof(arg4), arg6);
stack();
}
May be enough, though I may need to add data to cam_periph_invalidate ...
#pragma D option quiet
dtrace:::BEGIN
{
printf("Tracing... Hit Ctrl-C to end.\n");
}
fbt::xpt_async:entry
{
printf("\nxpt_async(code=0x%x)\n", arg0);
stack();
}
fbt::adaasync:entry
{
printf("\nadaasync(%p, code=0x%x, path=%p)\n", arg0, arg1, arg2);
stack();
}
fbt::daasync:entry
{
printf("\ndaasync(%p, code=0x%x, path=%p)\n", arg0, arg1, arg2);
stack();
}
fbt::passasync:entry
{
printf("\npassasync(%p, code=0x%x, path=%p)\n", arg0, arg1, arg2);
stack();
}
fbt::cam_periph_invalidate:entry
{
printf("\ncam_periph_invalidate(periph=%p, name=%s%d, flags=0x%x)\n", arg0, stringof(args[0]->periph_name), args[0]->unit_number, args[0]->flags);
stack();
}
fbt::xpt_async_process:entry
{
printf("\nxpt_async_process(periph=%p, name=%s%d)\n", arg0, stringof(args[0]->periph_name), args[0]->unit_number);
stack();
}
fbt::cam_periph_alloc:entry
{
printf("\ncam_periph_alloc(%s, path=%p)\n", stringof(arg4), arg6);
stack();
}
Is a more refined thing and my best guess at the moment at what would help me find this. I refined it by tracing through a few different eject of flash drive scenarios... I'll wait to hear from you on this...
Unfortunately, we don’t yet have DTrace support in CheriBSD for CHERI-enabled kernels. But if I can reproduce this with a non-CHERI kernel, I can certainly give this a try.
cheribuild cheribsd-release-aarch64
should be a working thing for that case
So any word either on the patches I pointed people at (to use a modified installer) or the dtrace script I posted (less info, but maybe enough, but no mods)?
So any word either on the patches I pointed people at (to use a modified installer) or the dtrace script I posted (less info, but maybe enough, but no mods)?
I've been running a kernel with the printf patches for the past few days but haven't run into the panic yet.
4-day holiday weekend here just wrapping up, but I've kicked off a build of the baseline without your patches to confirm all is sad tomorrow morning before trying the patched version.
I'm running this one for a month with no issues on Morello:
=== START OF INFORMATION SECTION === Model Family: Samsung based SSDs Device Model: Samsung SSD 870 EVO 4TB Serial Number: S6BCNJ0RA04151P LU WWN Device Id: 5 002538 f31a14521 Firmware Version: SVT01B6Q User Capacity: 4,000,787,030,016 bytes [4.00 TB] Sector Size: 512 bytes logical/physical Rotation Rate: Solid State Device Form Factor: 2.5 inches TRIM Command: Available, deterministic, zeroed Device is: In smartctl database 7.3/5319 ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5 SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Tue Apr 19 21:36:01 2022 BST SMART support is: Available - device has SMART capability. SMART support is: Enabled
You've likely just not been hammering it enough. I've been using the exact same SSD model and a board from the same batch as yours but was the first to hit the AHCI/SATA issues.
@bsdimp I've just hit the panic while rsync'ing lots of files over to the board, here's the output from you patch:
WARNING: WITNESS option enabled, expect reduced performance.
(aprobe0:ahcich0:0:0:0): Clearing unconfigured and sending FOUND_DEVICE 1
Trying to mount root from ufs:/dev/ada0p2 [rw]...
dWARNING: Per-thread SSP is enabled but the compiler is too old to support it
bKDB: stack backtrace:
_tdbrace_self
_db_trace_setlf_wrapracep_eself() art () at db_db_trace_setlf_wrapper+race_self
0x30
dcam_pb_trace_self_wrapperiph_invalidateer() at () adb_trace_self_wrapper+0t x30
cam_aperiph_invaprobedone() lidatae+0xbt8
aprobedaone+0xc08
probedone()x pt_done_pat rocess() aaprobedonet +0x36c
xpxptt_done_p_done_procesrs+0x364
oxpt_donegcu_intr: reg 10
_direct()c eat sxpt_sdone_direct+(0x48
)ah at ci_ch_intr_direct()xpt_done_pro at cess+0xa364
hci_ch_intr_directxpt_done_td+(0)x at c4
xpt_done_tahci_intr() at d+0xd8
ahci_intr+0xfco8r
k_exiit() at thrfeork_exita+0x74
d_loofork_trampop() at line(ithread_loop+0x2a0) at
fofork_exrk_trampolinit() at e+0fx10
or(aprobe1:ahcich1:0:0:0): Setting SSQ_LOST on CAM_DEV_NOT_THERE
k_exit+0x(aprobe1:ahcich1:0:0:ffffffff): sending LOST DEVICE
74
(aprobe1:ahcich1:0:0:0): Periph invalidated
foKDB: stack backtrace:
rk_trampolidne() at b_ftrace_self(ork_trampoli)ne+0x10
(aprobe0:ahcich0:0:0:0): Periph invalidated
aKDB: stack backtrace:
t db_trace_self() at db_db_trace_setlf
race_sedb_trace_sellf_wrapper(f) at
db_trace_self_wrapper+0x30d
bc_am_periptrace_selh_invalidatef_wrapper()() at at cadm_periph_invalidate+0xb_tracbe8_se
lf_wrapper+0x30
aprobedone() atcam_p eriph_invalaprobedone+0x36c
idatex() at pt_dcam_periph_invalidate+0oxne_process() at b8
xaprobedone()p at t_doneaprobedone+0_x36c
process+0x364
xxpt_done_td() at pt_done_process()x at pxpt_dot_done_td+0xnd8
e_procfork_exit() ess+0x364
at xpt_done_fork_exit+0direct() at x74
xpfork_trampoline(t) at _donefork_trampoline+0x10
_direct+0x48
ahci_ch_intr_direct() at ahci_ch_intr_direct+0xc4
ahci_intr() at ahci_intr+0xc8
ithread_loop() at ithread_loop+0x2a0
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x10
cam_periph_alloc: attempt to re-allocate valid device pass0 rejected flags 0 refcount 1
passasync: Unable to attach new device due to status 0x6: CCB request was invalid
(noperiph:ahcich0:0:0:0): FOUND DEVICE
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WDS240G2G0A-00JH30 UF510000> ACS-2 ATA SATA 3.x device
ada0: Serial Number 21374Y448403
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada0: Command Queueing enabled
ada0: 228944MB (468877312 512 byte sectors)
ada0: quirks=0x3<4K,NCQ_TRIM_BROKEN>
uhub0: 8 ports with 8 removable, self powered
....
....
root@sauerkirsche:~ # ahcich0: Timeout on slot 4 port 0
ahcich0: is 00000000 cs 00000000 ss ffffffff rs ffffffff tfd 40 serr 00000000 cmd 00710317
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 e8 75 9e 40 1a 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
(aprobe0:ahcich0:0:0:0): Setting SSQ_LOST on CAM_DEV_NOT_THERE
(aprobe0:ahcich0:0:0:ffffffff): sending LOST DEVICE
Setting unconfigured and sending LOST_DEVICE
(aprobe1:ahcich0:0:15:0): Setting SSQ_LOST on CAM_DEV_NOT_THERE
KDB: stack backtrace:
(aprobe1:ahcich0:0:15:ffffffff): sending LOST DEVICE
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
ata_dev_async() at ata_dev_async+0x12c
xpt_async_process_dev() at xpt_async_process_dev+0x16c
xptdevicetraverse() at xptdevicetraverse+0x9c
xpttargettraverse() at xpttargettraverse+0x78
xpt_async_process() at xpt_async_process+0x358
xpt_done_process() at xpt_done_process+0x364
xpt_async_td() at xpt_async_td+0xc4
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x10
(noperiph:ahcich0:0:0:ffffffff): AC LOST cam_periph_async zapping it
(ada0:ahcich0:0:0:0): Periph invalidated
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
cam_periph_invalidate() at cam_periph_invalidate+0xb8
adaasync() at adaasync+0x2a4
xpt_async_process_dev() at xpt_async_process_dev+0x194
xptdevicetraverse() at xptdevicetraverse+0x9c
xpttargettraverse() at xpttargettraverse+0x78
xpt_async_process() at xpt_async_process+0x358
xpt_done_process() at xpt_done_process+0x364
xpt_async_td() at xpt_async_td+0xc4
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x10
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WDS240G2G0A-00JH30 UF510000> s/n 21374Y448403 detached
g_vfs_done():ada0p2[WRITE(offset=195387015168, length=4096)]error = 6
UFS: forcibly unmounting /dev/ada0p2 from /
g_vfs_done():ada0p2[WRITE(offset=195387043840, length=4096)]error = 6
...
lots more of those lines
...
g_vfs_done():ada0p2[WRITE(offset=24985600000, length=32768)]error = 6
(noperiph:ahcich0:0:0:ffffffff): AC LOST cam_periph_async zapping it
(pass0:ahcich0:0:0:0): Periph invalidated
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
cam_periph_invalidate() at cam_periph_invalidate+0xb8
xpt_async_process_dev() at xpt_async_process_dev+0x194
xptdevicetraverse() at xptdevicetraverse+0x9c
xpttargettraverse() at xpttargettraverse+0x78
xpt_async_process() at xpt_async_process+0x358
xpt_done_process() at xpt_done_process+0x364
xpt_async_td() at xpt_async_td+0xc4
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x10
g_vfs_done():ada0p2[WRITE(offset=228888219648, length=4096)]err(aprobe1:ahcich0:0:15:0): Periph invalidated
or = 6
KDB: stack backtrace:
(aprobe0:ahcich0:0:0:0): Clearing unconfigured and sending FOUND_DEVICE 1
db_trace_self() at KDB: stack backtrace:
dbdb_trace_s_elf() at trace_selfdb_trace_self
db_trace_dsb_tracelfe_wrapper() _self_wrappat er() at ddb_trace_self_wrapperb_trace_self_wrapper++00xx30
30a
pcam_periph_invalidate()r at obedone() at cam_periph_invaliaprobedone+0xc08
date+xpt_do0ne_process() at xb8
xpt_donaeprobedone() at _proceass+0x364pr
obedone+0xx3p6c
t_doxne_direct()p at t_donexpt_done_direct+0x48
_ahci_ch_intrprocess() a_direct() att xpt_doneahc_process+i0x364
_ch_intr_directx+0xc4
pathci_intr() at _done_td() at ahci_inxpt_done_td+0txd8
r+0xfork_exit() c8
at ithrfork_exit+0xead_loop() 74
atfork_t rampoline() at ithread_loop+fork_0x2a0
trforampoline+0x10
k_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x10
(aprobe0:ahcich0:0:0:0): Periph invalidated
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
cam_periph_invalidate() at cam_periph_invalidate+0xb8
aprobedone() at aprobedone+0x36c
xpt_done_process() at xpt_done_process+0x364
xpt_done_direct() at xpt_done_direct+0x48
ahci_ch_intr_direct() at ahci_ch_intr_direct+0xc4
ahci_intr() at ahci_intr+0xc8
ithread_loop() at ithread_loop+0x2a0
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x10
(noperiph:ahcich0:0:0:0): FOUND DEVICE
(ada0:ahcich0:0:0:0): Hit the hot plug path with ac = 0x80
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
cam_periph_alloc() at cam_periph_alloc+0x8c
adaasync() at adaasync+0xb8
xpt_async_process_dev() at xpt_async_process_dev+0x194
xpt_async_process() at xpt_async_process+0x3a4
xpt_done_process() at xpt_done_process+0x364
xpt_async_td() at xpt_async_td+0xc4
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x10
(aprobe0:ahcich0:0:0:0): not changed
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
aprobedone() at aprobedone+0x1a14
xpt_done_process() at xpt_done_process+0x364
xpt_done_direct() at xpt_done_direct+0x48
ahci_ch_intr_direct() at ahci_ch_intr_direct+0xc4
ahci_intr() at ahci_intr+0xc8
ithread_loop() at ithread_loop+0x2a0
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x10
(aprobe0:ahcich0:0:0:0): Periph invalidated
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
cam_periph_invalidate() at cam_periph_invalidate+0xb8
aprobedone() at aprobedone+0x36c
xpt_done_process() at xpt_done_process+0x364
xpt_done_direct() at xpt_done_direct+0x48
ahci_ch_intr_direct() at ahci_ch_intr_direct+0xc4
ahci_intr() at ahci_intr+0xc8
ithread_loop() at ithread_loop+0x2a0
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x10
(ada0:ahcich0:0:0:0): Periph destroyed
(ada0:ahcich0:0:0:0): Somebody set us up the new periph flag
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
cam_periph_release_locked_buses() at cam_periph_release_locked_buses+0x234
cam_periph_release_locked() at cam_periph_release_locked+0x1c
xpt_done_process() at xpt_done_process+0x364
xpt_done_direct() at xpt_done_direct+0x48
ahci_ch_intr_direct() at ahci_ch_intr_direct+0xc4
ahci_intr() at ahci_intr+0xc8
ithread_loop() at ithread_loop+0x2a0
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x10
(ada0:ahcich0:0:0:0): FOUND DEVICE
panic: malloc(M_WAITOK) with sleeping prohibited
cpuid = 3
time = 1650403684
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x174
panic() at panic+0x70
malloc_dbg() at malloc_dbg+0xf8
malloc() at malloc+0x2c
drm_atomic_state_alloc() at drm_atomic_state_alloc+0x3c
restore_fbdev_mode_atomic() at restore_fbdev_mode_atomic+0x40
drm_fb_helper_restore_fbdev_mode_unlocked() at drm_fb_helper_restore_fbdev_mode_unlocked+0x7c
vt_kms_postswitch() at vt_kms_postswitch+0x58
vt_window_switch() at vt_window_switch+0x1b4
vtterm_cngrab() at vtterm_cngrab+0x34
cngrab() at cngrab+0x28
vpanic() at vpanic+0xe4
panic() at panic+0x70
malloc_dbg() at malloc_dbg+0xf8
malloc() at malloc+0x2c
disk_alloc() at disk_alloc+0x1c
adaregister() at adaregister+0x264
cam_periph_alloc() at cam_periph_alloc+0x5a4
adaasync() at adaasync+0xb8
cam_periph_release_locked_buses() at cam_periph_release_locked_buses+0x2cc
cam_periph_release_locked() at cam_periph_release_locked+0x1c
xpt_done_process() at xpt_done_process+0x364
xpt_done_direct() at xpt_done_direct+0x48
ahci_ch_intr_direct() at ahci_ch_intr_direct+0xc4
ahci_intr() at ahci_intr+0xc8
ithread_loop() at ithread_loop+0x2a0
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x10
KDB: enter: panic
[ thread pid 12 tid 100066 ]
Stopped at kdb_enter+0x44: undefined f907811f
db>
@arichardson Thanks! the panic at the end, I understand, and I think I'm starting to understand the rest. It will take me some time to thread through the code with this data, but it's exactly the information that I need. We get a timeout, the device isn't there when the wheels come off the go cart when we try to retry the I/O. Why we get the timeout is beyond my pay grade (it's likely some bad silicon that can't handle the load, but w/o that silicon or access to its errata, I'm not sure I can help). What I think I can do is to make the 'recovery from the error' work better so you will still get the timeout, but our recovery code will defer retrying the I/O until we've completed a reset of the AHCI channel because otherwise it isn't there and we get errors leading to tearing down the device, only to discover it again a few tens of ms later leading to this panic.
diff --git a/FreeBSD/sys/cam/ata/ata_da.c b/FreeBSD/sys/cam/ata/ata_da.c
index d1bda462815b..adf175fd0eea 100644
--- a/FreeBSD/sys/cam/ata/ata_da.c
+++ b/FreeBSD/sys/cam/ata/ata_da.c
@@ -2874,7 +2874,7 @@ adadone(struct cam_periph *periph, union ccb *done_ccb)
cam_periph_lock(periph);
bp = (struct bio *)done_ccb->ccb_h.ccb_bp;
if ((done_ccb->ccb_h.status & CAM_STATUS_MASK) != CAM_REQ_CMP) {
- error = adaerror(done_ccb, 0, 0);
+ error = adaerror(done_ccb, CAM_RETRY_SELTO, 0);
if (error == ERESTART) {
/* A retry was scheduled, so just return. */
cam_periph_unlock(periph);
Might fix ths problem. It appears that we're not retrying selection timeouts, which is what a timeout returns. Without this, we treat any timeout as a lost device (at least that's my first blush at tracing through things based on the debug messages). With this, we should, if we have retries, try them again.
Now, WHY this timeouts out, I'm not sure I can say. But if you could try this, see if you can trigger a timeout and see if we recover correctly. that would at least allow us you to recover when this happens better than the panic.
I wonder if the 'timeout' may be due to a missing interrupt connected to the marvel controller and if that' possibility has been explored and/or eliminated already.
Applying the patch in #1359, I am now able to install ports.txz and src.txz from bsdinstall 4+ times in a row without experiencing a drive detachment. No information as to whether NCQ is the problem itself, or whether it’s simply that NCQ is required to achieve I/O rates necessary to trigger a non-NCQ-related problem. Thus far this seems a reasonable workaround for the next CheriBSD release, but I will continue testing.
Cool the quirk worked. I've found other ways to simulate the situation so I'm going to stop working on this
Should we also cherry-pick https://github.com/freebsd/freebsd-src/commit/6c8ab086fed37a6b44fa84377e48c499f223ae80 or just wait until the next upstream merge?
Should we also cherry-pick freebsd/freebsd-src@6c8ab08 or just wait until the next upstream merge?
Unless you are seeing the panic where we're allocating a new device with the NCQ disabling workaround, there's little point.
During installation of ports.txz, which is rife full of directories and files, I experienced the attached failure, in which ada0 detached following a timeout. This leads to an eventual panic due to apparently poor handling of peripheral destruction (M_WAITOK malloc from an inappropriate thread).