geohot closed this issue 3 months ago
This crash isn't particularly interesting; it's a ring buffer overrun. You can decide not to validate input and let the SDMA block crash, and that might not even be a bad decision.
You can fix this by raising the ring size to 0x4000 or more on line 408 of ops_kfd.py:
self.sdma_ring = self._gpu_alloc(0x1000, kfd.KFD_IOC_ALLOC_MEM_FLAGS_USERPTR, uncached=True)
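For readers unfamiliar with the failure mode, here is a minimal sketch of a ring buffer overrun and the kind of input validation that is being skipped. `CommandRing` and all of its methods are hypothetical names made up for illustration; this is not tinygrad's SDMA queue code and says nothing about how the SDMA firmware handles its ring.

```python
class CommandRing:
    """Toy fixed-size command ring (hypothetical, for illustration only)."""

    def __init__(self, size: int):
        self.size = size
        self.buf = bytearray(size)
        self.wptr = 0  # total bytes ever written by the producer
        self.rptr = 0  # total bytes ever consumed by the consumer

    def free_space(self) -> int:
        # Bytes that can be written without overwriting unconsumed data.
        return self.size - (self.wptr - self.rptr)

    def submit(self, cmd: bytes) -> None:
        # This check is the "validate input" step. Without it, a large
        # submission wraps around and overwrites commands not yet consumed.
        if len(cmd) > self.free_space():
            raise OverflowError(
                f"need {len(cmd)} bytes, only {self.free_space()} free in ring of {self.size}"
            )
        for b in cmd:
            self.buf[self.wptr % self.size] = b
            self.wptr += 1

    def consume(self, n: int) -> bytes:
        # Read up to n pending bytes and advance the read pointer.
        n = min(n, self.wptr - self.rptr)
        out = bytes(self.buf[(self.rptr + i) % self.size] for i in range(n))
        self.rptr += n
        return out
```

In this toy model, a 0x1000-byte ring rejects any burst larger than its free space, while a 0x4000-byte ring accepts the same burst. That is the sense in which a larger ring avoids the overrun without adding validation.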
What I'm much more interested in is getting the recovery behavior fixed. The first time, you get an SDMA crash and the GPU "recovers". The next time, you get an MES crash, even if you have fixed the ring size. The third time it seems to have recovered, except that dmesg is spammed with
[355179.718600] [drm] Skip scheduling IBs!
[355179.718675] [drm] Skip scheduling IBs!
[355179.718684] [drm] Skip scheduling IBs!
[355179.718692] [drm] Skip scheduling IBs!
[355179.718700] [drm] Skip scheduling IBs!
[355179.718708] [drm] Skip scheduling IBs!
[355179.718716] [drm] Skip scheduling IBs!
[355179.718723] [drm] Skip scheduling IBs!
from that point forward.
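To quantify the spam when reporting it, a small script along these lines could count the messages and give the time span they cover. This is only a sketch: `summarize_skips` is a hypothetical helper, and it assumes the dmesg line format shown above (a bracketed timestamp in seconds followed by the `[drm]` tag).

```python
import re

# Matches lines such as "[355179.718600] [drm] Skip scheduling IBs!"
SKIP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+\[drm\] Skip scheduling IBs!")

def summarize_skips(dmesg_text: str):
    """Return (count, first_timestamp, last_timestamp) for the skip messages."""
    stamps = []
    for line in dmesg_text.splitlines():
        m = SKIP_RE.match(line.strip())
        if m:
            stamps.append(float(m.group(1)))
    if not stamps:
        return 0, None, None
    return len(stamps), stamps[0], stamps[-1]
```

The text passed in could be captured dmesg output, for example from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`, which may need root depending on the system's settings.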
yeah makes sense, thanks man
As a meta point, the amount I'll engage with this depends on the level of technical responses I get. This is my test issue.
If it's all a black box, I won't report more. But if someone who understands the design choices made in the firmware is engaging here on a technical level, I'll engage. I'll put in effort if I see effort reciprocated, but not if I keep just hearing "it'll be fixed in the next release".
Was the design intention for the firmware to crash on invalid queues?
Well stated.
I didn't see sdma hang but my cards are W7900 Pros. Let me get a machine with 2x7900XTXs to reproduce and get some FW people involved.
I'm on the ROCm 6.0.3 beta btw; I don't know if SDMA was changed (the firmware looks to be the same 0x13 as in 6.0.0).
tiny@tiny5:/sys/kernel$ sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 29, firmware version: 0x0000080c
PFP feature version: 29, firmware version: 0x0000083e
CE feature version: 0, firmware version: 0x00000000
RLC feature version: 1, firmware version: 0x00000076
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
RLCP feature version: 1, firmware version: 0x00000019
RLCV feature version: 1, firmware version: 0x00000022
MEC feature version: 29, firmware version: 0x00000834
IMU feature version: 0, firmware version: 0x0b1f4b00
SOS feature version: 3211312, firmware version: 0x00310030
ASD feature version: 553648315, firmware version: 0x210000bb
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x1b000205
TA HDCP feature version: 0x00000000, firmware version: 0x1700003a
TA DTM feature version: 0x00000000, firmware version: 0x12000015
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, program: 0, firmware version: 0x004e7c00 (78.124.0)
SDMA0 feature version: 60, firmware version: 0x00000013
SDMA1 feature version: 60, firmware version: 0x00000013
VCN feature version: 0, firmware version: 0x05110006
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x07002100
TOC feature version: 12, firmware version: 0x0000000c
MES_KIQ feature version: 6, firmware version: 0x00000075
MES feature version: 1, firmware version: 0x00000057
VBIOS version: 113-31XFSHBS1-L02
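Comparing firmware versions across ROCm releases by eye is error-prone, so a short parser for this dump can help. The sketch below assumes the line format shown above; `parse_firmware_info` is a hypothetical helper, not part of tinygrad or any AMD tool.

```python
import re

# Matches e.g. "SDMA0 feature version: 60, firmware version: 0x00000013"
# and also the SMC line, which has an extra "program: 0," field in the middle.
FW_RE = re.compile(
    r"^(?P<name>.+?) feature version: [^,]+,.*firmware version: (?P<fw>0x[0-9a-fA-F]+)"
)

def parse_firmware_info(text: str) -> dict:
    """Map each firmware block name to its firmware version as an int."""
    versions = {}
    for line in text.splitlines():
        m = FW_RE.match(line.strip())
        if m:  # lines like "VBIOS version: ..." do not match and are skipped
            versions[m.group("name")] = int(m.group("fw"), 16)
    return versions
```

With the dump above as input, `hex(parse_firmware_info(text)["SDMA0"])` gives `0x13`. Diffing the dictionaries from two installs shows which blocks changed between releases.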
Closed issue. Without documentation, it's impossible to know what the intended behavior is. Can revisit when the SDMA IP block is documented.
2x7900XTX
commit ab10d67c0a1fe759948a25040f22a6f055754c8d in tinygrad
repro with
python3 test/external/external_test_hcq.py TestHCQ.test_copy_bandwidth