amd / fuzzyHSA

Apache License 2.0
Repeatable SDMA firmware crash #2

Closed (geohot closed 3 months ago)

geohot commented 3 months ago

2x7900XTX

commit ab10d67c0a1fe759948a25040f22a6f055754c8d in tinygrad

repro with python3 test/external/external_test_hcq.py TestHCQ.test_copy_bandwidth

[354561.777034] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[354561.778054] amdgpu: failed to remove hardware queue from MES, doorbell=0x1202
[354561.778555] amdgpu: MES might be in unrecoverable state, issue a GPU reset
[354561.779054] amdgpu: Failed to evict queue 1
[354561.779548] amdgpu: Failed to evict process queues
[354561.780040] amdgpu: Failed to quiesce KFD
[354561.789279] amdgpu 0000:c3:00.0: amdgpu: GPU reset begin!
[354561.805179] amdgpu: Failed to remove queue 0
[354562.802028] [drm:sdma_v6_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out
[354562.804532] amdgpu 0000:c3:00.0: amdgpu: IP block:sdma_v6_0 is hung!
[354562.820132] amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow start
[354562.820144] amdgpu 0000:c3:00.0: amdgpu: recover vram bo from shadow done
[354562.820223] [drm] Skip scheduling IBs!
[354562.820246] [drm] Skip scheduling IBs!
[354562.820261] [drm] Skip scheduling IBs!
[354562.820272] [drm] Skip scheduling IBs!
[354562.820283] [drm] Skip scheduling IBs!
[354562.820293] [drm] Skip scheduling IBs!
[354562.820304] [drm] Skip scheduling IBs!
[354562.820313] [drm] Skip scheduling IBs!
[354562.820324] [drm] Skip scheduling IBs!
[354562.820335] [drm] Skip scheduling IBs!
[354562.820346] [drm] Skip scheduling IBs!
[354562.820730] [drm] ring gfx_32774.1.1 was added
[354562.820975] [drm] ring compute_32774.2.2 was added
[354562.821226] [drm] ring sdma_32774.3.3 was added
[354562.821299] [drm] ring gfx_32774.1.1 ib test pass
[354562.821339] [drm] ring compute_32774.2.2 ib test pass
[354562.821376] [drm] ring sdma_32774.3.3 ib test pass
[354562.821877] amdgpu 0000:c3:00.0: amdgpu: GPU reset(12) succeeded!

geohot commented 3 months ago

So this crash isn't particularly interesting: it's a ring buffer overrun. You can decide not to validate input and let the SDMA block crash, and that might not even be a bad decision.

You can fix this by upping the ring size from 0x1000 to 0x4000 or more on line 408 of ops_kfd.py:

self.sdma_ring = self._gpu_alloc(0x1000, kfd.KFD_IOC_ALLOC_MEM_FLAGS_USERPTR, uncached=True)
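
For reference, a minimal sketch of what validating input on the host side could look like, so that an oversized submission is rejected before it ever reaches the SDMA block. Everything here (`Ring`, `RingOverrun`, `free_bytes`, `write`, `consume`) is a hypothetical illustration and not code from ops_kfd.py. The thread also doesn't say whether the real overrun is a single submission larger than the ring or a wrap onto packets that haven't been consumed yet, so the sketch just treats both as "not enough free space".

```python
class RingOverrun(Exception):
    """Raised when a write would overwrite bytes the consumer has not read yet."""


class Ring:
    """Hypothetical byte ring buffer; NOT tinygrad's or the driver's implementation."""

    def __init__(self, size):
        # Power-of-two sizes are an assumption of this sketch.
        assert size > 0 and size & (size - 1) == 0, "size must be a power of two"
        self.size = size
        self.buf = bytearray(size)
        self.wptr = 0  # total bytes ever written (monotonic)
        self.rptr = 0  # total bytes ever consumed (monotonic)

    def free_bytes(self):
        # Space left before the writer would catch up with the reader.
        return self.size - (self.wptr - self.rptr)

    def write(self, data):
        # Validate first: refuse the write instead of overrunning the ring.
        if len(data) > self.free_bytes():
            raise RingOverrun(
                f"need {len(data)} bytes, only {self.free_bytes()} free "
                f"in a ring of {self.size:#x} bytes"
            )
        for b in data:
            self.buf[self.wptr % self.size] = b
            self.wptr += 1

    def consume(self, n):
        # Stand-in for the hardware advancing its read pointer.
        n = min(n, self.wptr - self.rptr)
        self.rptr += n
        return n
```

With this shape, a 0x1000-byte ring rejects anything beyond 0x1000 outstanding bytes, while a 0x4000-byte ring takes the same traffic with room to spare. Whether the check belongs in userspace, the kernel driver, or the firmware is exactly the design question here.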

What I'm much more interested in is getting the recovery behavior fixed. The first time, you get an SDMA crash and the GPU "recovers". The next time, you get an MES crash, even if you fix the ring size. The third time it seems recovered, except dmesg is spammed with

[355179.718600] [drm] Skip scheduling IBs!
[355179.718675] [drm] Skip scheduling IBs!
[355179.718684] [drm] Skip scheduling IBs!
[355179.718692] [drm] Skip scheduling IBs!
[355179.718700] [drm] Skip scheduling IBs!
[355179.718708] [drm] Skip scheduling IBs!
[355179.718716] [drm] Skip scheduling IBs!
[355179.718723] [drm] Skip scheduling IBs!

from that point forward.
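
To put a number on the spam, here is a small sketch that counts these messages in captured dmesg text. The function name `summarize_skips` and the returned dict shape are made up for illustration; the only thing taken from the logs above is the message format itself.

```python
import re

# Matches lines like: [355179.718600] [drm] Skip scheduling IBs!
SKIP_RE = re.compile(r"^\[\s*(\d+\.\d+)\] \[drm\] Skip scheduling IBs!\s*$")


def summarize_skips(dmesg_text):
    """Count 'Skip scheduling IBs!' lines and report first/last timestamps."""
    stamps = []
    for line in dmesg_text.splitlines():
        m = SKIP_RE.match(line.strip())
        if m:
            stamps.append(float(m.group(1)))
    if not stamps:
        return {"count": 0, "first": None, "last": None}
    return {"count": len(stamps), "first": stamps[0], "last": stamps[-1]}
```

Feeding it the output of `dmesg` before and after a run shows whether the count keeps growing after the reset reports success.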

zstreet87 commented 3 months ago

yeah makes sense, thanks man

geohot commented 3 months ago

As a meta point, the amount I'll engage with this depends on the level of technical responses I get. This is my test issue.

If it's all a black box, I won't report more. But if someone who understands the design choices made in the firmware engages here on a technical level, I'll engage. I'll put in effort if I see effort reciprocated, but not if I just keep hearing "it'll be fixed in the next release".

Was the design intention for the firmware to crash on invalid queues?

stellaraccident commented 3 months ago

Well stated.

zstreet87 commented 3 months ago

I didn't see an SDMA hang, but my cards are W7900 Pros. Let me get a machine with 2x 7900XTX to reproduce and get some FW people involved.

geohot commented 3 months ago

I'm on the ROCm 6.0.3 beta btw; I don't know if the SDMA firmware was changed (it looks to be the same 0x13 as in 6.0.0).

tiny@tiny5:/sys/kernel$ sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 29, firmware version: 0x0000080c
PFP feature version: 29, firmware version: 0x0000083e
CE feature version: 0, firmware version: 0x00000000
RLC feature version: 1, firmware version: 0x00000076
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
RLCP feature version: 1, firmware version: 0x00000019
RLCV feature version: 1, firmware version: 0x00000022
MEC feature version: 29, firmware version: 0x00000834
IMU feature version: 0, firmware version: 0x0b1f4b00
SOS feature version: 3211312, firmware version: 0x00310030
ASD feature version: 553648315, firmware version: 0x210000bb
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x1b000205
TA HDCP feature version: 0x00000000, firmware version: 0x1700003a
TA DTM feature version: 0x00000000, firmware version: 0x12000015
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, program: 0, firmware version: 0x004e7c00 (78.124.0)
SDMA0 feature version: 60, firmware version: 0x00000013
SDMA1 feature version: 60, firmware version: 0x00000013
VCN feature version: 0, firmware version: 0x05110006
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x07002100
TOC feature version: 12, firmware version: 0x0000000c
MES_KIQ feature version: 6, firmware version: 0x00000075
MES feature version: 1, firmware version: 0x00000057
VBIOS version: 113-31XFSHBS1-L02
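
For comparing firmware versions across ROCm releases, a rough sketch that pulls the per-block firmware versions out of a dump like the one above. `parse_firmware_info` is a hypothetical helper, and the regex only assumes the "NAME feature version: ..., firmware version: 0x..." layout visible in this output; lines that don't fit it (such as the VBIOS line) are skipped.

```python
import re

# "<name> feature version: <anything>, firmware version: 0x<hex>"
FW_RE = re.compile(
    r"^(?P<name>.+?) feature version: .*?firmware version: (?P<fw>0x[0-9a-fA-F]+)"
)


def parse_firmware_info(text):
    """Map each firmware block name to its firmware version as an int."""
    versions = {}
    for line in text.splitlines():
        m = FW_RE.match(line.strip())
        if m:
            versions[m.group("name")] = int(m.group("fw"), 16)
    return versions
```

Running this on dumps taken under two releases and diffing the resulting dicts shows which blocks actually changed, for example whether SDMA0 and SDMA1 are still at 0x13.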
geohot commented 3 months ago

Closed issue. Without documentation, it's impossible to know what the intended behavior is. Can revisit when the SDMA IP block is documented.