ROCm / ROCK-Kernel-Driver

AMDGPU Driver with KFD used by the ROCm project. Also contains the current Linux Kernel that matches this base driver

[Issue]: Page fault + Failed to initialize parser #168

Closed Mr-Andersen closed 1 month ago

Mr-Andersen commented 2 months ago

Problem Description

I get [gfxhub] page fault and then [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
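
A note on the -125 in the parser message: kernel functions return negative errno values, and on Linux errno 125 is ECANCELED ("Operation canceled"). The sketch below only looks up the symbolic name for such a code. `describe_errno` is a hypothetical helper, not part of any ROCm or amdgpu tooling. Reading the code as the driver cancelling submissions from the context involved in the GPU reset is an assumption that nothing in this thread confirms.

```python
import errno
import os


def describe_errno(code: int) -> tuple:
    """Return (symbolic name, message) for a kernel-style negative errno.

    Hypothetical helper for illustration; the numeric values are
    platform-specific, so run this on the Linux machine that produced the log.
    """
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    # -125 is the value printed by "Failed to initialize parser -125!"
    name, message = describe_errno(-125)
    print(f"-125 -> {name}: {message}")
```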

I couldn't find my GPU (AMD Radeon RX 6400 / gfx1034) in the list below :( So I chose the first one. Additionally, I didn't find instructions on how to find out my ROCm version; I am guessing it's 6.0.2 since it's the default one on current NixOS.

Operating System

NixOS 24.05 (Uakari)

CPU

AMD Ryzen 5 3600 6-Core Processor

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

Run wezterm; wait

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 5 3600 6-Core Processor  
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 5 3600 6-Core Processor  
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3600                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            12                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    16323632(0xf91430) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    16323632(0xf91430) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16323632(0xf91430) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1034                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6400                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      1024(0x400) KB                     
    L3:                      16384(0x4000) KB                   
  Chip ID:                 29759(0x743f)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2320                               
  BDFID:                   11008                              
  Internal Node ID:        1                                  
  Compute Unit:            12                                 
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 118                                
  SDMA engine uCode::      34                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    4177920(0x3fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    4177920(0x3fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1034         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Additional Information

$ uname -a
Linux big-system 6.6.44 #1-NixOS SMP PREEMPT_DYNAMIC Sat Aug  3 06:54:42 UTC 2024 x86_64 GNU/Linux
$ dmesg | rg amdgpu
[    0.000000] Command line: initrd=\EFI\nixos\i6l1b3gwjhmgqfha8wirqnwwi2d7z5lw-initrd-linux-6.6.44-initrd.efi init=/nix/store/abdplibma8crxqczj3n3nisq8qzkb8zs-nixos-system-big-system-24.05.20240810.a781ff3/init amdgpu.runpm=0 nohibernate loglevel=4
[    0.044886] Kernel command line: initrd=\EFI\nixos\i6l1b3gwjhmgqfha8wirqnwwi2d7z5lw-initrd-linux-6.6.44-initrd.efi init=/nix/store/abdplibma8crxqczj3n3nisq8qzkb8zs-nixos-system-big-system-24.05.20240810.a781ff3/init amdgpu.runpm=0 nohibernate loglevel=4
[    0.529091] stage-1-init: [Mon Aug 12 15:52:45 UTC 2024] loading module amdgpu...
[    2.956418] [drm] amdgpu kernel modesetting enabled.
[    2.956542] amdgpu: Virtual CRAT table created for CPU
[    2.956560] amdgpu: Topology: Add CPU node
[    2.960616] amdgpu 0000:2b:00.0: No more image in the PCI ROM
[    2.960634] amdgpu 0000:2b:00.0: amdgpu: Fetched VBIOS from ROM BAR
[    2.960639] amdgpu: ATOM BIOS: 115-D632BP2-100
[    2.986743] amdgpu 0000:2b:00.0: vgaarb: deactivate vga console
[    2.986746] amdgpu 0000:2b:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[    2.986816] amdgpu 0000:2b:00.0: amdgpu: VRAM: 4080M 0x0000008000000000 - 0x00000080FEFFFFFF (4080M used)
[    2.986819] amdgpu 0000:2b:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    2.986821] amdgpu 0000:2b:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[    2.986943] [drm] amdgpu: 4080M of VRAM memory ready
[    2.986945] [drm] amdgpu: 7970M of GTT memory ready.
[    4.884787] amdgpu 0000:2b:00.0: amdgpu: STB initialized to 2048 entries
[    4.885580] amdgpu 0000:2b:00.0: amdgpu: Will use PSP to load VCN firmware
[    5.053960] amdgpu 0000:2b:00.0: amdgpu: RAS: optional ras ta ucode is not available
[    5.069648] amdgpu 0000:2b:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[    5.069672] amdgpu 0000:2b:00.0: amdgpu: smu driver if version = 0x0000000d, smu fw if version = 0x00000010, smu fw program = 0, version = 0x00492400 (73.36.0)
[    5.069675] amdgpu 0000:2b:00.0: amdgpu: SMU driver if version not matched
[    5.069708] amdgpu 0000:2b:00.0: amdgpu: use vbios provided pptable
[    5.112009] amdgpu 0000:2b:00.0: amdgpu: SMU is initialized successfully!
[    5.166577] amdgpu: HMM registered 4080MB device memory
[    5.167775] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    5.167797] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[    5.167992] amdgpu: Virtual CRAT table created for GPU
[    5.168158] amdgpu: Topology: Add dGPU node [0x743f:0x1002]
[    5.168160] kfd kfd: amdgpu: added device 1002:743f
[    5.168180] amdgpu 0000:2b:00.0: amdgpu: SE 1, SH per SE 2, CU per SH 8, active_cu_number 12
[    5.169022] amdgpu 0000:2b:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[    5.169025] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    5.169026] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    5.169028] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    5.169030] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    5.169031] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    5.169033] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    5.169034] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    5.169036] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    5.169037] amdgpu 0000:2b:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[    5.169039] amdgpu 0000:2b:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[    5.169041] amdgpu 0000:2b:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[    5.170539] [drm] Initialized amdgpu 3.54.0 20150101 for 0000:2b:00.0 on minor 1
[    5.176756] fbcon: amdgpudrmfb (fb0) is primary device
[    5.268724] amdgpu 0000:2b:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[   10.829284] snd_hda_intel 0000:2b:00.1: bound 0000:2b:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[  273.406616] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  273.406641] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x000080019560e000 from client 0x1b (UTCL2)
[  273.406645] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[  273.406649] amdgpu 0000:2b:00.0: amdgpu:      Faulty UTCL2 client ID: TCP (0x8)
[  273.406652] amdgpu 0000:2b:00.0: amdgpu:      MORE_FAULTS: 0x1
[  273.406656] amdgpu 0000:2b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  273.406658] amdgpu 0000:2b:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[  273.406661] amdgpu 0000:2b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  273.406664] amdgpu 0000:2b:00.0: amdgpu:      RW: 0x0
[  273.406675] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  273.406680] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800195612000 from client 0x1b (UTCL2)
[  273.406684] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  273.406686] amdgpu 0000:2b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB (0x0)
[  273.406689] amdgpu 0000:2b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  273.406692] amdgpu 0000:2b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  273.406694] amdgpu 0000:2b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  273.406696] amdgpu 0000:2b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  273.406700] amdgpu 0000:2b:00.0: amdgpu:      RW: 0x0
[  273.406706] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  273.406710] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x000080050560a000 from client 0x1b (UTCL2)
[  273.406714] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  273.406717] amdgpu 0000:2b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB (0x0)
[  273.406720] amdgpu 0000:2b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  273.406722] amdgpu 0000:2b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  273.406725] amdgpu 0000:2b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  273.406728] amdgpu 0000:2b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  273.406730] amdgpu 0000:2b:00.0: amdgpu:      RW: 0x0
[  273.406737] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  273.406741] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800505606000 from client 0x1b (UTCL2)
[  273.406743] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  273.406746] amdgpu 0000:2b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB (0x0)
[  273.406749] amdgpu 0000:2b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  273.406752] amdgpu 0000:2b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  273.406755] amdgpu 0000:2b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  273.406757] amdgpu 0000:2b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  273.406760] amdgpu 0000:2b:00.0: amdgpu:      RW: 0x0
[  283.724101] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724123] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x000080050560a000 from client 0x1b (UTCL2)
[  283.724128] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00701031
[  283.724131] amdgpu 0000:2b:00.0: amdgpu:      Faulty UTCL2 client ID: TCP (0x8)
[  283.724134] amdgpu 0000:2b:00.0: amdgpu:      MORE_FAULTS: 0x1
[  283.724137] amdgpu 0000:2b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  283.724139] amdgpu 0000:2b:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[  283.724141] amdgpu 0000:2b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  283.724143] amdgpu 0000:2b:00.0: amdgpu:      RW: 0x0
[  283.724153] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724157] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800505606000 from client 0x1b (UTCL2)
[  283.724161] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  283.724163] amdgpu 0000:2b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB (0x0)
[  283.724166] amdgpu 0000:2b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  283.724168] amdgpu 0000:2b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  283.724170] amdgpu 0000:2b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  283.724172] amdgpu 0000:2b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  283.724175] amdgpu 0000:2b:00.0: amdgpu:      RW: 0x0
[  283.724181] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724185] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x000080019560e000 from client 0x1b (UTCL2)
[  283.724188] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  283.724191] amdgpu 0000:2b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB (0x0)
[  283.724194] amdgpu 0000:2b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  283.724196] amdgpu 0000:2b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  283.724199] amdgpu 0000:2b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  283.724202] amdgpu 0000:2b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  283.724204] amdgpu 0000:2b:00.0: amdgpu:      RW: 0x0
[  283.724211] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724216] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800195612000 from client 0x1b (UTCL2)
[  283.724219] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  283.724221] amdgpu 0000:2b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB (0x0)
[  283.724224] amdgpu 0000:2b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  283.724226] amdgpu 0000:2b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  283.724228] amdgpu 0000:2b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  283.724230] amdgpu 0000:2b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  283.724232] amdgpu 0000:2b:00.0: amdgpu:      RW: 0x0
[  283.724239] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724242] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800195612000 from client 0x1b (UTCL2)
[  283.724244] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  283.724246] amdgpu 0000:2b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB (0x0)
[  283.724248] amdgpu 0000:2b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  283.724250] amdgpu 0000:2b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  283.724252] amdgpu 0000:2b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  283.724254] amdgpu 0000:2b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  283.724255] amdgpu 0000:2b:00.0: amdgpu:      RW: 0x0
[  283.724262] amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:7 pasid:32771, for process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072)
[  283.724265] amdgpu 0000:2b:00.0: amdgpu:   in page starting at address 0x0000800195612000 from client 0x1b (UTCL2)
[  283.724267] amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  283.724269] amdgpu 0000:2b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB (0x0)
[  283.724271] amdgpu 0000:2b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  283.724272] amdgpu 0000:2b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  283.724274] amdgpu 0000:2b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  283.724276] amdgpu 0000:2b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  283.724278] amdgpu 0000:2b:00.0: amdgpu:      RW: 0x0
[  283.733953] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=49294, emitted seq=49296
[  283.734711] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process wezterm-gui pid 5044 thread wezterm-gu:cs0 pid 5072
[  283.735063] amdgpu 0000:2b:00.0: amdgpu: GPU reset begin!
[  283.916028] amdgpu 0000:2b:00.0: amdgpu: MODE1 reset
[  283.916038] amdgpu 0000:2b:00.0: amdgpu: GPU mode1 reset
[  283.916121] amdgpu 0000:2b:00.0: amdgpu: GPU smu mode1 reset
[  284.420071] amdgpu 0000:2b:00.0: amdgpu: GPU reset succeeded, trying to resume
[  284.600682] amdgpu 0000:2b:00.0: amdgpu: RAS: optional ras ta ucode is not available
[  284.616890] amdgpu 0000:2b:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  284.616933] amdgpu 0000:2b:00.0: amdgpu: SMU is resuming...
[  284.616940] amdgpu 0000:2b:00.0: amdgpu: smu driver if version = 0x0000000d, smu fw if version = 0x00000010, smu fw program = 0, version = 0x00492400 (73.36.0)
[  284.616945] amdgpu 0000:2b:00.0: amdgpu: SMU driver if version not matched
[  284.616980] amdgpu 0000:2b:00.0: amdgpu: use vbios provided pptable
[  284.661035] amdgpu 0000:2b:00.0: amdgpu: SMU is resumed successfully!
[  284.744332] amdgpu 0000:2b:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[  284.744336] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[  284.744339] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[  284.744342] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[  284.744344] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[  284.744347] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[  284.744349] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[  284.744352] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[  284.744354] amdgpu 0000:2b:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[  284.744356] amdgpu 0000:2b:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[  284.744359] amdgpu 0000:2b:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[  284.744361] amdgpu 0000:2b:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[  284.747468] amdgpu 0000:2b:00.0: amdgpu: recover vram bo from shadow start
[  284.752213] amdgpu 0000:2b:00.0: amdgpu: recover vram bo from shadow done
[  284.752276] amdgpu 0000:2b:00.0: amdgpu: GPU reset(2) succeeded!
[  284.779374] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
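
For readers decoding the fault lines above: the driver prints the raw GCVM_L2_PROTECTION_FAULT_STATUS value and then the individual fields it extracts from it. Below is a minimal sketch of that extraction. The bit layout in `FIELDS` is an assumption based on the gfx10 register definitions and is not stated anywhere in this thread. It does reproduce the values the log prints for 0x00701031: client ID 0x8, PERMISSION_FAULTS 0x3, MORE_FAULTS 0x1, and vmid 7. `decode_fault_status` is a hypothetical helper name.

```python
# Assumed bit layout of GCVM_L2_PROTECTION_FAULT_STATUS on gfx10,
# as (field name, shift, width). Verify against the driver headers
# for your kernel before relying on it.
FIELDS = [
    ("MORE_FAULTS", 0, 1),
    ("WALKER_ERROR", 1, 3),
    ("PERMISSION_FAULTS", 4, 4),
    ("MAPPING_ERROR", 8, 1),
    ("CID", 9, 9),  # the "Faulty UTCL2 client ID" in the log
    ("RW", 18, 1),
    ("ATOMIC", 19, 1),
    ("VMID", 20, 4),
]


def decode_fault_status(status: int) -> dict:
    """Split a raw status value into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }


if __name__ == "__main__":
    # Raw value taken from the first fault in the dmesg output above.
    for name, value in decode_fault_status(0x00701031).items():
        print(f"{name}: {value:#x}")
```

The faults that follow the first one in each burst report a status of 0x00000000, so every field decodes to zero and they carry no extra information about the cause.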

ppanchad-amd commented 2 months ago

@Mr-Andersen Internal ticket has been created to investigate this issue. Thanks!

schung-amd commented 2 months ago

Hi @Mr-Andersen, are your kernel and mesa libs up to date? It seems like there are some similar issues caused by out-of-date kernel and mesa versions.

Mr-Andersen commented 2 months ago

@schung-amd I've upgraded to Linux 6.6.47 and Mesa 24.2.0, still having the issue. Can't bump Linux further yet, since I need ZFS. In which versions were those similar issues resolved?

schung-amd commented 2 months ago

Thanks for checking! It's unclear which versions help; one user's issues were fixed by a kernel update within 6.5.x with Mesa 23.1.7 (https://www.reddit.com/r/linux_gaming/comments/16jhxnz/starfield_crashes_amd_radeon_rx_6600/), while others are having issues with more recent versions. Some users report that RAM problems are related (https://gitlab.freedesktop.org/drm/amd/-/issues/2943). These reports cover different workloads on various hardware, so your underlying issue may be different, but they might provide a clue. I'll try to reproduce your issue with wezterm specifically on similar hardware and get back to you.

schung-amd commented 1 month ago

Hi @Mr-Andersen, I was unable to reproduce your issue, but I may be missing something in the NixOS configuration. On a fresh install of NixOS 24.05 on an RX 6400, I installed ROCm and wezterm through the config file by

environment.systemPackages = with pkgs; [
    pkgs.rocmPackages.rpp
    pkgs.wezterm
];

in /etc/nixos/configuration.nix followed by a sudo nixos-rebuild switch, and I can use wezterm without any obvious issue. Is there a crash or hang occurring when you encounter this issue, or are the error messages the only symptom?

I've also tried enabling OpenGL in the config file, but this doesn't cause wezterm to break. Do you have other options enabled which are related to GPU acceleration?

Mr-Andersen commented 1 month ago

Hey @schung-amd, here are all the relevant options from my config:

{ config, ... }: {
  boot.initrd.kernelModules = [ "amdgpu" ];

  boot = {
    kernelPackages = config.boot.zfs.package.latestCompatibleLinuxPackages;
    kernelParams = [ "amdgpu.runpm=0" ]; # <-- this was me trying to fix the issue by reading Arch forums :)
  };

  hardware = {
    graphics = {
      enable = true;
      enable32Bit = true;
    };
  };

  services = {
    displayManager = {
      defaultSession = "xfce";
    };
    xserver = {
      enable = true;
      displayManager.lightdm.enable = true;
      desktopManager.xfce.enable = true;
    };
  };
}

My current nixpkgs commit is 12228ff1752d7b7624a54e9c1af4b222b3c1073b. I am currently on the github:NixOS/nixpkgs/nixos-unstable branch, but I started seeing the issue while using nixos-24.05.

Here is how I experience it:

I should've provided this config from the beginning, sorry about that. This is my first serious bug report :)

schung-amd commented 1 month ago

No worries, thanks for the config information. A couple of follow-up questions so I can try to reproduce the issue:

Mr-Andersen commented 1 month ago

schung-amd commented 1 month ago

Sure, a list of packages couldn't hurt. I'm more interested in the other config options, in case we could narrow this down to a config change, but if you haven't tested on a fresh install that's ok.

Mr-Andersen commented 1 month ago

I forgot there is also a hardware-configuration.nix:

{ config, lib, pkgs, modulesPath, ... }:

{
  imports =
    [ (modulesPath + "/installer/scan/not-detected.nix")
    ];

  boot.initrd.availableKernelModules = [ "xhci_pci" "ahci" "usbhid" "usb_storage" "sd_mod" ];
  boot.initrd.kernelModules = [ ];
  boot.kernelModules = [ "kvm-amd" ];
  boot.extraModulePackages = [ ];

  nixpkgs.hostPlatform = lib.mkDefault "x86_64-linux";
  hardware.cpu.amd.updateMicrocode = lib.mkDefault config.hardware.enableRedistributableFirmware;
}

To clarify - by "fresh install" you mean "an install with as little customization as possible"? My install is new - I have had these issues since the first boot. Maybe that's what you meant?

schung-amd commented 1 month ago

Yes, that's what I meant, sorry for any confusion. That will make this much easier, thanks. I'll try running a high load as you suggest and see if I can reproduce the issue.

schung-amd commented 1 month ago

I am unable to reproduce the issue on XFCE, even at high load. Could you upload your configuration.nix file? You can scrub out your user information if you want to. hardware-configuration.nix should be automatically generated, but uploading this might help as well, so I can check for any discrepancies between your system and what I'm trying to repro with. Thanks!

schung-amd commented 1 month ago

Closing this as I can't reproduce the issue. If you'd still like support on this issue, feel free to reopen with your configuration.nix and hardware-configuration.nix files, and ideally with a consistent method of reproducing the issue.

Mr-Andersen commented 3 weeks ago

Sorry for leaving this thread. It seems that the issue was fixed somewhere upstream: https://discourse.nixos.org/t/getting-amdgpu-error-that-crashes-desktop/50510/8?u=mr-andersen

schung-amd commented 3 weeks ago

Glad to hear your issue is resolved, thanks for the update!