ROCm / ROCK-Kernel-Driver

AMDGPU Driver with KFD used by the ROCm project. Also contains the current Linux Kernel that matches this base driver

(ppc64el) no-retry page fault: VM_L2_PROTECTION_FAULT #147

Closed tucnak closed 1 month ago

tucnak commented 1 year ago
-- on every rocminfo call
amdgpu: update_gpuvm_pte() failed
amdgpu: SG Table of BO is UNEXPECTEDLY NULL
amdgpu: Failed to map bo to gpuvm
amdgpu 0000:03:00.0: amdgpu: Failed to map peer:0000:03:00.0 mem_domain:

-- occurs in hipblas
amdgpu: init_user_pages: Failed to get user pages: -1
amdgpu 0000:03:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process main pid 813201 thread main pid 813201)
amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000079c2a1fff000 from IH client 0x1b (UTCL2)
amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00801031
amdgpu 0000:03:00.0: amdgpu:     Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:03:00.0: amdgpu:     MORE_FAULTS: 0x1
amdgpu 0000:03:00.0: amdgpu:     WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu:     PERMISSION_FAULTS: 0x3
amdgpu 0000:03:00.0: amdgpu:     MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu:     RW: 0x0
amdgpu 0000:03:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process main pid 813201 thread main pid 813201)
amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000079c2a1ffa000 from IH client 0x1b (UTCL2)
amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
amdgpu 0000:03:00.0: amdgpu:     Faulty UTCL2 client ID: CB (0x0)
amdgpu 0000:03:00.0: amdgpu:     MORE_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu:     WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu:     MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu:     RW: 0x0
amdgpu 0000:03:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process main pid 813201 thread main pid 813201)
amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000079c2a1ff5000 from IH client 0x1b (UTCL2)
amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
amdgpu 0000:03:00.0: amdgpu:     Faulty UTCL2 client ID: CB (0x0)
amdgpu 0000:03:00.0: amdgpu:     MORE_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu:     WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu:     MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu:     RW: 0x0
amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
amdgpu: sq_intr: error, se 1, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 0, err_type 2
amdgpu: sq_intr: error, se 1, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 0, err_type 2
amdgpu: sq_intr: error, se 1, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 0, err_type 2
amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
amdgpu: Resetting wave fronts (cpsch) on dev 00000000fa7830ec

However, I'm using the amdgpu driver that came with the 6.3.4 kernel and hipBLAS from ROCm 5.3.2. Does this mean I would have to build the kernel from this repository, and how likely is it that doing so would help?
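For readers who want to check the fault lines in the log above, here is a minimal sketch that splits a `VM_L2_PROTECTION_FAULT_STATUS` value into the fields the driver prints. The bit positions in `FIELDS` are an assumption based on the gfx9 register layout, not something stated in this thread. They do reproduce the values logged for `0x00801031` (MORE_FAULTS 0x1, PERMISSION_FAULTS 0x3, client ID 0x8, and VMID 8, which matches `vmid:8` in the fault header). The function name `decode_fault_status` is hypothetical.

```python
# Assumed bit layout for VM_L2_PROTECTION_FAULT_STATUS on gfx9 parts:
# field name -> (shift, width in bits). Not taken from this thread.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),   # UTCL2 client ID, printed as "Faulty UTCL2 client ID"
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status: int) -> dict:
    """Extract each assumed bit field from a 32-bit status value."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # First fault status from the log above.
    for name, value in decode_fault_status(0x00801031).items():
        print(f"{name}: {value:#x}")
```

Under the same assumed layout, the two later faults report a status of `0x00000000`, so every field decodes to zero, including the client ID printed as `CB (0x0)`.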

ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    POWER9
  Uuid:                    CPU-XX
  Marketing Name:          POWER9
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   3800
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            32
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    64391040(0x3d68780) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    64391040(0x3d68780) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    64391040(0x3d68780) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx906
  Uuid:                    GPU-bc4261817337ecd7
  Marketing Name:          AMD Radeon Graphics
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 26273(0x66a1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1725
  BDFID:                   768
  Internal Node ID:        1
  Compute Unit:            60
  SIMDs per CU:            4
  Shader Engines:          4
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        40(0x28)
  Max Work-item Per CU:    2560(0xa00)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    33538048(0x1ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
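To tie the rocminfo agent to the device named in the kernel log, the `BDFID` value can be unpacked into a PCI address. This is a small sketch that assumes the usual PCI encoding (bus in bits 15:8, device in bits 7:3, function in bits 2:0) and a PCI domain of 0. Both are assumptions, and `bdfid_to_pci` is a hypothetical helper name.

```python
def bdfid_to_pci(bdfid: int, domain: int = 0) -> str:
    """Format a 16-bit bus/device/function ID as domain:bus:device.function."""
    bus = (bdfid >> 8) & 0xFF      # assumed: bits 15:8
    device = (bdfid >> 3) & 0x1F   # assumed: bits 7:3
    function = bdfid & 0x7         # assumed: bits 2:0
    return f"{domain:04x}:{bus:02x}:{device:02x}.{function}"


if __name__ == "__main__":
    # BDFID reported for Agent 2 (gfx906) in the rocminfo output above.
    print(bdfid_to_pci(768))
```

With `BDFID: 768` this gives `0000:03:00.0`, the same device that appears in the amdgpu log lines.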
ppanchad-amd commented 3 months ago

@tucnak Apologies for the lack of response. Can you please check whether your issue still exists with the latest ROCm 6.2? If not, please close the ticket. Thanks!

ppanchad-amd commented 1 month ago

@tucnak Closing the ticket. Please feel free to reopen it if you still see the issue with the latest ROCm. Thanks!