Closed DanaGoyette closed 2 years ago
Thanks for raising a bug report. I think you provided enough details for me to figure it out. I will let you know when I have pushed an update.
I pushed an update that should fix it, but I could not test it. Let me know what you observe.
Thanks, now gpu-ls says this (after rebooting to set ppfeaturemask).
Since I haven't really used these tools before, I don't know what to expect, but it makes sense that there's no WattMan: the same is true on Windows, you can't really tune anything.
Detected GPUs: AMD: 1
amdgpu/rocm version: UNKNOWN
AMD: Wattman features not enabled: 0xfff7bfff, See README file.
1 total GPUs, 0 rw, 0 r-only, 0 w-only
Card Number: None
Vendor: AMD
Readable: False
Writable: False
Compute: False
Device ID: {'device': '0x67e3', 'subsystem_device': '0x0b0d', 'subsystem_vendor': '0x1002', 'vendor': '0x1002'}
Decoded Device ID: Baffin [Radeon Pro WX 4100]
Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon Pro WX 4100]
PCIe ID: 0004:01:00.0
Driver: amdgpu
GPU Type: Unsupported
HWmon: None
Card Path: None
System Card Path: /sys/devices/pci0004:01/0004:01:00.0
Full debug log:
DEBUG:gpu-utils:env.set_args:Install type: debian
DEBUG:gpu-utils:env.set_args:Command line arguments:
Namespace(about=False, short=False, table=False, pstates=False, ppm=False, clinfo=False, no_fan=False, debug=True)
DEBUG:gpu-utils:env.set_args:Local TZ: PDT
DEBUG:gpu-utils:env.set_args:pciid path set to: /usr/share/misc/pci.ids
DEBUG:gpu-utils:env.set_args:Icon path set to: /usr/share/rickslab-gpu-utils/icons
DEBUG:gpu-utils:gpu-ls.main:########## gpu-ls 3.6.2
DEBUG:gpu-utils:env.check_env:Using python: 3.9.7
DEBUG:gpu-utils:env.check_env:Using Linux Kernel: 5.15.28-cex7
DEBUG:gpu-utils:env.check_env:Using Linux Distro: Ubuntu
DEBUG:gpu-utils:env.check_env:Linux Distro Description: Ubuntu 21.10
DEBUG:gpu-utils:env.check_env:Distro: Ubuntu, Ubuntu 21.10
DEBUG:gpu-utils:env.check_env:lspci path: /usr/bin/lspci
DEBUG:gpu-utils:env.check_env:clinfo path: /usr/bin/clinfo
DEBUG:gpu-utils:env.check_env:Ubuntu package query tool: /usr/bin/dpkg
DEBUG:gpu-utils:GPUmodule.read_gpu_opencl_data:openCL map CL_DEVICE_NAME: [AMD Radeon (TM) Pro WX 4100 (POLARIS11, DRM 3.42.0, 5.15.28-cex7, LLVM 12.0.1)]
DEBUG:gpu-utils:GPUmodule.read_gpu_opencl_data:openCL map CL_DEVICE_VERSION: [OpenCL 1.1 Mesa 21.2.6]
DEBUG:gpu-utils:GPUmodule.read_gpu_opencl_data:openCL map CL_DRIVER_VERSION: [21.2.6]
DEBUG:gpu-utils:GPUmodule.read_gpu_opencl_data:openCL map CL_DEVICE_OPENCL_C_VERSION: [OpenCL C 1.1]
DEBUG:gpu-utils:GPUmodule.read_gpu_opencl_data:openCL map CL_DEVICE_MAX_COMPUTE_UNITS: [16]
DEBUG:gpu-utils:GPUmodule.read_gpu_opencl_data:openCL map CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: [3]
DEBUG:gpu-utils:GPUmodule.read_gpu_opencl_data:openCL map CL_DEVICE_MAX_WORK_ITEM_SIZES: [256 256 256]
DEBUG:gpu-utils:GPUmodule.read_gpu_opencl_data:openCL map CL_DEVICE_MAX_WORK_GROUP_SIZE: [256]
DEBUG:gpu-utils:GPUmodule.read_gpu_opencl_data:openCL map CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE: [64]
DEBUG:gpu-utils:GPUmodule.read_gpu_opencl_data:openCL map CL_DEVICE_MAX_MEM_ALLOC_SIZE: [3435973836]
DEBUG:gpu-utils:GPUmodule.set_gpu_list:OpenCL map: {None: {'prf_wg_multiple': '64', 'max_wg_size': '256', 'prf_wg_size': None, 'max_wi_sizes': '256 256 256', 'max_wi_dim': '3', 'max_mem_allocation': '3435973836', 'simd_ins_width': None, 'simd_width': None, 'simd_per_cu': None, 'max_cu': '16', 'device_name': 'AMD Radeon (TM) Pro WX 4100 (POLARIS11, DRM 3.42.0, 5.15.28-cex7, LLVM 12.0.1)', 'opencl_version': 'OpenCL C 1.1', 'driver_version': '21.2.6', 'device_version': 'OpenCL 1.1 Mesa 21.2.6'}}
DEBUG:gpu-utils:env.read_amdfeaturemask:Raw Featuremask string: [0xfff7bfff]
DEBUG:gpu-utils:env.read_amdfeaturemask:AMD featuremask: 0xfff7bfff
DEBUG:gpu-utils:GPUmodule.get_gpu_pci_list:Found GPU pci: 0004:01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon Pro WX 4100]
DEBUG:gpu-utils:GPUmodule.set_gpu_list:Found 1 GPUs
DEBUG:gpu-utils:GPUmodule.add:Added GPU Item e8020b1d36c540ccb5aa3eeedb97fe8e to GPU List
DEBUG:gpu-utils:GPUmodule.set_gpu_list:GPU: 0004:01:00.0
DEBUG:gpu-utils:GPUmodule.set_gpu_list:lspci output items:
['0004:01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon Pro WX 4100]', '\tSubsystem: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon Pro WX 4100]', '\tKernel driver in use: amdgpu', '\tKernel modules: amdgpu', '']
DEBUG:gpu-utils:GPUmodule.set_gpu_list:gpu_name: [Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon Pro WX 4100]]
DEBUG:gpu-utils:GPUmodule.set_gpu_list:sysfpath: /sys/devices/pci0004:01/0004:01:00.0
device_dir: /sys/class/drm/card0/device
DEBUG:gpu-utils:GPUmodule.set_gpu_list:card_path not set for: 0004:01:00.0
DEBUG:gpu-utils:GPUmodule.set_gpu_list:GPU[e8020b1d36c540ccb5aa3eeedb97fe8e] type set to Unsupported
DEBUG:gpu-utils:GPUmodule.set_gpu_list:/sys/device file search found match to pcie_id 0004:01:00.0:
['/sys/devices/pci0004:01/0004:01:00.0']
DEBUG:gpu-utils:GPUmodule.populate_prm_from_dict:prm dict:
{'pcie_id': '0004:01:00.0', 'model': 'Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon Pro WX 4100]', 'vendor': <vendor.AMD: 3>, 'driver': 'amdgpu', 'card_path': '', 'sys_card_path': '/sys/devices/pci0004:01/0004:01:00.0', 'gpu_type': <type.Unsupported: 2>, 'hwmon_path': '', 'readable': False, 'writable': False, 'compute': False, 'compute_platform': None}
DEBUG:gpu-utils:GPUmodule.set_gpu_list:Card flags: readable: False, writable: False, type: Unsupported
DEBUG:gpu-utils:GPUmodule.read_gpu_sensor_generic:sensor path set to [/sys/devices/pci0004:01/0004:01:00.0]
DEBUG:gpu-utils:GPUmodule.set_params_value:Set param value: [['0x1002', '0x67e3', '0x1002', '0x0b0d']], type: [<class 'list'>]
DEBUG:gpu-utils:GPUmodule.wattman_status:AMD featuremask: 0xfff7bfff
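For readers wondering why the featuremask 0xfff7bfff in the log above is reported as "Wattman features not enabled", here is a minimal sketch of one way to decode it. It assumes that bit 14 (0x4000) is the overdrive bit, which is my understanding of PP_OVERDRIVE_MASK in the kernel's amd_shared.h; the function names are hypothetical and this is not the gpu-utils implementation.

```python
# Assumption: 0x4000 is the overdrive bit (PP_OVERDRIVE_MASK in the kernel's
# amd_shared.h). Verify against your kernel source before relying on it.
PP_OVERDRIVE_MASK = 0x4000

# Usual location of the amdgpu module parameter in sysfs.
FEATUREMASK_PATH = "/sys/module/amdgpu/parameters/ppfeaturemask"


def overdrive_enabled(mask_text: str) -> bool:
    """Return True if the overdrive bit is set in a featuremask string."""
    # Base 0 lets int() accept both "0x..." and plain decimal text.
    mask = int(mask_text.strip(), 0)
    return bool(mask & PP_OVERDRIVE_MASK)


def read_featuremask(path: str = FEATUREMASK_PATH) -> str:
    """Read the raw featuremask text from sysfs (hypothetical helper)."""
    with open(path) as handle:
        return handle.read()


print(overdrive_enabled("0xfff7bfff"))  # value from the log above -> False
print(overdrive_enabled("0xffffffff"))  # all feature bits set -> True
```

Under that assumption, the mask in the log has the overdrive bit cleared, which would be consistent with the message gpu-ls prints.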
Maybe your system also differs in how it defines or uses what I am calling the card_path, which typically contains a link to the system device file. Can you check the contents of "/sys/class/drm/"? The contents of the directory that gpu-ls reports as the "System Card Path" would also be useful.
AMD GPUs before Fiji are not well supported in the Linux drivers, so the available capability for Baffin may be limited. But I am motivated to figure out how to deal with card path and device path for this type of installation.
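To illustrate the relationship between the card path and the system device path, here is a rough sketch that lists each cardN entry under /sys/class/drm and resolves its device link. The function name drm_card_paths is hypothetical, and this only shows the idea; it is not how gpu-utils does it.

```python
import glob
import os
import re


def drm_card_paths(drm_root="/sys/class/drm"):
    """Map each cardN entry to (card_path, resolved system device path)."""
    result = {}
    for card in sorted(glob.glob(os.path.join(drm_root, "card*"))):
        name = os.path.basename(card)
        if not re.fullmatch(r"card\d+", name):
            continue  # skip connector entries such as card0-DP-1
        device = os.path.join(card, "device")
        # realpath follows the symlinks down to /sys/devices/...
        result[name] = (device, os.path.realpath(device))
    return result


if __name__ == "__main__":
    for name, (card_path, sys_path) in drm_card_paths().items():
        print(name, card_path, "->", sys_path)
```

On a system like the one in this report, the last component of the resolved path would be expected to be the full PCI ID including the domain, for example 0004:01:00.0.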
/sys/class/drm:
lrwxrwxrwx 1 root root 0 Mar 21 17:16 card0 -> ../../devices/pci0004:01/0004:01:00.0/drm/card0
lrwxrwxrwx 1 root root 0 Mar 21 17:16 card0-DP-1 -> ../../devices/pci0004:01/0004:01:00.0/drm/card0/card0-DP-1
lrwxrwxrwx 1 root root 0 Mar 21 17:16 card0-DP-2 -> ../../devices/pci0004:01/0004:01:00.0/drm/card0/card0-DP-2
lrwxrwxrwx 1 root root 0 Mar 21 17:16 card0-DP-3 -> ../../devices/pci0004:01/0004:01:00.0/drm/card0/card0-DP-3
lrwxrwxrwx 1 root root 0 Mar 21 17:16 card0-DP-4 -> ../../devices/pci0004:01/0004:01:00.0/drm/card0/card0-DP-4
lrwxrwxrwx 1 root root 0 Mar 21 17:16 renderD128 -> ../../devices/pci0004:01/0004:01:00.0/drm/renderD128
-r--r--r-- 1 root root 4096 Mar 21 17:16 version
/sys/class/hwmon:
lrwxrwxrwx 1 root root 0 Mar 21 17:16 hwmon0 -> ../../devices/virtual/thermal/thermal_zone0/hwmon0
lrwxrwxrwx 1 root root 0 Mar 21 17:16 hwmon1 -> ../../devices/pci0004:01/0004:01:00.0/hwmon/hwmon1
/sys/class/hwmon/hwmon1/:
lrwxrwxrwx 1 root root 0 Mar 21 17:16 device -> ../../../0004:01:00.0
-rw-r--r-- 1 root root 4096 Mar 21 17:45 fan1_enable
-r--r--r-- 1 root root 4096 Mar 21 17:16 fan1_input
-r--r--r-- 1 root root 4096 Mar 21 17:16 fan1_max
-r--r--r-- 1 root root 4096 Mar 21 17:16 fan1_min
-rw-r--r-- 1 root root 4096 Mar 21 17:45 fan1_target
-r--r--r-- 1 root root 4096 Mar 21 17:45 freq1_input
-r--r--r-- 1 root root 4096 Mar 21 17:45 freq1_label
-r--r--r-- 1 root root 4096 Mar 21 17:45 freq2_input
-r--r--r-- 1 root root 4096 Mar 21 17:45 freq2_label
-r--r--r-- 1 root root 4096 Mar 21 17:16 in0_input
-r--r--r-- 1 root root 4096 Mar 21 17:16 in0_label
-r--r--r-- 1 root root 4096 Mar 21 17:16 name
drwxr-xr-x 2 root root 0 Mar 21 17:45 power
-r--r--r-- 1 root root 4096 Mar 21 17:16 power1_average
-rw-r--r-- 1 root root 4096 Mar 21 17:16 power1_cap
-r--r--r-- 1 root root 4096 Mar 21 17:45 power1_cap_default
-r--r--r-- 1 root root 4096 Mar 21 17:45 power1_cap_max
-r--r--r-- 1 root root 4096 Mar 21 17:45 power1_cap_min
-r--r--r-- 1 root root 4096 Mar 21 17:16 power1_label
-rw-r--r-- 1 root root 4096 Mar 21 17:45 pwm1
-rw-r--r-- 1 root root 4096 Mar 21 17:45 pwm1_enable
-r--r--r-- 1 root root 4096 Mar 21 17:45 pwm1_max
-r--r--r-- 1 root root 4096 Mar 21 17:45 pwm1_min
lrwxrwxrwx 1 root root 0 Mar 21 17:16 subsystem -> ../../../../../class/hwmon
-r--r--r-- 1 root root 4096 Mar 21 17:16 temp1_crit
-r--r--r-- 1 root root 4096 Mar 21 17:16 temp1_crit_hyst
-r--r--r-- 1 root root 4096 Mar 21 17:16 temp1_input
-r--r--r-- 1 root root 4096 Mar 21 17:16 temp1_label
-rw-r--r-- 1 root root 4096 Mar 21 17:16 uevent
Speaking of PCIe domains, the other place I've seen them is on multi-socket boards, but those are a different kind of expensive.
Are PCI domains unique to multi-socket boards? This is the first case I have seen in this project. It would be cool to have a multi-socket system up and running, but with 64-core single-socket systems available, I had not considered the cost of dual socket.
I just pushed a quick update. It adds the capability to handle the domain when setting the card path. Let me know if it works. Once we get this working, it would be best if I refactored this section of code.
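As an illustration of what handling the domain can involve, here is a minimal sketch of a pattern that accepts a PCI address with or without the leading domain. The names PCI_ADDR and parse_pci_addr are hypothetical, and this is not the pattern used in gpu-utils.

```python
import re

# Hypothetical pattern: an optional 4-hex-digit domain, then bus:device.function.
PCI_ADDR = re.compile(
    r"^(?:(?P<domain>[0-9a-fA-F]{4}):)?"
    r"(?P<bus>[0-9a-fA-F]{2}):"
    r"(?P<dev>[0-9a-fA-F]{2})\."
    r"(?P<fn>[0-7])"
)


def parse_pci_addr(text):
    """Return (domain, bus, device, function) or None if text has no address."""
    match = PCI_ADDR.match(text)
    if match is None:
        return None
    # Treat a missing domain as 0000 (an assumption made for this sketch).
    domain = match.group("domain") or "0000"
    return (domain, match.group("bus"), match.group("dev"), match.group("fn"))


print(parse_pci_addr("0004:01:00.0 VGA compatible controller"))
print(parse_pci_addr("01:00.0 VGA compatible controller"))
```

Making the domain group optional keeps the short bb:dd.f form working while also accepting the xxxx:bb:dd.f form seen on this board.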
In my ARM board's case, it's not really multi-socket, it just has the PCIe root hidden in firmware because of quirks.
Thanks for the additional fix, now it sees plenty of info. I'll paste the output, but not the (now larger) debug log.
Note that at the moment, I'm booted with amdgpu.bapm=0, as an attempt to work around odd hangs.
Ubuntu: Validated
Detected GPUs: AMD: 1
amdgpu/rocm version: UNKNOWN
AMD: Wattman features not enabled: 0xfff7bfff, See README file.
1 total GPUs, 0 rw, 1 r-only, 0 w-only
Card Number: 0
Vendor: AMD
Readable: True
Writable: False
Compute: False
GPU UID: None
Device ID: {'device': '0x67e3', 'subsystem_device': '0x0b0d', 'subsystem_vendor': '0x1002', 'vendor': '0x1002'}
Decoded Device ID: Baffin [Radeon Pro WX 4100]
Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon Pro WX 4100]
Display Card Model: Baffin Pro WX 4100
PCIe ID: 0004:01:00.0
Link Speed: 8.0 GT/s PCIe
Link Width: 8
##################################################
Driver: amdgpu
vBIOS Version: 113-D0150600-103
Compute Platform: None
GPU Type: Modern
HWmon: /sys/class/drm/card0/device/hwmon/hwmon1
Card Path: /sys/class/drm/card0/device
System Card Path: /sys/devices/pci0004:01/0004:01:00.0
##################################################
Current Power (W): 6.146
Power Cap (W): 35.000
Power Cap Range (W): [0, 35]
Fan Enable: 0
Fan PWM Mode: [2, 'Dynamic']
Fan Target Speed (rpm): 2035
Current Fan Speed (rpm): 2035
Current Fan PWM (%): 19
Fan Speed Range (rpm): [1600, 6000]
Fan PWM Range (%): [0, 100]
##################################################
Current GPU Loading (%): 0
Current Memory Loading (%): 1
Current GTT Memory Usage (%): 0.603
Current GTT Memory Used (GB): 0.024
Total GTT Memory (GB): 4.000
Current VRAM Usage (%): 0.895
Current VRAM Used (GB): 0.036
Total VRAM (GB): 4.000
Current Temps (C): {'edge': 25.0}
Critical Temps (C): {'edge': 99.0}
Current Voltages (V): {'vddgfx': 718}
Current Clk Frequencies (MHz): {'mclk': 300.0, 'sclk': 214.0}
Current SCLK P-State: [0, '214Mhz']
Current MCLK P-State: [0, '300Mhz']
Power Profile Mode: 1-3D_FULL_SCREEN
Power DPM Force Performance Level: auto
Can you check if the file pp_od_clk_voltage exists in the card path directory? Just want to verify if there are other issues in writing to the card. This is the driver file that is written to for under/overclocking the GPU. In older cards, I expect writing is not supported and the file doesn't exist.
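If it is easier than checking by hand, a small sketch like the following reports whether pp_od_clk_voltage exists and is writable by the current user. The helper name od_file_status is hypothetical, and the card path passed at the end is the one from the gpu-ls output above.

```python
import os


def od_file_status(card_path):
    """Report whether pp_od_clk_voltage exists and is writable in card_path."""
    path = os.path.join(card_path, "pp_od_clk_voltage")
    return {
        "path": path,
        "exists": os.path.isfile(path),
        # os.access returns False for a missing file, so this is safe to call.
        "writable": os.access(path, os.W_OK),
    }


print(od_file_status("/sys/class/drm/card0/device"))
```

Note that the writable check reflects the permissions of the user running the script, so the result may differ between a normal user and root.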
3.6.3 released with this fix.
I have a Radeon Pro WX 4100 in an ARM64 machine (Honeycomb), where the slots are registered as separate PCIe domains.
The Ubuntu Impish package, as well as the one from your Debian repo, can't find my GPU.
The debug output is this:
The PCI bus addresses look like this:
The pattern for PCI bus IDs seems to look for just bb:dd.f, not xxxx:bb:dd.f.
If I naively edit the PCI_ADD pattern to add the domain section, I get this instead:
Debug output:
My /sys/devices/ looks like this: