Closed Ricks-Lab closed 4 years ago
@csecht Branch v3.0 now has a functional amdgpu-ls. All other utilities are not functional. Let me know your thoughts on the update format. Key changes:
I have RX 570 cards, so I guess that's why I got these errors:
~/amdgpu-utils-3.0$ ./amdgpu-ls
AMD Wattman features enabled: 0xffff7fff
amdgpu version: 19.50-967956
3 detected GPUs, 2 are AMD, 2 may be compatible, checking...
Error: HW file does not exist: /sys/class/drm/card1/device/unique_id
Error reading parameter: unique_id, disabling for this GPU
Error: HW file does not exist: /sys/class/drm/card2/device/unique_id
Error reading parameter: unique_id, disabling for this GPU
2 confirmed compatible.
These aren't so much errors as notices, no?
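To illustrate the notice-versus-error distinction, here is a minimal sketch of reading an optional sysfs file. The helper name `read_optional_param` and the message wording are hypothetical, not taken from amdgpu-utils, and a temporary directory stands in for `/sys/class/drm/cardN/device`:

```python
import os
import tempfile
from typing import Optional


def read_optional_param(card_path: str, name: str, log=print) -> Optional[str]:
    """Return the stripped contents of card_path/name, or None if it is absent.

    Hypothetical helper: a missing optional file is reported as a notice.
    """
    file_path = os.path.join(card_path, name)
    if not os.path.isfile(file_path):
        log(f'Notice: {name} is not provided by this GPU, skipping')
        return None
    with open(file_path) as fptr:
        return fptr.read().strip()


# A temporary directory stands in for /sys/class/drm/card1/device.
with tempfile.TemporaryDirectory() as fake_card:
    with open(os.path.join(fake_card, 'vbios_version'), 'w') as fptr:
        fptr.write('113-57045EHD1-M90\n')
    vbios = read_optional_param(fake_card, 'vbios_version')
    unique_id = read_optional_param(fake_card, 'unique_id')
```

With this shape, a card that lacks `unique_id` (as the RX 570 does here) produces one notice and a `None` value instead of an error line.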
The reporting of a CPU with on-board GPU is a nice feature.
The amdgpu-ls information given is good. Below is an example for a perhaps more logical grouping (to my sensibilities). To make it clear what is switched around, the line numbers for each card are the original order. For example, I moved the Card Number line to the beginning of a card's report. I used slashes to visually separate the different categories, which I think could help users digest the data better, but the separators and groupings can be whatever works best:
4 Card Number: 0
1 Vendor: INTEL
2 amdgpu-utils Compatibility: NO
3 Card Model: Intel Corporation 8th Gen Core Processor Gaussian Mixture Model
5 PCIe ID: 00:02.0
6 Driver: i915
7 Card Path: /sys/class/drm/card0/device
9 Card Number: 1
1 Vendor: AMD
2 amdgpu-utils Compatibility: Yes
3 GPU UID:
4 Device ID: {'vendor': '0x1002', 'device': '0x67df', 'subsystem_vendor': '0x1682', 'subsystem_device': '0xc570'}
6 Decoded Device ID: Radeon RX 570
7 Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
8 Display Card Model: Radeon RX 570
10 PCIe ID: 01:00.0
35 Link Speed: 8 GT/s
36 Link Width: 8
///////////////
11 Driver: amdgpu
12 vBIOS Version: 113-57045EHD1-M90
14 Card Path: /sys/class/drm/card1/device
13 HWmon: /sys/class/drm/card1/device/hwmon/hwmon3
///////////////
15 Current Power (W): 75.01299999999999
16 Power Cap (W): None
17 Power Cap Range (W): [0, 180]
18 Fan Enable: 1
19 Fan PWM Mode: [1, 'Manual']
20 Current Fan PWM (%): 45
24 Fan PWM Range (%): [0, 100]
21 Current Fan Speed (rpm): 1802
22 Fan Target Speed (rpm): 1802
23 Fan Speed Range (rpm): [0, 3800]
/////////////////
5 GPU Frequency/Voltage Control Type: 1
27 Current Voltages (V): {'vddgfx': 0.906}
28 Vddc Range: ['750mV', '1150mV']
29 Current Loading (%): 100
30 Current Clk Frequencies (MHz): {'sclk': 1071.0, 'mclk': 1850.0}
31 Current SCLK P-State: [6, '1071Mhz']
32 SCLK Range: ['300MHz', '2000MHz']
33 Current MCLK P-State: [2, '1850Mhz']
34 MCLK Range: ['300MHz', '2250MHz']
37 Power Performance Mode: 5-COMPUTE
38 Power Force Performance Level: manual
/////////////////
25 Current Temps (C): {'edge': 75.0}
26 Critical Temp (C): 94.0
@csecht Thanks for the recommendation! I have pushed a modification which also uses indents for clarity. I still have not included clinfo details, but am considering a subset of what was previously displayed.
Hmmmm. Something broke with that mod:
(amdgpu-utils-env) craig@craig-Linux2:~/amdgpu-utils-3.0$ ./amdgpu-ls
amdgpu version: 19.50-967956
AMD Wattman features enabled: 0xffff7fff
3 detected GPUs, 2 are AMD, 3 may be compatible, checking...
Error: HW file does not exist: /sys/class/drm/card0/device/unique_id
Error: HW file does not exist: /sys/class/drm/card0/device/vbios_version
Warning: Error reading parameter: vbios, disabling for this GPU
Error: HW file does not exist: /sys/class/drm/card0/device/gpu_busy_percent
Warning: Error reading parameter: loading, disabling for this GPU
Error: HW file does not exist: /sys/class/drm/card0/device/pp_dpm_sclk
Warning: Error reading parameter: sclk_ps, disabling for this GPU
Error: HW file does not exist: /sys/class/drm/card0/device/pp_dpm_mclk
Warning: Error reading parameter: mclk_ps, disabling for this GPU
Error: HW file does not exist: /sys/class/drm/card0/device/pp_power_profile_mode
Warning: Error reading parameter: ppm, disabling for this GPU
Error: HW file does not exist: /sys/class/drm/card0/device/power_dpm_force_performance_level
Warning: Error reading parameter: power_dpm_force, disabling for this GPU
Traceback (most recent call last):
File "./amdgpu-ls", line 127, in <module>
main()
File "./amdgpu-ls", line 103, in main
gpu_list.read_gpu_sensor_data(data_type='All')
File "/home/craig/amdgpu-utils-3.0/GPUmodules/GPUmodule.py", line 1397, in read_gpu_sensor_data
v.read_gpu_sensor_data(data_type)
File "/home/craig/amdgpu-utils-3.0/GPUmodules/GPUmodule.py", line 882, in read_gpu_sensor_data
rdata = self.read_gpu_sensor(param, sensor_type=sensor_type)
File "/home/craig/amdgpu-utils-3.0/GPUmodules/GPUmodule.py", line 779, in read_gpu_sensor
file_path = os.path.join(sensor_path, sensor_file)
File "/usr/lib/python3.6/posixpath.py", line 80, in join
a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
It looks like it is trying to read sensors for the Intel card. I am not sure why as I was not calling for cards that were not compatible. I have pushed some modifications, but still don't see why it was occurring.
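The TypeError at the end of the traceback is what `os.path.join` raises when its first argument is `None`. A minimal sketch of a guard follows, assuming the sensor path is simply unset for cards without sensors; the function name `sensor_file_path` and the file name `power1_average` are used only as examples and this is not the actual GPUmodule code:

```python
import os
from typing import Optional


def sensor_file_path(sensor_path: Optional[str], sensor_file: str) -> Optional[str]:
    """Join the two parts, or return None when the GPU has no sensor path."""
    if sensor_path is None:
        # Assumed case: a card for which no hwmon/device path was recorded.
        return None
    return os.path.join(sensor_path, sensor_file)


# The unguarded call reproduces the TypeError seen in the traceback.
try:
    os.path.join(None, 'power1_average')
    unguarded = 'no error'
except TypeError:
    unguarded = 'TypeError'

guarded = sensor_file_path(None, 'power1_average')
joined = sensor_file_path('/sys/class/drm/card1/device/hwmon/hwmon3',
                          'power1_average')
```

A guard like this only hides the symptom; the root cause is still that sensors are being read for a card that should have been skipped.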
Here is the output format for Vega20 which demonstrates some of the benefits of the new sensor reading details:
rick@nexon:~/pydev/amdgpu-utils$ ./amdgpu-ls
rocm version: 3.0.6
AMD Wattman features enabled: 0xfffd7fff
2 detected GPUs, 1 are AMD, 1 may be compatible, checking...
1 confirmed compatible.
Card Number: 1
Vendor: AMD
amdgpu-utils Compatibility: True
Readable: True
Writeable: True
GPU UID: a5e4788172dc768b
Device ID: {'vendor': '0x1002', 'device': '0x66af', 'subsystem_vendor': '0x1458', 'subsystem_device': '0x1000'}
Decoded Device ID: Vega 20
Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev c1)
Display Card Model: Vega 20
PCIe ID: 43:00.0
Link Speed: 8 GT/s
Link Width: 16
##################################################
Driver: amdgpu
vBIOS Version: 113-D3600200-106
Compute Platform: OpenCL 2.0
GPU Frequency/Voltage Control Type: 2
HWmon: /sys/class/drm/card1/device/hwmon/hwmon2
Card Path: /sys/class/drm/card1/device
##################################################
Current Power (W): 26.0
Power Cap (W): None
Power Cap Range (W): [0, 300]
Fan Enable: 0
Fan PWM Mode: [2, 'Dynamic']
Current Fan PWM (%): 0
Current Fan Speed (rpm): 0
Fan Target Speed (rpm): 0
Fan Speed Range (rpm): [0, 3850]
Fan PWM Range (%): [0, 100]
##################################################
Current Loading (%): 0
Current Temps (C): {'mem': 28.0, 'edge': 30.0, 'junction': 32.0}
Critical Temp (C): 100.0
Current Voltages (V): {'vddgfx': 0.737}
Current Clk Frequencies (MHz): {'sclk': 699.0, 'mclk': 800.0}
Current SCLK P-State: [0, '700Mhz']
SCLK Range: ['808Mhz', '2200Mhz']
Current MCLK P-State: [1, '800Mhz']
MCLK Range: ['800Mhz', '1200Mhz']
Power Performance Mode: 5-COMPUTE
Power Force Performance Level: auto
Card Number: 0
Vendor: ASPEED
amdgpu-utils Compatibility: False
Card Model: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
PCIe ID: c4:00.0
Driver: ast
Card Path: /sys/class/drm/card0/device
I downloaded the latest mods and it's working now. It's looking good. I like the new format, which is informative and logical. In the run parameters sections, what about another level of indenting for non-variables as a way to focus attention on amdgpu-utils variables? (see below)
(amdgpu-utils-env) craig@craig-Linux2:~/amdgpu-utils-3.0$ ./amdgpu-ls
amdgpu version: 19.50-967956
AMD Wattman features enabled: 0xffff7fff
3 detected GPUs, 2 are AMD, 3 may be compatible, checking...
3 confirmed compatible.
Error getting p-states: /sys/class/drm/card0/device/pp_od_clk_voltage
Card Number: 0
Vendor: INTEL
amdgpu-utils Compatibility: False
Card Model: Intel Corporation 8th Gen Core Processor Gaussian Mixture Model
PCIe ID: 00:02.0
Driver: i915
Card Path: /sys/class/drm/card0/device
Card Number: 1
Vendor: AMD
amdgpu-utils Compatibility: True
Readable: True
Writeable: True
GPU UID:
Device ID: {'vendor': '0x1002', 'device': '0x67df', 'subsystem_vendor': '0x1682', 'subsystem_device': '0xc570'}
Decoded Device ID: Radeon RX 570
Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
Display Card Model: Radeon RX 570
PCIe ID: 01:00.0
Link Speed: 8 GT/s
Link Width: 8
##################################################
Driver: amdgpu
vBIOS Version: 113-57045EHD1-M90
Compute Platform: OpenCL 1.2 AMD-APP (3004.6)
GPU Frequency/Voltage Control Type: 1
HWmon: /sys/class/drm/card1/device/hwmon/hwmon3
Card Path: /sys/class/drm/card1/device
##################################################
Current Power (W): 75.2
Power Cap (W): None
Power Cap Range (W): [0, 180]
Fan Enable: 1
Fan PWM Mode: [1, 'Manual']
Current Fan PWM (%): 46
Current Fan Speed (rpm): 1840
Fan Target Speed (rpm): 1840
Fan Speed Range (rpm): [0, 3800]
Fan PWM Range (%): [0, 100]
##################################################
Current GPU Loading (%): 100
Current Memory Loading (%): 71
Current Temps (C): {'edge': 72.0}
Critical Temp (C): 94.0
Current Voltages (V): {'vddgfx': 0.906}
Vddc Range: ['750mV', '1150mV']
Current Clk Frequencies (MHz): {'sclk': 1071.0, 'mclk': 1850.0}
Current SCLK P-State: [6, '1071Mhz']
SCLK Range: ['300MHz', '2000MHz']
Current MCLK P-State: [2, '1850Mhz']
MCLK Range: ['300MHz', '2250MHz']
Power Performance Mode: 5-COMPUTE
Power Force Performance Level: manual
Card Number: 2
Vendor: AMD
amdgpu-utils Compatibility: True
Readable: True
Writeable: True
GPU UID:
Device ID: {'vendor': '0x1002', 'device': '0x67df', 'subsystem_vendor': '0x1682', 'subsystem_device': '0xc570'}
Decoded Device ID: Radeon RX 570
Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
Display Card Model: Radeon RX 570
PCIe ID: 02:00.0
Link Speed: 8 GT/s
Link Width: 8
##################################################
Driver: amdgpu
vBIOS Version: 113-57045EHD1-M90
Compute Platform: OpenCL 1.2 AMD-APP (3004.6)
GPU Frequency/Voltage Control Type: 1
HWmon: /sys/class/drm/card2/device/hwmon/hwmon4
Card Path: /sys/class/drm/card2/device
##################################################
Current Power (W): 82.1
Power Cap (W): None
Power Cap Range (W): [0, 180]
Fan Enable: 1
Fan PWM Mode: [1, 'Manual']
Current Fan PWM (%): 44
Current Fan Speed (rpm): 1723
Fan Target Speed (rpm): 1723
Fan Speed Range (rpm): [0, 3800]
Fan PWM Range (%): [0, 100]
##################################################
Current GPU Loading (%): 100
Current Memory Loading (%): 26
Current Temps (C): {'edge': 69.0}
Critical Temp (C): 94.0
Current Voltages (V): {'vddgfx': 0.906}
Vddc Range: ['750mV', '1150mV']
Current Clk Frequencies (MHz): {'sclk': 1071.0, 'mclk': 1850.0}
Current SCLK P-State: [6, '1071Mhz']
SCLK Range: ['300MHz', '2000MHz']
Current MCLK P-State: [2, '1850Mhz']
MCLK Range: ['300MHz', '2250MHz']
Power Performance Mode: 5-COMPUTE
Power Force Performance Level: manual
I just noticed that -ls lists 3 confirmed compatible GPUs, but the Intel integrated GPU (card 0) is subsequently listed as "amdgpu-utils Compatibility: False". That's confusing.
This problem is related to the previous error you reported. For some reason, the embedded Intel GPU is sometimes considered compatible. I am not seeing that on my system, but my embedded ASPEED GPU is not first on the list. I need to find the root cause of this before I move forward on the rewrite. I am probably going to get rid of the compatible flag and use only the readable and writable flags. This will take some time...
I have completed the elimination of the compatibility flag and am now using only readable and writable status. I also updated the output format per your suggestion. Looks good! Let me know of any issues.
Yes, nice, that makes sense. But now the voltage and clock speed range values are missing from the output (only showing an example of that for card 1, but it is an issue for both cards):
:~/amdgpu-utils-3.0$ ./amdgpu-ls
Detected GPUs: INTEL: 1, AMD: 2
AMD amdgpu version: 19.50-967956
AMD Wattman features enabled: 0xffff7fff
3 detected GPUs, 2 may be readable, 2 may be writable, checking...
3 total GPUs, 2 readable, 2 writable.
Card Number: 0
Vendor: INTEL
Readable: False
Writable: False
Card Model: Intel Corporation 8th Gen Core Processor Gaussian Mixture Model
PCIe ID: 00:02.0
Driver: i915
Card Path: /sys/class/drm/card0/device
Card Number: 1
Vendor: AMD
Readable: True
Writable: True
GPU UID:
Device ID: {'vendor': '0x1002', 'device': '0x67df', 'subsystem_vendor': '0x1682', 'subsystem_device': '0xc570'}
Decoded Device ID: Radeon RX 570
Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
Display Card Model: Radeon RX 570
PCIe ID: 01:00.0
Link Speed: 8 GT/s
Link Width: 8
##################################################
Driver: amdgpu
vBIOS Version: 113-57045EHD1-M90
Compute Platform: OpenCL 1.2 AMD-APP (3004.6)
GPU Frequency/Voltage Control Type: 0
HWmon: /sys/class/drm/card1/device/hwmon/hwmon3
Card Path: /sys/class/drm/card1/device
##################################################
Current Power (W): 74.2
Power Cap (W): None
Power Cap Range (W): [0, 180]
Fan Enable: 1
Fan PWM Mode: [1, 'Manual']
Fan Target Speed (rpm): 1843
Current Fan Speed (rpm): 1843
Current Fan PWM (%): 46
Fan Speed Range (rpm): [0, 3800]
Fan PWM Range (%): [0, 100]
##################################################
Current GPU Loading (%): 100
Current Memory Loading (%): 84
Current Temps (C): {'edge': 71.0}
Critical Temp (C): 94.0
Current Voltages (V): {'vddgfx': 0.906}
Vddc Range: ['', '']
Current Clk Frequencies (MHz): {'sclk': 1071.0, 'mclk': 1850.0}
Current SCLK P-State: [6, '1071Mhz']
SCLK Range: ['', '']
Current MCLK P-State: [2, '1850Mhz']
MCLK Range: ['', '']
Power Performance Mode: 5-COMPUTE
Power Force Performance Level: manual
Fixed. I moved the read p-state function not realizing it read the ranges. Should be good now. Once I am convinced of it, I will start work on monitor.
I just pushed the fix.
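For context on why moving the p-state read emptied the ranges: on these cards both are expected to come from the same file, `pp_od_clk_voltage`. Below is a minimal sketch of pulling only the range section out of that file's text. The layout of `SAMPLE` is an assumption based on the amdgpu driver's documented format, the single p-state row is made up for illustration, and `parse_od_ranges` is a hypothetical name, not the function in GPUmodule.py:

```python
from typing import Dict, List

# Assumed layout of pp_od_clk_voltage; range values match the output above,
# the p-state row under OD_SCLK is illustrative only.
SAMPLE = """\
OD_SCLK:
0:        300MHz        750mV
OD_MCLK:
0:        300MHz        750mV
OD_RANGE:
SCLK:     300MHz       2000MHz
MCLK:     300MHz       2250MHz
VDDC:     750mV        1150mV
"""


def parse_od_ranges(text: str) -> Dict[str, List[str]]:
    """Return {'SCLK': [min, max], ...} from the OD_RANGE section."""
    ranges: Dict[str, List[str]] = {}
    in_range = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == 'OD_RANGE:':
            in_range = True
            continue
        if not in_range or not line:
            continue
        if line.startswith('OD_'):
            # A later section header ends the range section.
            in_range = False
            continue
        fields = line.split()
        if len(fields) == 3:
            ranges[fields[0].rstrip(':')] = fields[1:]
    return ranges


od_ranges = parse_od_ranges(SAMPLE)
```

If the range section is never parsed, the result is an empty mapping, which would be consistent with the `['', '']` values shown in the output above.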
I have ls, monitor, and plot working. I need to improve the messaging on how many cards are readable and writable, since it is confusing now. I added a --table option to ls.
Not sure how complex getting pac to work will be. Hope to work on it this weekend.
That did the trick. It looks good. I see there is a new read-write summary format:
$ ./amdgpu-ls
Detected GPUs: INTEL: 1, AMD: 2
AMD: amdgpu version: 19.50-967956
AMD: Wattman features enabled: 0xffff7fff
3 detected GPUs, 2 may be rw, 0 may be r-only, 0 may be w-only, checking...
3 total GPUs, 2 rw, 0 r-only, 0 w-only.
For ./amdgpu-ls --clinfo, however, the last set of output lines for each GPU has no values:
Device Name:
Device Version:
Driver Version:
Device OpenCL C Version:
Max Compute Units:
SIMD per CU:
SIMD Width:
SIMD Instruction Width:
CL Max Memory Allocation:
Max Work Item Dimensions:
Max Work Item Sizes:
Max Work Group Size:
Preferred Work Group Multiple:
...but the missing data are provided by clinfo (output is shown for only one of two GPUs):
$ clinfo
Number of platforms 1
Platform Name AMD Accelerated Parallel Processing
Platform Vendor Advanced Micro Devices, Inc.
Platform Version OpenCL 2.1 AMD-APP (3004.6)
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Host timer resolution 1ns
Platform Extensions function suffix AMD
Platform Name AMD Accelerated Parallel Processing
Number of devices 2
Device Name Ellesmere
Device Vendor Advanced Micro Devices, Inc.
Device Vendor ID 0x1002
Device Version OpenCL 1.2 AMD-APP (3004.6)
Driver Version 3004.6
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Board Name (AMD) Radeon RX 570 Series
Device Topology (AMD) PCI-E, 01:00.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 32
SIMD per compute unit (AMD) 4
SIMD width (AMD) 16
SIMD instruction width (AMD) 1
Max clock frequency 1100MHz
Graphics IP (AMD) 8.0
Device Partition (core)
Max number of sub-devices 32
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x1024
Max work group size 256
Preferred work group size (AMD) 256
Max work group size (AMD) 1024
Preferred work group size multiple 64
Wavefront width (AMD) 64
Preferred / native vector sizes
char 4 / 4
short 2 / 2
int 1 / 1
long 1 / 1
half 1 / 1 (cl_khr_fp16)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs No
Round to nearest No
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 2098307072 (1.954GiB)
Global free memory (AMD) 2029544 (1.936GiB)
Global memory channels (AMD) 8
Global memory banks per channel (AMD) 16
Global memory bank width (AMD) 256 bytes
Error Correction support No
Max memory allocation 1563635302 (1.456GiB)
Unified memory for Host and Device No
Minimum alignment for any data type 128 bytes
Alignment of base address 2048 bits (256 bytes)
Global Memory cache type Read/Write
Global Memory cache size 16384 (16KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 256 bytes
Pitch alignment for 2D image buffers 256 pixels
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Local
Local memory size 32768 (32KiB)
Local memory syze per CU (AMD) 65536 (64KiB)
Local memory banks (AMD) 32
Max number of constant args 8
Max constant buffer size 1563635302 (1.456GiB)
Preferred constant buffer size (AMD) 16384 (16KiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 1ns
Profiling timer offset since Epoch (AMD) 1580764462579627264ns (Mon Feb 3 15:14:22 2020)
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Thread trace supported (AMD) Yes
Number of async queues (AMD) 2
Max real-time compute queues (AMD) 0
Max real-time compute units (AMD) 0
SPIR versions 1.2
printf() buffer size 4194304 (4MiB)
Built-in kernels
Device Extensions cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_amd_bus_addressable_memory cl_khr_spir cl_khr_gl_event
current version:
$ clinfo -v
clinfo version 2.2.18.03.26
And yes, the amdgpu-ls --table option works, showing a snapshot of the amdgpu-monitor table. However, monitor is not working:
(amdgpu-utils-env) craig@craig-Linux2:~/amdgpu-utils-3.0$ ./amdgpu-monitor
Detected GPUs: INTEL: 1, AMD: 2
AMD: amdgpu version: 19.50-967956
AMD: Wattman features enabled: 0xffff7fff
Traceback (most recent call last):
File "./amdgpu-monitor", line 388, in <module>
main()
File "./amdgpu-monitor", line 315, in main
num_gpus['readable'],
KeyError: 'readable'
...nor is -plot:
(amdgpu-utils-env) craig@craig-Linux2:~/amdgpu-utils-3.0$ ./amdgpu-plot
Detected GPUs: INTEL: 1, AMD: 2
AMD: amdgpu version: 19.50-967956
AMD: Wattman features enabled: 0xffff7fff
Traceback (most recent call last):
File "./amdgpu-plot", line 891, in <module>
main()
File "./amdgpu-plot", line 843, in main
num_gpus['readable'],
KeyError: 'readable'
The KeyError was related to the changes in the read/write summary statement. I had not made the changes in plot and monitor. I have now completed the implementation with some optimization.
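A minimal sketch of how one shared counting helper could keep ls, monitor, and plot in agreement on the summary keys. The function names and the dictionary layout are assumptions for illustration; only the summary wording is taken from the output in this thread:

```python
from typing import Dict, Iterable, Tuple


def count_gpus(flags: Iterable[Tuple[bool, bool]]) -> Dict[str, int]:
    """Count GPUs by access category from (readable, writable) pairs."""
    counts = {'total': 0, 'rw': 0, 'r-only': 0, 'w-only': 0}
    for readable, writable in flags:
        counts['total'] += 1
        if readable and writable:
            counts['rw'] += 1
        elif readable:
            counts['r-only'] += 1
        elif writable:
            counts['w-only'] += 1
    return counts


def summary_line(counts: Dict[str, int]) -> str:
    """Format the counts in the style of the amdgpu-ls summary line."""
    return (f"{counts['total']} total GPUs, {counts['rw']} rw, "
            f"{counts['r-only']} r-only, {counts['w-only']} w-only")


# One Intel card (neither readable nor writable) and two AMD cards (rw).
num_gpus = count_gpus([(False, False), (True, True), (True, True)])
line = summary_line(num_gpus)
```

If every utility calls the same helper, renaming a key such as `'readable'` can no longer break only some of the callers.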
clinfo will need some work. I may include fewer parameters in the update. Here is a link to clinfo output for NV.
Yes, monitor and plot are now working, however, both output this warning to the terminal on every refresh/update cycle:
Warning: Invalid or disabled parameter: unique_id
With amdgpu-monitor, the warning just flashes on the terminal screen and is barely readable; with amdgpu-monitor --gui and amdgpu-plot, it eventually fills the terminal window. It does not appear with amdgpu-ls --table.
Just pushed a fix that suppresses warnings for disabled sensors unless used with --debug. It still shows a warning when a sensor is initially disabled.
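The behaviour described above (warn once when a sensor is disabled, stay quiet afterwards unless debugging) could be sketched roughly as follows. The class `SensorWarnings` and its methods are hypothetical; only the two message texts are taken from the output in this thread:

```python
from typing import Set


class SensorWarnings:
    """Track disabled parameters and limit how often they are reported."""

    def __init__(self, debug: bool = False, log=print):
        self.debug = debug
        self.log = log
        self._disabled: Set[str] = set()

    def disable(self, param: str) -> None:
        """Record a failed read; this first warning is always shown."""
        self._disabled.add(param)
        self.log(f'Warning: Error reading parameter: {param}, '
                 'disabling for this GPU')

    def is_usable(self, param: str) -> bool:
        """Return False for disabled parameters, warning only in debug mode."""
        if param in self._disabled:
            if self.debug:
                self.log(f'Warning: Invalid or disabled parameter: {param}')
            return False
        return True


messages = []
warnings_quiet = SensorWarnings(debug=False, log=messages.append)
warnings_quiet.disable('unique_id')
# Three refresh cycles: no additional output without debug.
results = [warnings_quiet.is_usable('unique_id') for _ in range(3)]
```

With `debug=True` the same three refresh cycles would add three more lines, which is the flood that monitor and plot were printing before.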
Monitor, plot, and ls are all running smoothly. Looks good!
Read and print clinfo complete for AMD. PAC util is functional but untested.
I have fixed some issues with pac and have tested on my Vega20 system. It works, but still needs thorough testing.
I'm getting this error with --clinfo:
~/amdgpu-utils-3.0$ ./amdgpu-ls --clinfo
Traceback (most recent call last):
File "./amdgpu-ls", line 141, in <module>
main()
File "./amdgpu-ls", line 88, in main
gpu_list.set_gpu_list()
File "/home/craig/amdgpu-utils-3.0/GPUmodules/GPUmodule.py", line 1179, in set_gpu_list
opencl_device_version = self.opencl_map[pcie_id]['device_version']
TypeError: list indices must be integers or slices, not str
...and this with -pac:
~/amdgpu-utils-3.0$ ./amdgpu-pac
Traceback (most recent call last):
File "./amdgpu-pac", line 1594, in <module>
main()
File "./amdgpu-pac", line 1545, in main
gpu_list.set_gpu_list()
File "/home/craig/amdgpu-utils-3.0/GPUmodules/GPUmodule.py", line 1179, in set_gpu_list
opencl_device_version = self.opencl_map[pcie_id]['device_version']
TypeError: list indices must be integers or slices, not str
(edit)...and actually the same TypeError with amdgpu-ls.
I have found and fixed the issue. Let me know of any other observations.
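For anyone reading later, the TypeError above is what Python raises when a list is indexed with a string such as a PCIe ID. The sketch below reproduces it and shows one way to build a mapping keyed by PCIe ID instead. The data layout is an assumption for illustration, and this is not necessarily the fix that was pushed:

```python
from typing import Dict, List

# Assumed shape of the per-device OpenCL data; values are from this thread.
devices: List[Dict[str, str]] = [
    {'pcie_id': '01:00.0', 'device_version': 'OpenCL 1.2 AMD-APP (3004.6)'},
    {'pcie_id': '02:00.0', 'device_version': 'OpenCL 1.2 AMD-APP (3004.6)'},
]

# Indexing the list with a PCIe ID string reproduces the reported error.
try:
    devices['01:00.0']
    error_text = ''
except TypeError as err:
    error_text = str(err)

# Keying a dict by PCIe ID makes the lookup valid, and .get() tolerates
# cards (such as the Intel GPU) that have no OpenCL entry at all.
opencl_map = {dev['pcie_id']: dev for dev in devices}
amd_version = opencl_map.get('01:00.0', {}).get('device_version')
intel_version = opencl_map.get('00:02.0', {}).get('device_version')
```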
Do you have any NV GPUs? Just want to make sure the issues impacting your Intel GPU are also resolved for NV.
Nice. Most everything is working smoothly. The --clinfo option has all fields filled only for Card 2; Card 1 has missing data in the OpenCL spec section (below). The format looks good though.
Nope, no NV cards here.
craig@craig-Linux2:~/amdgpu-utils-3.0$ ./amdgpu-ls --clinfo
Detected GPUs: INTEL: 1, AMD: 2
AMD: amdgpu version: 19.50-967956
AMD: Wattman features enabled: 0xffff7fff
3 total GPUs, 2 rw, 0 r-only, 0 w-only
Card Number: 0
Vendor: INTEL
Readable: False
Writable: False
Compute: False
Card Model: Intel Corporation 8th Gen Core Processor Gaussian Mixture Model
PCIe ID: 00:02.0
Driver: i915
Card Path: /sys/class/drm/card0/device
Card Number: 1
Vendor: AMD
Readable: True
Writable: True
Compute: True
GPU UID:
Device ID: {'vendor': '0x1002', 'device': '0x67df', 'subsystem_vendor': '0x1682', 'subsystem_device': '0xc570'}
Decoded Device ID: Radeon RX 570
Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
Display Card Model: Radeon RX 570
PCIe ID: 01:00.0
Link Speed: 8 GT/s
Link Width: 8
##################################################
Driver: amdgpu
vBIOS Version: 113-57045EHD1-M90
Compute Platform: OpenCL 1.2 AMD-APP (3004.6)
GPU Frequency/Voltage Control Type: 1
HWmon: /sys/class/drm/card1/device/hwmon/hwmon3
Card Path: /sys/class/drm/card1/device
##################################################
Current Power (W): 45.2
Power Cap (W): 120.0
Power Cap Range (W): [0, 180]
Fan Enable: 1
Fan PWM Mode: [1, 'Manual']
Fan Target Speed (rpm): 1177
Current Fan Speed (rpm): 1177
Current Fan PWM (%): 36
Fan Speed Range (rpm): [0, 3800]
Fan PWM Range (%): [0, 100]
##################################################
Current GPU Loading (%): 39
Current Memory Loading (%): 23
Current Temps (C): {'edge': 54.0}
Critical Temp (C): 94.0
Current Voltages (V): {'vddgfx': 912}
Vddc Range: ['750mV', '1150mV']
Current Clk Frequencies (MHz): {'sclk': 1068.09, 'mclk': 1850.0}
Current SCLK P-State: [6, '1071Mhz']
SCLK Range: ['300MHz', '2000MHz']
Current MCLK P-State: [2, '1850Mhz']
MCLK Range: ['300MHz', '2250MHz']
Power Performance Mode: 5-COMPUTE
Power Force Performance Level: manual
##################################################
Device Name: Ellesmere
Device Version: OpenCL 1.2 AMD-APP (3004.6)
Driver Version:
Device OpenCL C Version:
Max Compute Units:
SIMD per CU:
SIMD Width:
SIMD Instruction Width:
CL Max Memory Allocation:
Max Work Item Dimensions:
Max Work Item Sizes:
Max Work Group Size:
Preferred Work Group Multiple:
Card Number: 2
Vendor: AMD
Readable: True
Writable: True
Compute: True
GPU UID:
Device ID: {'vendor': '0x1002', 'device': '0x67df', 'subsystem_vendor': '0x1682', 'subsystem_device': '0xc570'}
Decoded Device ID: Radeon RX 570
Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
Display Card Model: Radeon RX 570
PCIe ID: 02:00.0
Link Speed: 8 GT/s
Link Width: 8
##################################################
Driver: amdgpu
vBIOS Version: 113-57045EHD1-M90
Compute Platform: OpenCL 1.2 AMD-APP (3004.6)
GPU Frequency/Voltage Control Type: 1
HWmon: /sys/class/drm/card2/device/hwmon/hwmon4
Card Path: /sys/class/drm/card2/device
##################################################
Current Power (W): 48.1
Power Cap (W): 120.0
Power Cap Range (W): [0, 180]
Fan Enable: 1
Fan PWM Mode: [1, 'Manual']
Fan Target Speed (rpm): 1212
Current Fan Speed (rpm): 1212
Current Fan PWM (%): 36
Fan Speed Range (rpm): [0, 3800]
Fan PWM Range (%): [0, 100]
##################################################
Current GPU Loading (%): 25
Current Memory Loading (%): 12
Current Temps (C): {'edge': 57.0}
Critical Temp (C): 94.0
Current Voltages (V): {'vddgfx': 906}
Vddc Range: ['750mV', '1150mV']
Current Clk Frequencies (MHz): {'sclk': 1068.25, 'mclk': 1850.0}
Current SCLK P-State: [6, '1071Mhz']
SCLK Range: ['300MHz', '2000MHz']
Current MCLK P-State: [2, '1850Mhz']
MCLK Range: ['300MHz', '2250MHz']
Power Performance Mode: 5-COMPUTE
Power Force Performance Level: manual
##################################################
Device Name: Ellesmere
Device Version: OpenCL 1.2 AMD-APP (3004.6)
Driver Version: 3004.6
Device OpenCL C Version: OpenCL C 1.2
Max Compute Units: 32
SIMD per CU: 4
SIMD Width: 16
SIMD Instruction Width: 1
CL Max Memory Allocation: 1950176460
Max Work Item Dimensions: 3
Max Work Item Sizes: 1024 1024 1024
Max Work Group Size: 1024
Preferred Work Group Multiple: 64
For the output of the amdgpu-ls --clinfo option, what about omitting most of the data that is already provided by amdgpu-ls and adding a bit more OpenCL data? See example below. Doing that would make the --clinfo option function more like the other options that provide just option-specific output. Unfortunately, I don't know enough about OpenCL to say what clinfo data would be most helpful for amdgpu-utils users.
Card Number: 2
Vendor: AMD
GPU UID:
Display Card Model: Radeon RX 570
PCIe ID: 02:00.0
##################################################
Device Name: Ellesmere
Device Version: OpenCL 1.2 AMD-APP (3004.6)
Driver Version: 3004.6
Device OpenCL C Version: OpenCL C 1.2
Max Compute Units: 32
SIMD per CU: 4
SIMD Width: 16
SIMD Instruction Width: 1
CL Max Memory Allocation: 1950176460
Max Work Item Dimensions: 3
Max Work Item Sizes: 1024 1024 1024
Max Work Group Size: 1024
Preferred Work Group Multiple: 64
Half-precision Floating-point support (cl_khr_fp16)
Single-precision Floating-point support (core)
Double-precision Floating-point support (cl_khr_fp64)
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 256 bytes
Pitch alignment for 2D image buffers 256 pixels
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 128
Max number of write image args 8
I have optimized the code for reading from clinfo. This cleans up the code quite a bit and solves the missing data on the second card. For clinfo, my intent was to only include parameters relevant to setting command line arguments for the SETI apps. Not sure if Einstein apps also have command line arguments to optimize fft processing. Also, I wanted to limit parameters considered since there is no alignment in naming between AMD, NV, and INTEL.
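As a rough illustration of reading a small subset of clinfo text and grouping it per card, here is a minimal sketch. It is not the code in GPUmodule.py: the `WANTED` table, the key names, and `parse_clinfo` are hypothetical, and the sample text is trimmed from the clinfo output earlier in this thread (the second device block assumes the same values with PCIe ID 02:00.0):

```python
from typing import Dict, Optional

# clinfo label -> short key (hypothetical selection and key names).
WANTED = {
    'Device Name': 'device_name',
    'Device Version': 'device_version',
    'Driver Version': 'driver_version',
    'Device Topology (AMD)': 'pcie_id',
    'Max compute units': 'max_cu',
}

SAMPLE = """\
  Device Name                      Ellesmere
  Device Version                   OpenCL 1.2 AMD-APP (3004.6)
  Driver Version                   3004.6
  Device Topology (AMD)            PCI-E, 01:00.0
  Max compute units                32
  Device Name                      Ellesmere
  Device Version                   OpenCL 1.2 AMD-APP (3004.6)
  Driver Version                   3004.6
  Device Topology (AMD)            PCI-E, 02:00.0
  Max compute units                32
"""


def parse_clinfo(text: str) -> Dict[str, Dict[str, str]]:
    """Return {pcie_id: {key: value}} for the labels listed in WANTED."""
    result: Dict[str, Dict[str, str]] = {}
    current: Optional[Dict[str, str]] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        for label, key in WANTED.items():
            if not line.startswith(label + ' '):
                continue
            value = line[len(label):].strip()
            if key == 'device_name':
                current = {}           # 'Device Name' opens a new device block
            if current is None:
                break                  # ignore lines before the first device
            if key == 'pcie_id':
                value = value.split(',')[-1].strip()
                result[value] = current
            current[key] = value
            break
    return result


opencl_map = parse_clinfo(SAMPLE)
```

Because each device's dictionary is registered under its PCIe ID as soon as the topology line is seen, values that come later in the block still land on the right card.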
That makes sense. Thanks. I'll ask on the E@H forums whether FFT processing can be optimized by users.
Just made the last of planned changes and tested on Vega20, Type 2 GPU and changed status to Beta Release.
@csecht Let me know if you are able to test pac scenarios on your system. Thanks!
Yes, no problem. Is there a specific scenario in mind or should I just put it through its paces?
Changes include significant logic changes in how pac writes to cards. This is especially true for type 2 cards, which I have tested, but also affects type 1. I suggest just putting it through all possible use cases. I found it easiest to generate a bunch of write sh files and review them, and then run in execute mode with monitor running.
I have not yet implemented the warning message concerning changing fan control back to dynamic. Hope to get to it tonight.
Also, can you take a screenshot of your pac window and replace the type 1 pac example image in the user guide? The image from my 4 GPU system is harder to read.
beta testing of branch 3 using amdgpu version: 19.50-967956
PAC Power Cap, OK to manually set, reset, max, reset.
Fan PWM: initial state was Dynamic (pwm1_enable=2)
Failed initial attempt for manual set (enable=1). (see code snippet below)
OK for setting max (enable=0) and subsequent reset (into oscillating dynamic mode, enable=2).
Conclude that PAC fails to manually set PWM (enable=1) from the dynamic state (enable=2); can only set manual fan PWM from max state. Can reset (enable=2) from manual state, just not the reverse. Failure to manual set PWM from dynamic state occurs whether the dynamic state is set at boot (thermostatic) or set by the PAC (oscillating fan); pwm1_enable=2 in both conditions.
# Write Delta mode.
Batch file completed: /home/craig/amdgpu-utils-3.0/pac_writer_5442276dfbc14cdb9987ba537e95e92f.sh
Writing 1 changes to GPU /sys/class/drm/card1/device
+ sudo sh -c echo '119' > /sys/class/drm/card1/device/hwmon/hwmon3/pwm1
sh: echo: I/O error
PAC execution complete.
SCLK masking: OK.
PPM changes: OK.
(These cards don't allow overclocking or underclocking, so I can't test that.)
Yes, I'll replace the type 1 pac pic in the branch 3 guide.
Hmm, I completed a push request to delete the old type1.png, but don't have push access to upload the replacement file. Not sure whether I did that right. Here is the screenshot:
For the changes, you should create a new branch from v3.0, and replace the png file, commit, push and then do a pull request.
For over/under clocking, what is your current feature mask setting?
For the fan speed setting error, can you provide the details of the pac_writer file? It looks like there is a logic problem.
I'll work on the pull request tomorrow (about to go to bed).
AMD: Wattman features enabled: 0xffff7fff
pac_writer file snippet for going from manual fan PWM to "reset" (everything else in the sh script is commented out):
# PWM entry: reset, Resetting to default mode of dynamic
sudo sh -c "echo '2' > /sys/class/drm/card1/device/hwmon/hwmon3/pwm1_enable"
You might want to try a feature mask of 0xfffd7fff. I think there shouldn’t be a reason you can’t underclock.
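To see exactly what differs between the two masks mentioned here, a quick bit comparison helps. The sketch below only reports bit positions; it makes no claim about what each bit controls in the driver, and the helper name is made up for illustration.

```python
from typing import List


def cleared_bits(mask: int, width: int = 32) -> List[int]:
    """Return the positions of the bits that are 0 in a feature mask."""
    return [bit for bit in range(width) if not mask & (1 << bit)]


current = 0xFFFF7FFF    # mask reported by amdgpu-ls in this thread
suggested = 0xFFFD7FFF  # mask suggested above

# XOR leaves a 1 only where the two masks disagree.
print(hex(current ^ suggested))  # 0x20000
print(cleared_bits(current))     # [15]
print(cleared_bits(suggested))   # [15, 17]
```

So the suggested mask differs from the current one in a single bit (bit 17), which it turns off.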
Regarding the can't-get-there-from-here fan PWM problem: it sounds very similar to what someone said in that external discussion thread you linked over in the thermostatic issue, where they described a problem using the app fancontrol having to go through the max setting to get from a dynamic to manual mode.
Ah, cool. I'll try that feature mask tomorrow. Thanks.
I have added resets before setting manual or dynamic modes. Let me know if it makes a difference.
Yes, that worked. I can now switch to any PWM mode from any mode.
When switching from either manual or max to dynamic (pac reset), this is what goes to pac_writer:
# Write Delta mode.
Batch file completed: /home/craig/amdgpu-utils-3.0/pac_writer_e5825f00bf98447db5cc09cecdaf5b9d.sh
Writing 1 changes to GPU /sys/class/drm/card1/device
+ sudo sh -c echo '0' > /sys/class/drm/card1/device/hwmon/hwmon3/pwm1_enable
+ sudo sh -c echo '2' > /sys/class/drm/card1/device/hwmon/hwmon3/pwm1_enable
PAC execution complete
(I notice that in the Master, the same pac change doesn't include the pass through the 0 state.)
When going from dynamic to manual, it's this:
# Write Delta mode.
Batch file completed: /home/craig/amdgpu-utils-3.0/pac_writer_d7be7e240c4f417693fe70a47d895801.sh
Writing 1 changes to GPU /sys/class/drm/card1/device
+ sudo sh -c echo '1' > /sys/class/drm/card1/device/hwmon/hwmon3/pwm1_enable
+ sudo sh -c echo '117' > /sys/class/drm/card1/device/hwmon/hwmon3/pwm1
PAC execution complete.
That different feature mask worked to allow undervolting and underclocking (haven't tried overclocking). Something new to play with! I don't know how I missed that feature mask before in the user guide. Did it change or did I just mistype it?
I see that amdgpu-monitor now includes memory load %. That's nice. Are you going to add memory load graphing to amdgpu-plot?
I bungled the replacement of the pac_type1.png file. The GitHub instructions for creating a new branch didn't work because I never got a "Create branch" option from the branch select menu. I was able to create a new file off the 3.0 branch, but couldn't figure out how to use that to upload the image. Somehow I managed to upload the png in a new Master branch, which wasn't my intention. I haven't managed to delete the old png in the 3.0 branch. I'm putting my dunce cap on and going to find a corner to sit in...
I closed my first two bungled pull requests for that png replacement and opened another, this time from the 3.0 branch to pull into Master. I know, not what you wanted. I still haven't figured out how to open a new branch. Recalibrating.............
I just noticed that the amdgpu-monitor terminal window leaves this when the module is closed, which it didn't in prior versions:
^Cctrl_c_handler (ID: 2) has been caught. Setting quit flag...
Obviously doesn't affect function, just a cosmetic thing.
> That different feature mask worked to allow undervolting and underclocking (haven't tried overclocking). Something new to play with! I don't know how I missed that feature mask before in the user guide. Did it change or did I just mistype it?
It is related to this pull request from another user.
> I see that amdgpu-monitor now includes memory load %. That's nice. Are you going to add memory load graphing to amdgpu-plot?
I think plot is a bit too slow already. Until I find a way to make it faster, I will hold off on adding to it. I think there is a better way than concatenating dataframes. I will have to work on that in the future.
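One possible direction, offered only as a sketch and not as what amdgpu-plot does or will do: keep incoming samples in a fixed-length buffer of plain rows and extract columns (or build a DataFrame once) at draw time, rather than concatenating a DataFrame on every update. The class and field names here are hypothetical.

```python
from collections import deque


class RollingSamples:
    """Fixed-length buffer of sample rows; the oldest rows drop off automatically."""

    def __init__(self, max_rows: int) -> None:
        self._rows = deque(maxlen=max_rows)

    def add(self, row: dict) -> None:
        # Appending to a deque is O(1); no existing data is copied.
        self._rows.append(row)

    def column(self, name: str) -> list:
        """Return one column's values, oldest first, ready to hand to a plot."""
        return [row[name] for row in self._rows]


samples = RollingSamples(max_rows=3)
for i in range(5):
    samples.add({"time": i, "load": 10 * i})

print(samples.column("load"))  # [20, 30, 40]
```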
> I just noticed that the amdgpu-monitor terminal window leaves this when the module is closed, which it didn't in prior versions:
> ^Cctrl_c_handler (ID: 2) has been caught. Setting quit flag...
> Obviously doesn't affect function, just a cosmetic thing.
I have simplified the message unless in debug mode.
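A handler along these lines would give that behavior. This is a rough sketch, not the actual amdgpu-monitor code: the class name and the short non-debug message are assumptions, and only the verbose message text is copied from the report above.

```python
import signal


class QuitState:
    """Holds a quit flag that a Ctrl-C handler sets for the main loop to poll."""

    def __init__(self, debug: bool = False) -> None:
        self.debug = debug
        self.quit = False

    def handler(self, signum, frame) -> None:
        self.quit = True
        if self.debug:
            # Verbose form, as quoted in the report above.
            print(f"ctrl_c_handler (ID: {int(signum)}) has been caught. Setting quit flag...")
        else:
            # Hypothetical short form for normal runs.
            print("Quitting...")


state = QuitState(debug=False)
# In a real program: signal.signal(signal.SIGINT, state.handler)
# Called directly here so the example runs without pressing Ctrl-C.
state.handler(signal.SIGINT, None)
print(state.quit)
```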
I found an issue in the previous version you tested (a missing newline). I fixed this, and found that resetting before going to manual is not necessary. I am now only doing a reset before going back to dynamic. Is this working as expected now?
Yes, working as expected.
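The write ordering that ended up working can be summarized as in the sketch below. It is a reconstruction from the pac_writer output quoted in this thread, not the pac source: the function name is hypothetical, and the 0 to 255 PWM range is an assumption based on the usual hwmon convention.

```python
import os
from typing import List, Optional, Tuple


def fan_mode_writes(hwmon_dir: str, target: str,
                    pwm: Optional[int] = None) -> List[Tuple[str, str]]:
    """Return the ordered (file, value) writes needed to reach a fan mode."""
    enable = os.path.join(hwmon_dir, "pwm1_enable")
    if target == "dynamic":
        # Pass through 0 first, then select dynamic (2).
        return [(enable, "0"), (enable, "2")]
    if target == "manual":
        if pwm is None or not 0 <= pwm <= 255:  # assumed PWM range
            raise ValueError("manual mode needs a pwm value in 0..255")
        # No reset needed: select manual (1), then write the PWM value.
        return [(enable, "1"), (os.path.join(hwmon_dir, "pwm1"), str(pwm))]
    if target == "max":
        return [(enable, "0")]
    raise ValueError(f"unknown fan mode: {target}")


hwmon = "/sys/class/drm/card1/device/hwmon/hwmon3"
for path, value in fan_mode_writes(hwmon, "manual", pwm=117):
    # Same shape as the lines pac_writer emits in its sh file.
    print(f"sudo sh -c \"echo '{value}' > {path}\"")
```

Returning the writes as data instead of executing them keeps the ordering easy to review before anything touches the card, much like inspecting the generated sh file.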
I am in the process of a major rewrite. This is mostly motivated by how much more I understand Python now, but also by innovations in how I am managing GPUs in benchMT. The implementation will be done in a way that could also apply to GPU vendors other than AMD. I will replace the AMD compatible status with flags for readability, writability, and compute capability. Development is on Branch v3.0.
Let me know of any recommendations to consider in this rewrite.
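As a starting point for discussion, the three flags could look something like the sketch below. All names are hypothetical and the example values are made up for illustration; nothing here is taken from the v3.0 branch.

```python
from dataclasses import dataclass


@dataclass
class GpuStatus:
    """Per-card capability flags replacing a single 'compatible' yes/no."""
    vendor: str
    readable: bool = False   # parameters can be read
    writable: bool = False   # settings can be changed
    compute: bool = False    # card can run compute work


# Illustrative values only, not real detection results.
cards = [
    GpuStatus("INTEL"),
    GpuStatus("AMD", readable=True, writable=True, compute=True),
]

# Each utility can then select cards by the capability it actually needs.
writable_cards = [card for card in cards if card.writable]
print([card.vendor for card in writable_cards])  # ['AMD']
```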