Ricks-Lab / gpu-utils

A set of utilities for monitoring and customizing GPU performance
GNU General Public License v3.0

V3.0 Rewrite #53

Closed Ricks-Lab closed 4 years ago

Ricks-Lab commented 4 years ago

I am in the process of a major rewrite. This is mostly motivated by how much more I understand Python now, but also by innovations in how I am managing GPUs in benchMT. The implementation will be structured so that it can potentially support other GPU vendors in addition to AMD. I will replace the AMD-compatible status with flags for readability, writability, and compute capability. Development is on branch v3.0.
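
As a rough illustration of the flag idea described above, here is a minimal sketch. The class name `GpuFlags` and its fields are hypothetical and are not taken from the amdgpu-utils source; it assumes Python 3.7+ for `dataclasses`.

```python
from dataclasses import dataclass


@dataclass
class GpuFlags:
    """Hypothetical per-GPU capability flags replacing one 'compatible' status."""
    vendor: str
    readable: bool = False   # sensor files can be read
    writable: bool = False   # control files can be written
    compute: bool = False    # a compute (OpenCL) platform was found

    def summary(self) -> str:
        """Return a one-line description of the flags."""
        return '{}: r={} w={} c={}'.format(
            self.vendor, self.readable, self.writable, self.compute)


# Example list: one integrated Intel GPU and one AMD card.
gpus = [
    GpuFlags('INTEL'),
    GpuFlags('AMD', readable=True, writable=True, compute=True),
]

# A monitoring utility would only need the readable cards.
monitorable = [gpu for gpu in gpus if gpu.readable]
for gpu in gpus:
    print(gpu.summary())
```

With separate flags, a utility can select cards by what it actually needs (reading, writing, or compute) instead of relying on a vendor check.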

Let me know of any recommendations to consider in this rewrite.

Ricks-Lab commented 4 years ago

@csecht Branch v3.0 now has a functional amdgpu-ls. All other utilities are not functional. Let me know your thoughts on the update format. Key changes:

csecht commented 4 years ago

I have RX 570 cards, so I guess that's why I got these errors:

~/amdgpu-utils-3.0$ ./amdgpu-ls
AMD Wattman features enabled: 0xffff7fff
amdgpu version: 19.50-967956
3 detected GPUs, 2 are AMD, 2 may be compatible, checking...
Error: HW file does not exist: /sys/class/drm/card1/device/unique_id
Error reading parameter: unique_id, disabling for this GPU
Error: HW file does not exist: /sys/class/drm/card2/device/unique_id
Error reading parameter: unique_id, disabling for this GPU
2 confirmed compatible.

These aren't so much errors as notices, no?
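
A minimal sketch of how a missing optional file could be reported as a notice and then disabled, assuming a hypothetical helper `read_optional_param` (not the project's actual function). A temporary directory stands in for `/sys/class/drm/cardN/device`.

```python
import os
import tempfile


def read_optional_param(card_path, name, disabled):
    """Return the file contents, or None (marking the parameter disabled) if absent."""
    file_path = os.path.join(card_path, name)
    if not os.path.isfile(file_path):
        # A missing optional file is expected on some cards, so report a notice.
        print('Notice: {} not available, disabling for this GPU'.format(name))
        disabled.add(name)
        return None
    with open(file_path) as fptr:
        return fptr.read().strip()


# Demonstration with a temporary directory in place of the real sysfs path.
with tempfile.TemporaryDirectory() as card_dir:
    with open(os.path.join(card_dir, 'vbios_version'), 'w') as fptr:
        fptr.write('113-57045EHD1-M90\n')
    disabled_params = set()
    vbios = read_optional_param(card_dir, 'vbios_version', disabled_params)
    uid = read_optional_param(card_dir, 'unique_id', disabled_params)
```

Here the card is still usable; only the `unique_id` parameter is skipped on later reads.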

The reporting of a CPU with on-board GPU is a nice feature.

The amdgpu-ls information given is good. Below is an example for a perhaps more logical grouping (to my sensibilities). To make it clear what is switched around, the line numbers for each card are the original order. For example, I moved the Card Number line to the beginning of a card's report. I used slashes to visually separate the different categories, which I think could help users digest the data better, but the separators and groupings can be whatever works best:

4 Card Number: 0
1 Vendor: INTEL
2 amdgpu-utils Compatibility: NO
3 Card Model: Intel Corporation 8th Gen Core Processor Gaussian Mixture Model
5 PCIe ID: 00:02.0
6 Driver: i915
7 Card Path: /sys/class/drm/card0/device

9 Card Number: 1
1 Vendor: AMD
2 amdgpu-utils Compatibility: Yes
3 GPU UID: 
4 Device ID: {'vendor': '0x1002', 'device': '0x67df', 'subsystem_vendor': '0x1682', 'subsystem_device': '0xc570'}
6 Decoded Device ID: Radeon RX 570
7 Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
8 Display Card Model: Radeon RX 570
10 PCIe ID: 01:00.0
35 Link Speed: 8 GT/s
36 Link Width: 8
///////////////
11 Driver: amdgpu
12 vBIOS Version: 113-57045EHD1-M90
14 Card Path: /sys/class/drm/card1/device
13 HWmon: /sys/class/drm/card1/device/hwmon/hwmon3
///////////////
15 Current Power (W): 75.01299999999999
16 Power Cap (W): None
17 Power Cap Range (W): [0, 180]
18 Fan Enable: 1
19 Fan PWM Mode: [1, 'Manual']
20 Current Fan PWM (%): 45
24 Fan PWM Range (%): [0, 100]
21 Current Fan Speed (rpm): 1802
22 Fan Target Speed (rpm): 1802
23 Fan Speed Range (rpm): [0, 3800]
/////////////////
5 GPU Frequency/Voltage Control Type: 1
27 Current Voltages (V): {'vddgfx': 0.906}
28 Vddc Range: ['750mV', '1150mV']
29 Current Loading (%): 100
30 Current Clk Frequencies (MHz): {'sclk': 1071.0, 'mclk': 1850.0}
31 Current SCLK P-State: [6, '1071Mhz']
32 SCLK Range: ['300MHz', '2000MHz']
33 Current MCLK P-State: [2, '1850Mhz']
34 MCLK Range: ['300MHz', '2250MHz']
37 Power Performance Mode: 5-COMPUTE
38 Power Force Performance Level: manual
/////////////////
25 Current Temps (C): {'edge': 75.0}
26 Critical Temp (C): 94.0

Ricks-Lab commented 4 years ago

@csecht Thanks for the recommendation! I have pushed a modification which also uses indents for clarity. I still have not included clinfo details, but I am considering a subset of what was previously displayed.
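
A rough sketch of grouped, indented output along these lines. The group definitions and the function `format_card` are illustrative assumptions, not the code that was pushed.

```python
# Hypothetical grouping of parameter names; each inner list is one category.
GROUPS = [
    ['Vendor', 'Card Model', 'PCIe ID'],
    ['Driver', 'vBIOS Version', 'Card Path'],
]


def format_card(card_num, params, groups=GROUPS, indent='   ', sep_width=50):
    """Return output lines: a card header, then indented groups with separators."""
    lines = ['Card Number: {}'.format(card_num)]
    first_group = True
    for group in groups:
        # Skip parameters this card does not provide.
        present = [name for name in group if name in params]
        if not present:
            continue
        if not first_group:
            lines.append(indent + '#' * sep_width)
        first_group = False
        for name in present:
            lines.append('{}{}: {}'.format(indent, name, params[name]))
    return lines


intel = {'Vendor': 'INTEL', 'PCIe ID': '00:02.0', 'Driver': 'i915',
         'Card Path': '/sys/class/drm/card0/device'}
lines = format_card(0, intel, sep_width=10)
print('\n'.join(lines))
```

Keeping the grouping in one data structure means the order of the report can be changed without touching the printing logic.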

csecht commented 4 years ago

Hmmmm. Something broke with that mod:

(amdgpu-utils-env) craig@craig-Linux2:~/amdgpu-utils-3.0$ ./amdgpu-ls
amdgpu version: 19.50-967956
AMD Wattman features enabled: 0xffff7fff
3 detected GPUs, 2 are AMD, 3 may be compatible, checking...
Error: HW file does not exist: /sys/class/drm/card0/device/unique_id
Error: HW file does not exist: /sys/class/drm/card0/device/vbios_version
Warning: Error reading parameter: vbios, disabling for this GPU
Error: HW file does not exist: /sys/class/drm/card0/device/gpu_busy_percent
Warning: Error reading parameter: loading, disabling for this GPU
Error: HW file does not exist: /sys/class/drm/card0/device/pp_dpm_sclk
Warning: Error reading parameter: sclk_ps, disabling for this GPU
Error: HW file does not exist: /sys/class/drm/card0/device/pp_dpm_mclk
Warning: Error reading parameter: mclk_ps, disabling for this GPU
Error: HW file does not exist: /sys/class/drm/card0/device/pp_power_profile_mode
Warning: Error reading parameter: ppm, disabling for this GPU
Error: HW file does not exist: /sys/class/drm/card0/device/power_dpm_force_performance_level
Warning: Error reading parameter: power_dpm_force, disabling for this GPU
Traceback (most recent call last):
  File "./amdgpu-ls", line 127, in <module>
    main()
  File "./amdgpu-ls", line 103, in main
    gpu_list.read_gpu_sensor_data(data_type='All')
  File "/home/craig/amdgpu-utils-3.0/GPUmodules/GPUmodule.py", line 1397, in read_gpu_sensor_data
    v.read_gpu_sensor_data(data_type)
  File "/home/craig/amdgpu-utils-3.0/GPUmodules/GPUmodule.py", line 882, in read_gpu_sensor_data
    rdata = self.read_gpu_sensor(param, sensor_type=sensor_type)
  File "/home/craig/amdgpu-utils-3.0/GPUmodules/GPUmodule.py", line 779, in read_gpu_sensor
    file_path = os.path.join(sensor_path, sensor_file)
  File "/usr/lib/python3.6/posixpath.py", line 80, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

Ricks-Lab commented 4 years ago

It looks like it is trying to read sensors for the Intel card. I am not sure why as I was not calling for cards that were not compatible. I have pushed some modifications, but still don't see why it was occurring.
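
The traceback shows `os.path.join` receiving `None` as its first argument, which is consistent with a card that has no sensor directory. A minimal defensive sketch, assuming a hypothetical helper `sensor_file_path` (the real `read_gpu_sensor` may be organized differently):

```python
import os


def sensor_file_path(sensor_path, sensor_file):
    """Return the joined path, or None when the card has no sensor directory."""
    if sensor_path is None or sensor_file is None:
        # Avoids the TypeError raised by os.path.join(None, ...).
        return None
    return os.path.join(sensor_path, sensor_file)


# A card without hwmon data (for example, an integrated GPU) yields None.
no_path = sensor_file_path(None, 'power1_average')
# A card with a sensor directory yields the full file path.
amd_path = sensor_file_path('/sys/class/drm/card1/device/hwmon/hwmon3',
                            'power1_average')
```

A guard like this only hides the symptom; the caller would still need to skip cards that are not readable in the first place.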

Ricks-Lab commented 4 years ago

Here is the output format for Vega20 which demonstrates some of the benefits of the new sensor reading details:

rick@nexon:~/pydev/amdgpu-utils$ ./amdgpu-ls
rocm version: 3.0.6
AMD Wattman features enabled: 0xfffd7fff
2 detected GPUs, 1 are AMD, 1 may be compatible, checking...
1 confirmed compatible.

Card Number: 1
   Vendor: AMD
   amdgpu-utils Compatibility: True
   Readable: True
   Writeable: True
   GPU UID: a5e4788172dc768b
   Device ID: {'vendor': '0x1002', 'device': '0x66af', 'subsystem_vendor': '0x1458', 'subsystem_device': '0x1000'}
   Decoded Device ID: Vega 20
   Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev c1)
   Display Card Model: Vega 20
   PCIe ID: 43:00.0
   Link Speed: 8 GT/s
   Link Width: 16
   ##################################################
   Driver: amdgpu
   vBIOS Version: 113-D3600200-106
   Compute Platform: OpenCL 2.0
   GPU Frequency/Voltage Control Type: 2
   HWmon: /sys/class/drm/card1/device/hwmon/hwmon2
   Card Path: /sys/class/drm/card1/device
   ##################################################
   Current Power (W): 26.0
   Power Cap (W): None
   Power Cap Range (W): [0, 300]
   Fan Enable: 0
   Fan PWM Mode: [2, 'Dynamic']
   Current Fan PWM (%): 0
   Current Fan Speed (rpm): 0
   Fan Target Speed (rpm): 0
   Fan Speed Range (rpm): [0, 3850]
   Fan PWM Range (%): [0, 100]
   ##################################################
   Current Loading (%): 0
   Current Temps (C): {'mem': 28.0, 'edge': 30.0, 'junction': 32.0}
   Critical Temp (C): 100.0
   Current Voltages (V): {'vddgfx': 0.737}
   Current Clk Frequencies (MHz): {'sclk': 699.0, 'mclk': 800.0}
   Current SCLK P-State: [0, '700Mhz']
   SCLK Range: ['808Mhz', '2200Mhz']
   Current MCLK P-State: [1, '800Mhz']
   MCLK Range: ['800Mhz', '1200Mhz']
   Power Performance Mode: 5-COMPUTE
   Power Force Performance Level: auto

Card Number: 0
   Vendor: ASPEED
   amdgpu-utils Compatibility: False
   Card Model: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
   PCIe ID: c4:00.0
   Driver: ast
   Card Path: /sys/class/drm/card0/device

csecht commented 4 years ago

I downloaded the latest mods and it's working now. It's looking good. I like the new format, which is informative and logical. In the run parameters sections, what about another level of indenting for non-variables as a way to focus attention on amdgpu-utils variables? (see below)

(amdgpu-utils-env) craig@craig-Linux2:~/amdgpu-utils-3.0$ ./amdgpu-ls
amdgpu version: 19.50-967956
AMD Wattman features enabled: 0xffff7fff
3 detected GPUs, 2 are AMD, 3 may be compatible, checking...
3 confirmed compatible.

Error getting p-states: /sys/class/drm/card0/device/pp_od_clk_voltage
Card Number: 0
   Vendor: INTEL
   amdgpu-utils Compatibility: False
   Card Model: Intel Corporation 8th Gen Core Processor Gaussian Mixture Model
   PCIe ID: 00:02.0
   Driver: i915
   Card Path: /sys/class/drm/card0/device

Card Number: 1
   Vendor: AMD
   amdgpu-utils Compatibility: True
   Readable: True
   Writeable: True
   GPU UID: 
   Device ID: {'vendor': '0x1002', 'device': '0x67df', 'subsystem_vendor': '0x1682', 'subsystem_device': '0xc570'}
   Decoded Device ID: Radeon RX 570
   Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
   Display Card Model: Radeon RX 570
   PCIe ID: 01:00.0
   Link Speed: 8 GT/s
   Link Width: 8
   ##################################################
   Driver: amdgpu
   vBIOS Version: 113-57045EHD1-M90
   Compute Platform: OpenCL 1.2 AMD-APP (3004.6)
   GPU Frequency/Voltage Control Type: 1
   HWmon: /sys/class/drm/card1/device/hwmon/hwmon3
   Card Path: /sys/class/drm/card1/device
   ##################################################
   Current Power (W): 75.2
   Power Cap (W): None
        Power Cap Range (W): [0, 180]
   Fan Enable: 1
   Fan PWM Mode: [1, 'Manual']
   Current Fan PWM (%): 46
   Current Fan Speed (rpm): 1840
   Fan Target Speed (rpm): 1840
        Fan Speed Range (rpm): [0, 3800]
        Fan PWM Range (%): [0, 100]
   ##################################################
   Current GPU Loading (%): 100
   Current Memory Loading (%): 71
   Current Temps (C): {'edge': 72.0}
        Critical Temp (C): 94.0
   Current Voltages (V): {'vddgfx': 0.906}
        Vddc Range: ['750mV', '1150mV']
   Current Clk Frequencies (MHz): {'sclk': 1071.0, 'mclk': 1850.0}
   Current SCLK P-State: [6, '1071Mhz']
        SCLK Range: ['300MHz', '2000MHz']
   Current MCLK P-State: [2, '1850Mhz']
        MCLK Range: ['300MHz', '2250MHz']
   Power Performance Mode: 5-COMPUTE
   Power Force Performance Level: manual

Card Number: 2
   Vendor: AMD
   amdgpu-utils Compatibility: True
   Readable: True
   Writeable: True
   GPU UID: 
   Device ID: {'vendor': '0x1002', 'device': '0x67df', 'subsystem_vendor': '0x1682', 'subsystem_device': '0xc570'}
   Decoded Device ID: Radeon RX 570
   Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
   Display Card Model: Radeon RX 570
   PCIe ID: 02:00.0
   Link Speed: 8 GT/s
   Link Width: 8
   ##################################################
   Driver: amdgpu
   vBIOS Version: 113-57045EHD1-M90
   Compute Platform: OpenCL 1.2 AMD-APP (3004.6)
   GPU Frequency/Voltage Control Type: 1
   HWmon: /sys/class/drm/card2/device/hwmon/hwmon4
   Card Path: /sys/class/drm/card2/device
   ##################################################
   Current Power (W): 82.1
   Power Cap (W): None
        Power Cap Range (W): [0, 180]
   Fan Enable: 1
   Fan PWM Mode: [1, 'Manual']
   Current Fan PWM (%): 44
   Current Fan Speed (rpm): 1723
   Fan Target Speed (rpm): 1723
        Fan Speed Range (rpm): [0, 3800]
        Fan PWM Range (%): [0, 100]
   ##################################################
   Current GPU Loading (%): 100
   Current Memory Loading (%): 26
   Current Temps (C): {'edge': 69.0}
        Critical Temp (C): 94.0
   Current Voltages (V): {'vddgfx': 0.906}
        Vddc Range: ['750mV', '1150mV']
   Current Clk Frequencies (MHz): {'sclk': 1071.0, 'mclk': 1850.0}
   Current SCLK P-State: [6, '1071Mhz']
        SCLK Range: ['300MHz', '2000MHz']
   Current MCLK P-State: [2, '1850Mhz']
        MCLK Range: ['300MHz', '2250MHz']
   Power Performance Mode: 5-COMPUTE
   Power Force Performance Level: manual

csecht commented 4 years ago

I just noticed that -ls lists 3 confirmed compatible GPUs, but the Intel integrated GPU (card 0) is subsequently listed as "amdgpu-utils Compatibility: False". That's confusing.

Ricks-Lab commented 4 years ago

I just noticed that -ls lists 3 confirmed compatible GPUs, but the Intel integrated GPU (card 0) is subsequently listed as "amdgpu-utils Compatibility: False". That's confusing.

This problem is related to the previous error you reported. For some reason, the embedded Intel GPU is sometimes considered compatible. I am not seeing that on my system, but my embedded ASPEED GPU is not first on the list. I need to find the root cause of this before I move forward on the rewrite. I am probably going to get rid of the compatible flag and use only the readable and writable flags. This will take some time...

Ricks-Lab commented 4 years ago

I have completed the elimination of the compatibility flag and am now using only the readable and writable status. I also updated the output format per your suggestion. Looks good! Let me know of any issues.

csecht commented 4 years ago

Yes, nice, that makes sense. But now the voltage and clock speed range values are missing from the output (only showing an example of that for card 1, but it is an issue for both cards):

:~/amdgpu-utils-3.0$ ./amdgpu-ls
Detected GPUs: INTEL: 1, AMD: 2
AMD amdgpu version: 19.50-967956
AMD Wattman features enabled: 0xffff7fff
3 detected GPUs, 2 may be readable, 2 may be writable, checking...
3 total GPUs, 2 readable, 2 writable.

Card Number: 0
   Vendor: INTEL
   Readable: False
   Writable: False
   Card Model: Intel Corporation 8th Gen Core Processor Gaussian Mixture Model
   PCIe ID: 00:02.0
   Driver: i915
   Card Path: /sys/class/drm/card0/device

Card Number: 1
   Vendor: AMD
   Readable: True
   Writable: True
   GPU UID: 
   Device ID: {'vendor': '0x1002', 'device': '0x67df', 'subsystem_vendor': '0x1682', 'subsystem_device': '0xc570'}
   Decoded Device ID: Radeon RX 570
   Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
   Display Card Model: Radeon RX 570
   PCIe ID: 01:00.0
      Link Speed: 8 GT/s
      Link Width: 8
   ##################################################
   Driver: amdgpu
   vBIOS Version: 113-57045EHD1-M90
   Compute Platform: OpenCL 1.2 AMD-APP (3004.6)
   GPU Frequency/Voltage Control Type: 0
   HWmon: /sys/class/drm/card1/device/hwmon/hwmon3
   Card Path: /sys/class/drm/card1/device
   ##################################################
   Current Power (W): 74.2
   Power Cap (W): None
      Power Cap Range (W): [0, 180]
   Fan Enable: 1
   Fan PWM Mode: [1, 'Manual']
   Fan Target Speed (rpm): 1843
   Current Fan Speed (rpm): 1843
   Current Fan PWM (%): 46
      Fan Speed Range (rpm): [0, 3800]
      Fan PWM Range (%): [0, 100]
   ##################################################
   Current GPU Loading (%): 100
   Current Memory Loading (%): 84
   Current Temps (C): {'edge': 71.0}
      Critical Temp (C): 94.0
   Current Voltages (V): {'vddgfx': 0.906}
      Vddc Range: ['', '']
   Current Clk Frequencies (MHz): {'sclk': 1071.0, 'mclk': 1850.0}
   Current SCLK P-State: [6, '1071Mhz']
      SCLK Range: ['', '']
   Current MCLK P-State: [2, '1850Mhz']
      MCLK Range: ['', '']
   Power Performance Mode: 5-COMPUTE
   Power Force Performance Level: manual

Ricks-Lab commented 4 years ago

Fixed. I moved the read p-state function not realizing it read the ranges. Should be good now. Once I am convinced of it, I will start work on monitor.
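
For context, a sketch of reading the ranges together with the p-states. It assumes the ranges come from an `OD_RANGE` section of `pp_od_clk_voltage` (the file named in the earlier "Error getting p-states" message). The sample text and the function `parse_od_ranges` are illustrative assumptions; the p-state rows in the sample are made up, and only the range values match the output shown in this thread.

```python
import re

# Assumed layout of pp_od_clk_voltage; p-state rows are illustrative only.
SAMPLE_OD_TABLE = """\
OD_SCLK:
0:        300MHz        750mV
1:       1071MHz        906mV
OD_MCLK:
0:        300MHz        750mV
1:       1850MHz        900mV
OD_RANGE:
SCLK:     300MHz       2000MHz
MCLK:     300MHz       2250MHz
VDDC:     750mV        1150mV
"""


def parse_od_ranges(text):
    """Return {'SCLK': [min, max], ...} from the OD_RANGE section of the table."""
    ranges = {}
    in_range_section = False
    for line in text.splitlines():
        line = line.strip()
        if line == 'OD_RANGE:':
            in_range_section = True
            continue
        if not in_range_section:
            continue
        # Range rows look like "NAME:  <min>  <max>".
        match = re.fullmatch(r'(\w+):\s+(\S+)\s+(\S+)', line)
        if match:
            ranges[match.group(1)] = [match.group(2), match.group(3)]
    return ranges


od_ranges = parse_od_ranges(SAMPLE_OD_TABLE)
print(od_ranges)
```

If this parsing step is skipped, the ranges keep their empty defaults, which would match the `['', '']` values reported above.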

Ricks-Lab commented 4 years ago

I just pushed the fix.

Ricks-Lab commented 4 years ago

I have ls, monitor, and plot working. I need to improve the messaging on how many cards are readable and writable, since it is confusing now. I added a --table option to ls.

Not sure how complex getting pac to work will be. I hope to work on it this weekend.

csecht commented 4 years ago

That did the trick. It looks good. I see there is a new read-write summary format:

$ ./amdgpu-ls
Detected GPUs: INTEL: 1, AMD: 2
AMD: amdgpu version: 19.50-967956
AMD: Wattman features enabled: 0xffff7fff
3 detected GPUs, 2 may be rw, 0 may be r-only, 0 may be w-only, checking...
3 total GPUs, 2 rw, 0 r-only, 0 w-only.
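
A summary like the one above can be produced by counting mutually exclusive categories. This is a minimal sketch with a hypothetical function `rw_summary`; the actual implementation in amdgpu-utils may differ.

```python
def rw_summary(gpus):
    """Build the summary line from an iterable of (readable, writable) pairs."""
    counts = {'total': 0, 'rw': 0, 'r_only': 0, 'w_only': 0}
    for readable, writable in gpus:
        counts['total'] += 1
        # Each GPU lands in at most one category, so the counts never overlap.
        if readable and writable:
            counts['rw'] += 1
        elif readable:
            counts['r_only'] += 1
        elif writable:
            counts['w_only'] += 1
    return ('{total} total GPUs, {rw} rw, {r_only} r-only, '
            '{w_only} w-only.').format(**counts)


# One Intel GPU (neither readable nor writable) and two AMD cards (both).
summary = rw_summary([(False, False), (True, True), (True, True)])
print(summary)
```

Because the categories do not overlap, a GPU that is neither readable nor writable only appears in the total.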

For ./amdgpu-ls --clinfo, however, the last set of output lines for each GPU has no values:

Device Name: 
Device Version: 
Driver Version: 
Device OpenCL C Version: 
Max Compute Units: 
SIMD per CU: 
SIMD Width: 
SIMD Instruction Width: 
CL Max Memory Allocation: 
Max Work Item Dimensions: 
Max Work Item Sizes: 
Max Work Group Size: 
Preferred Work Group Multiple: 

...but the missing data are provided by clinfo (output is shown for only one of two GPUs):

$ clinfo
Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.1 AMD-APP (3004.6)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 
  Platform Host timer resolution                  1ns
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 2
  Device Name                                     Ellesmere
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 1.2 AMD-APP (3004.6)
  Driver Version                                  3004.6
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Board Name (AMD)                         Radeon RX 570 Series
  Device Topology (AMD)                           PCI-E, 01:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               32
  SIMD per compute unit (AMD)                     4
  SIMD width (AMD)                                16
  SIMD instruction width (AMD)                    1
  Max clock frequency                             1100MHz
  Graphics IP (AMD)                               8.0
  Device Partition                                (core)
    Max number of sub-devices                     32
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             256
  Preferred work group size (AMD)                 256
  Max work group size (AMD)                       1024
  Preferred work group size multiple              64
  Wavefront width (AMD)                           64
  Preferred / native vector sizes                 
    char                                                 4 / 4       
    short                                                2 / 2       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 1 / 1        (cl_khr_fp16)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     No
    Infinity and NANs                             No
    Round to nearest                              No
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              2098307072 (1.954GiB)
  Global free memory (AMD)                        2029544 (1.936GiB)
  Global memory channels (AMD)                    8
  Global memory banks per channel (AMD)           16
  Global memory bank width (AMD)                  256 bytes
  Error Correction support                        No
  Max memory allocation                           1563635302 (1.456GiB)
  Unified memory for Host and Device              No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       2048 bits (256 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        16384 (16KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   256 bytes
    Pitch alignment for 2D image buffers          256 pixels
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             2048x2048x2048 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Local
  Local memory size                               32768 (32KiB)
  Local memory syze per CU (AMD)                  65536 (64KiB)
  Local memory banks (AMD)                        32
  Max number of constant args                     8
  Max constant buffer size                        1563635302 (1.456GiB)
  Preferred constant buffer size (AMD)            16384 (16KiB)
  Max size of kernel argument                     1024
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1ns
  Profiling timer offset since Epoch (AMD)        1580764462579627264ns (Mon Feb  3 15:14:22 2020)
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Thread trace supported (AMD)                  Yes
    Number of async queues (AMD)                  2
    Max real-time compute queues (AMD)            0
    Max real-time compute units (AMD)             0
    SPIR versions                                 1.2
  printf() buffer size                            4194304 (4MiB)
  Built-in kernels                                
  Device Extensions                               cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_amd_bus_addressable_memory cl_khr_spir cl_khr_gl_event 

current version:

$ clinfo -v
clinfo version 2.2.18.03.26

csecht commented 4 years ago

And yes, the amdgpu-ls --table option works, showing a snapshot of the amdgpu-monitor table. However, monitor is not working:

(amdgpu-utils-env) craig@craig-Linux2:~/amdgpu-utils-3.0$ ./amdgpu-monitor
Detected GPUs: INTEL: 1, AMD: 2
AMD: amdgpu version: 19.50-967956
AMD: Wattman features enabled: 0xffff7fff
Traceback (most recent call last):
  File "./amdgpu-monitor", line 388, in <module>
    main()
  File "./amdgpu-monitor", line 315, in main
    num_gpus['readable'],
KeyError: 'readable'

...nor is -plot:

(amdgpu-utils-env) craig@craig-Linux2:~/amdgpu-utils-3.0$ ./amdgpu-plot
Detected GPUs: INTEL: 1, AMD: 2
AMD: amdgpu version: 19.50-967956
AMD: Wattman features enabled: 0xffff7fff
Traceback (most recent call last):
  File "./amdgpu-plot", line 891, in <module>
    main()
  File "./amdgpu-plot", line 843, in main
    num_gpus['readable'],
KeyError: 'readable'

Ricks-Lab commented 4 years ago

The KeyError was related to the changes in the read/write summary statement. I had not made the changes in plot and monitor. I have now completed the implementation with some optimization.
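
To illustrate the kind of mismatch involved, here is a small sketch. The key names `rw`, `r-only`, and `w-only` are assumptions based on the new summary line, and `readable_count` is a hypothetical helper.

```python
# Assumed shape of the count dictionary after the summary change.
num_gpus = {'total': 3, 'rw': 2, 'r-only': 0, 'w-only': 0}


def readable_count(counts):
    """Number of GPUs that can be read: read-write plus read-only."""
    return counts['rw'] + counts['r-only']


# Code still using the old key fails in the same way as the traceback.
try:
    num_gpus['readable']
    old_key_works = True
except KeyError:
    old_key_works = False

print(readable_count(num_gpus), old_key_works)
```

Putting the calculation in one shared helper means monitor and plot would pick up a future key change automatically.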

clinfo will need some work. I may include fewer parameters in the update. Here is a link to clinfo output for NV.

csecht commented 4 years ago

Yes, monitor and plot are now working, however, both output this warning to the terminal on every refresh/update cycle:

Warning: Invalid or disabled parameter: unique_id

With amdgpu-monitor, the warning just flashes on the terminal screen and is barely readable; with amdgpu-monitor --gui and amdgpu-plot, it eventually fills the terminal window. It does not appear with amdgpu-ls --table.

Ricks-Lab commented 4 years ago

Just pushed a fix that suppresses warnings for disabled sensors unless used with --debug. It still shows a warning when a sensor is initially disabled.
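
A minimal sketch of that behavior, using a hypothetical `SensorWarnings` class (not the project's code). Messages are both printed and stored so the behavior can be checked.

```python
class SensorWarnings:
    """Warn once when a sensor is disabled; repeat only in debug mode."""

    def __init__(self, debug=False):
        self.debug = debug
        self.disabled = set()
        self.messages = []

    def _warn(self, text):
        self.messages.append(text)
        print(text)

    def disable(self, name):
        """Called when a sensor is first found unusable: always warn."""
        self.disabled.add(name)
        self._warn('Warning: Error reading parameter: {}, '
                   'disabling for this GPU'.format(name))

    def read(self, name, reader):
        """Return reader() for enabled sensors, None for disabled ones."""
        if name in self.disabled:
            if self.debug:
                self._warn('Warning: Invalid or disabled parameter: '
                           '{}'.format(name))
            return None
        return reader()


quiet = SensorWarnings()
quiet.disable('unique_id')
for _ in range(3):            # three refresh cycles
    quiet.read('unique_id', lambda: 'unused')

verbose = SensorWarnings(debug=True)
verbose.disable('unique_id')
for _ in range(3):
    verbose.read('unique_id', lambda: 'unused')
```

In the quiet case only the initial warning is emitted, regardless of how many refresh cycles follow.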

csecht commented 4 years ago

Monitor, plot, and ls are all running smoothly. Looks good!

Ricks-Lab commented 4 years ago

Read and print clinfo complete for AMD. PAC util is functional but untested.

Ricks-Lab commented 4 years ago

I have fixed some issues with pac and have tested on my Vega20 system. It works, but still needs thorough testing.

csecht commented 4 years ago

I'm getting this error with --clinfo:

~/amdgpu-utils-3.0$ ./amdgpu-ls --clinfo
Traceback (most recent call last):
  File "./amdgpu-ls", line 141, in <module>
    main()
  File "./amdgpu-ls", line 88, in main
    gpu_list.set_gpu_list()
  File "/home/craig/amdgpu-utils-3.0/GPUmodules/GPUmodule.py", line 1179, in set_gpu_list
    opencl_device_version = self.opencl_map[pcie_id]['device_version']
TypeError: list indices must be integers or slices, not str

...and this with -pac:

~/amdgpu-utils-3.0$ ./amdgpu-pac
Traceback (most recent call last):
  File "./amdgpu-pac", line 1594, in <module>
    main()
  File "./amdgpu-pac", line 1545, in main
    gpu_list.set_gpu_list()
  File "/home/craig/amdgpu-utils-3.0/GPUmodules/GPUmodule.py", line 1179, in set_gpu_list
    opencl_device_version = self.opencl_map[pcie_id]['device_version']
TypeError: list indices must be integers or slices, not str

(edit)...and actually the same TypeError with amdgpu-ls.

Ricks-Lab commented 4 years ago

I have found and fixed the issue. Let me know of any other observations.
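
The TypeError above indicates that `opencl_map` was a list at the point where it was indexed with a PCIe ID string. As an illustration only (not the actual fix), here is a sketch that builds a dictionary keyed by PCIe ID from clinfo-style text. The function `build_opencl_map`, the field names, and the trimmed sample are assumptions based on the clinfo output posted earlier in this thread.

```python
import re

# Trimmed sample in the layout of the clinfo output shown in this thread.
SAMPLE_CLINFO = """\
  Device Name                                     Ellesmere
  Device Version                                  OpenCL 1.2 AMD-APP (3004.6)
  Driver Version                                  3004.6
  Device Topology (AMD)                           PCI-E, 01:00.0
  Max compute units                               32
  Device Name                                     Ellesmere
  Device Version                                  OpenCL 1.2 AMD-APP (3004.6)
  Driver Version                                  3004.6
  Device Topology (AMD)                           PCI-E, 02:00.0
  Max compute units                               32
"""

# clinfo label -> hypothetical key used in the map.
FIELDS = {
    'Device Version': 'device_version',
    'Driver Version': 'driver_version',
    'Max compute units': 'max_cu',
}


def build_opencl_map(text):
    """Return {pcie_id: {field: value}} parsed from clinfo-style text."""
    opencl_map = {}
    device = None
    for line in text.splitlines():
        # Labels and values are separated by a run of two or more spaces.
        parts = re.split(r'\s{2,}', line.strip(), maxsplit=1)
        if len(parts) != 2:
            continue
        key, value = parts
        if key == 'Device Name':
            device = {'device_name': value}   # start of a new device block
        elif device is None:
            continue                          # platform-level line
        elif key == 'Device Topology (AMD)':
            pcie_id = value.split(',')[-1].strip()
            opencl_map[pcie_id] = device
        elif key in FIELDS:
            device[FIELDS[key]] = value
    return opencl_map


opencl_map = build_opencl_map(SAMPLE_CLINFO)
print(opencl_map['01:00.0']['device_version'])
```

With a dictionary, a lookup such as `opencl_map[pcie_id]['device_version']` works for each card, and a card with no compute device simply has no entry.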

Do you have any NV GPUs? I just want to make sure the issues impacting your Intel GPU are also resolved for NV.

csecht commented 4 years ago

Nice. Almost everything is working smoothly. The --clinfo option has all fields filled only for Card 2; Card 1 has missing data in the OpenCL spec section (below). The format looks good though.

Nope, no NV cards here.

craig@craig-Linux2:~/amdgpu-utils-3.0$ ./amdgpu-ls --clinfo
Detected GPUs: INTEL: 1, AMD: 2
AMD: amdgpu version: 19.50-967956
AMD: Wattman features enabled: 0xffff7fff
3 total GPUs, 2 rw, 0 r-only, 0 w-only

Card Number: 0
   Vendor: INTEL
   Readable: False
   Writable: False
   Compute: False
   Card Model: Intel Corporation 8th Gen Core Processor Gaussian Mixture Model
   PCIe ID: 00:02.0
   Driver: i915
   Card Path: /sys/class/drm/card0/device

Card Number: 1
   Vendor: AMD
   Readable: True
   Writable: True
   Compute: True
   GPU UID: 
   Device ID: {'vendor': '0x1002', 'device': '0x67df', 'subsystem_vendor': '0x1682', 'subsystem_device': '0xc570'}
   Decoded Device ID: Radeon RX 570
   Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
   Display Card Model: Radeon RX 570
   PCIe ID: 01:00.0
      Link Speed: 8 GT/s
      Link Width: 8
   ##################################################
   Driver: amdgpu
   vBIOS Version: 113-57045EHD1-M90
   Compute Platform: OpenCL 1.2 AMD-APP (3004.6)
   GPU Frequency/Voltage Control Type: 1
   HWmon: /sys/class/drm/card1/device/hwmon/hwmon3
   Card Path: /sys/class/drm/card1/device
   ##################################################
   Current Power (W): 45.2
   Power Cap (W): 120.0
      Power Cap Range (W): [0, 180]
   Fan Enable: 1
   Fan PWM Mode: [1, 'Manual']
   Fan Target Speed (rpm): 1177
   Current Fan Speed (rpm): 1177
   Current Fan PWM (%): 36
      Fan Speed Range (rpm): [0, 3800]
      Fan PWM Range (%): [0, 100]
   ##################################################
   Current GPU Loading (%): 39
   Current Memory Loading (%): 23
   Current Temps (C): {'edge': 54.0}
      Critical Temp (C): 94.0
   Current Voltages (V): {'vddgfx': 912}
      Vddc Range: ['750mV', '1150mV']
   Current Clk Frequencies (MHz): {'sclk': 1068.09, 'mclk': 1850.0}
   Current SCLK P-State: [6, '1071Mhz']
      SCLK Range: ['300MHz', '2000MHz']
   Current MCLK P-State: [2, '1850Mhz']
      MCLK Range: ['300MHz', '2250MHz']
   Power Performance Mode: 5-COMPUTE
   Power Force Performance Level: manual
   ##################################################
   Device Name: Ellesmere
   Device Version: OpenCL 1.2 AMD-APP (3004.6)
   Driver Version: 
   Device OpenCL C Version: 
   Max Compute Units: 
   SIMD per CU: 
   SIMD Width: 
   SIMD Instruction Width: 
   CL Max Memory Allocation: 
   Max Work Item Dimensions: 
   Max Work Item Sizes: 
   Max Work Group Size: 
   Preferred Work Group Multiple: 

Card Number: 2
   Vendor: AMD
   Readable: True
   Writable: True
   Compute: True
   GPU UID: 
   Device ID: {'vendor': '0x1002', 'device': '0x67df', 'subsystem_vendor': '0x1682', 'subsystem_device': '0xc570'}
   Decoded Device ID: Radeon RX 570
   Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
   Display Card Model: Radeon RX 570
   PCIe ID: 02:00.0
      Link Speed: 8 GT/s
      Link Width: 8
   ##################################################
   Driver: amdgpu
   vBIOS Version: 113-57045EHD1-M90
   Compute Platform: OpenCL 1.2 AMD-APP (3004.6)
   GPU Frequency/Voltage Control Type: 1
   HWmon: /sys/class/drm/card2/device/hwmon/hwmon4
   Card Path: /sys/class/drm/card2/device
   ##################################################
   Current Power (W): 48.1
   Power Cap (W): 120.0
      Power Cap Range (W): [0, 180]
   Fan Enable: 1
   Fan PWM Mode: [1, 'Manual']
   Fan Target Speed (rpm): 1212
   Current Fan Speed (rpm): 1212
   Current Fan PWM (%): 36
      Fan Speed Range (rpm): [0, 3800]
      Fan PWM Range (%): [0, 100]
   ##################################################
   Current GPU Loading (%): 25
   Current Memory Loading (%): 12
   Current Temps (C): {'edge': 57.0}
      Critical Temp (C): 94.0
   Current Voltages (V): {'vddgfx': 906}
      Vddc Range: ['750mV', '1150mV']
   Current Clk Frequencies (MHz): {'sclk': 1068.25, 'mclk': 1850.0}
   Current SCLK P-State: [6, '1071Mhz']
      SCLK Range: ['300MHz', '2000MHz']
   Current MCLK P-State: [2, '1850Mhz']
      MCLK Range: ['300MHz', '2250MHz']
   Power Performance Mode: 5-COMPUTE
   Power Force Performance Level: manual
   ##################################################
   Device Name: Ellesmere
   Device Version: OpenCL 1.2 AMD-APP (3004.6)
   Driver Version: 3004.6
   Device OpenCL C Version: OpenCL C 1.2
   Max Compute Units: 32
   SIMD per CU: 4
   SIMD Width: 16
   SIMD Instruction Width: 1
   CL Max Memory Allocation: 1950176460
   Max Work Item Dimensions: 3
   Max Work Item Sizes: 1024 1024 1024
   Max Work Group Size: 1024
   Preferred Work Group Multiple: 64
csecht commented 4 years ago

For the output of the amdgpu-ls --clinfo option, what about omitting most of the data that is already provided by amdgpu-ls and adding a bit more OpenCL data? See example below. Doing that would make the --clinfo option function more like the other options that provide just option-specific output. Unfortunately, I don't know enough about OpenCL to say what clinfo data would be most helpful for amdgpu-utils users.

Card Number: 2
   Vendor: AMD
   GPU UID: 
   Display Card Model: Radeon RX 570
   PCIe ID: 02:00.0
   ##################################################
   Device Name: Ellesmere
   Device Version: OpenCL 1.2 AMD-APP (3004.6)
   Driver Version: 3004.6
   Device OpenCL C Version: OpenCL C 1.2
   Max Compute Units: 32
   SIMD per CU: 4
   SIMD Width: 16
   SIMD Instruction Width: 1
   CL Max Memory Allocation: 1950176460
   Max Work Item Dimensions: 3
   Max Work Item Sizes: 1024 1024 1024
   Max Work Group Size: 1024
   Preferred Work Group Multiple: 64
   Half-precision Floating-point support           (cl_khr_fp16)
   Single-precision Floating-point support         (core)
   Double-precision Floating-point support         (cl_khr_fp64)
   Image support                                   Yes
     Max number of samplers per kernel             16
     Max size for 1D images from buffer            134217728 pixels
     Max 1D or 2D image array size                 2048 images
     Base address alignment for 2D image buffers   256 bytes
     Pitch alignment for 2D image buffers          256 pixels
     Max 2D image size                             16384x16384 pixels
     Max 3D image size                             2048x2048x2048 pixels
     Max number of read image args                 128
     Max number of write image args                8
Ricks-Lab commented 4 years ago

I have optimized the code for reading from clinfo. This cleans up the code quite a bit and solves the missing data on the second card. For clinfo, my intent was to only include parameters relevant to setting command line arguments for the SETI apps. Not sure if Einstein apps also have command line arguments to optimize fft processing. Also, I wanted to limit parameters considered since there is no alignment in naming between AMD, NV, and INTEL.
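
For anyone curious how a small, fixed set of clinfo parameters could be picked out, here is a minimal sketch. It is not the actual amdgpu-utils code: the function name `parse_clinfo_lines`, the `wanted` argument, and the assumption that each line is a name followed by two or more spaces and then a value (as in the pasted sample above) are all illustrative.

```python
import re


def parse_clinfo_lines(text, wanted=None):
    """Return {name: value} for lines shaped like 'name   value'.

    Assumes name and value are separated by two or more spaces.
    `wanted` is an optional set of names to keep (hypothetical filter).
    """
    params = {}
    for line in text.splitlines():
        parts = re.split(r"\s{2,}", line.strip(), maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank lines and lines without a value column
        name, value = parts
        if wanted is None or name in wanted:
            params[name] = value
    return params


sample = """\
  Max number of samplers per kernel             16
  Max 2D image size                             16384x16384 pixels
  Max number of write image args                8
"""

print(parse_clinfo_lines(sample, wanted={"Max 2D image size"}))
```

Keeping the wanted names in one explicit set would make it easier to hold a separate set per vendor, since the naming is not aligned between AMD, NV, and INTEL.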

csecht commented 4 years ago

That makes sense. Thanks. I'll ask on the E@H forums whether FFT processing can be optimized by users.

Ricks-Lab commented 4 years ago

Just made the last of planned changes and tested on Vega20, Type 2 GPU and changed status to Beta Release.

@csecht Let me know if you are able to test pac scenarios on your system. Thanks!

csecht commented 4 years ago

Yes, no problem. Is there a specific scenario in mind or should I just put it through its paces?

On Feb 26, 2020, at 5:38 AM, Rick notifications@github.com wrote:

Just made the last of planned changes and tested on Vega20, Type 1 GPU and changed status to Beta Release.

@csecht https://github.com/csecht Let me know if you are able to test pac scenarios on your system. Thanks!

Ricks-Lab commented 4 years ago

Changes include significant logic changes in how pac writes to cards. This is especially true for type 2 cards, which I have tested, but also affects type 1. I suggest just putting it through all possible use cases. I found it easiest to generate a bunch of write sh files and review them, and then to run in execute mode with the monitor running.

I have not yet implemented the warning message concerning changing fan control back to dynamic. Hope to get to it tonight.

Ricks-Lab commented 4 years ago

Also, can you take a screenshot of your pac window and replace the type 1 pac example image in the user guide? The image from my 4 GPU system is harder to read.

csecht commented 4 years ago

beta testing of branch 3 using amdgpu version: 19.50-967956

PAC Power Cap: OK to manually set, reset, max, reset.

Fan PWM: the initial state was Dynamic (pwm1_enable=2). The initial attempt to set manual mode (enable=1) failed (see code snippet below).
Setting max (enable=0) and the subsequent reset (into oscillating dynamic mode, enable=2) were both OK. I conclude that PAC fails to set manual PWM (enable=1) from the dynamic state (enable=2); manual fan PWM can only be set from the max state. Reset (enable=2) works from the manual state, just not the reverse. The failure to set manual PWM from the dynamic state occurs whether the dynamic state was set at boot (thermostatic) or set by PAC (oscillating fan); pwm1_enable=2 in both conditions.

# Write Delta mode.
Batch file completed: /home/craig/amdgpu-utils-3.0/pac_writer_5442276dfbc14cdb9987ba537e95e92f.sh
Writing 1 changes to GPU /sys/class/drm/card1/device
+ sudo sh -c echo '119' >  /sys/class/drm/card1/device/hwmon/hwmon3/pwm1
sh: echo: I/O error
PAC execution complete.

SCLK masking: OK.
PPM changes: OK.
(These cards don't allow overclocking or underclocking, so I can't test that.)

Yes, I'll replace the type 1 pac pic in the branch 3 guide.

csecht commented 4 years ago

Hmm, I completed a push request to delete the old type1.png, but don't have push access to upload the replacement file. Not sure whether I did that right. Here is the screenshot: amdgpu-pac_type1

Ricks-Lab commented 4 years ago

For the changes, you should create a new branch from v3.0, and replace the png file, commit, push and then do a pull request.

Ricks-Lab commented 4 years ago

For over/under clocking, what is your current feature mask setting?

For the fan speed setting error, can you provide the details of the pac_writer file? It looks like there is a logic problem.

csecht commented 4 years ago

I'll work on the pull request tomorrow (about to go to bed).

AMD: Wattman features enabled: 0xffff7fff

pac_writer file snippet for going from manual fan PWM to "reset" (everything else in sh script is commented out):

# PWM entry: reset, Resetting to default mode of dynamic
sudo sh -c "echo '2' >  /sys/class/drm/card1/device/hwmon/hwmon3/pwm1_enable"
Ricks-Lab commented 4 years ago

You might want to try a feature mask of 0xfffd7fff. I think there shouldn’t be a reason you can’t underclock.
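
To see exactly what differs between the two masks mentioned in this thread, a quick bit comparison helps. This is only arithmetic on the two values; it makes no claim about what each bit controls in the driver.

```python
old_mask = 0xFFFF7FFF  # value reported by amdgpu-ls earlier in this thread
new_mask = 0xFFFD7FFF  # value suggested above

# XOR leaves a 1 wherever the two masks disagree.
changed = old_mask ^ new_mask
changed_bits = [bit for bit in range(32) if (changed >> bit) & 1]

# Bits that are 0 (feature disabled) in the suggested mask.
cleared_in_new = [bit for bit in range(32) if not (new_mask >> bit) & 1]

print(f"changed bits: {changed_bits} (0x{changed:08x})")
print(f"bits cleared in new mask: {cleared_in_new}")
```

So the suggested mask differs from the current one by a single bit.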

csecht commented 4 years ago

Regarding the can't-get-there-from-here fan PWM problem: it sounds very similar to what someone said in that external discussion thread you linked over in the thermostatic issue, where they described a problem using the app fancontrol having to go through the max setting to get from a dynamic to manual mode.

csecht commented 4 years ago

Ah, cool. I'll try that feature mask tomorrow. Thanks.

Ricks-Lab commented 4 years ago

I have added resets before setting manual or dynamic modes. Let me know if it makes a difference.

csecht commented 4 years ago

Yes, that worked. I can now switch to any PWM mode from any mode.
When switching from either manual or max to dynamic (pac reset), this is what goes to pac_writer:

# Write Delta mode.
Batch file completed: /home/craig/amdgpu-utils-3.0/pac_writer_e5825f00bf98447db5cc09cecdaf5b9d.sh
Writing 1 changes to GPU /sys/class/drm/card1/device
+ sudo sh -c echo '0' >  /sys/class/drm/card1/device/hwmon/hwmon3/pwm1_enable
+ sudo sh -c echo '2' >  /sys/class/drm/card1/device/hwmon/hwmon3/pwm1_enable
PAC execution complete

(I notice that in the Master, the same pac change doesn't include the pass through the 0 state.)

When going from dynamic to manual, it's this:

# Write Delta mode.
Batch file completed: /home/craig/amdgpu-utils-3.0/pac_writer_d7be7e240c4f417693fe70a47d895801.sh
Writing 1 changes to GPU /sys/class/drm/card1/device
+ sudo sh -c echo '1' >  /sys/class/drm/card1/device/hwmon/hwmon3/pwm1_enable
+ sudo sh -c echo '117' >  /sys/class/drm/card1/device/hwmon/hwmon3/pwm1
PAC execution complete.
csecht commented 4 years ago

That different feature mask worked to allow undervolting and underclocking (haven't tried overclocking). Something new to play with! I don't know how I missed that feature mask before in the user guide. Did it change or did I just mistype it?

I see that amdgpu-monitor now includes memory load %. That's nice. Are you going to add memory load graphing to amdgpu-plot?

csecht commented 4 years ago

I bungled the replacement of the pac_type1.png file. The GitHub instructions for creating a new branch didn't work because I never got a "Create branch" option from the branch select menu. I was able to create a new file off the 3.0 branch, but couldn't figure out how to use that to upload the image. Somehow managed to upload the png in a new Master branch, which wasn't my intention. I haven't managed to delete the old png in the 3.0 branch. I'm putting my dunce cap on and going to find a corner to sit in...

csecht commented 4 years ago

I closed my first two bungled pull requests for that png replacement and opened another, this time from the 3.0 branch to pull into Master. I know, not what you wanted. I still haven't figured out how to open a new branch. Recalibrating.............

csecht commented 4 years ago

I just noticed that the amdgpu-monitor terminal window leaves this when the module is closed, which it didn't in prior versions:

^Cctrl_c_handler (ID: 2) has been caught. Setting quit flag...

Obviously doesn't affect function, just a cosmetic thing.

Ricks-Lab commented 4 years ago

> That different feature mask worked to allow undervolting and underclocking (haven't tried overclocking). Something new to play with! I don't know how I missed that feature mask before in the user guide. Did it change or did I just mistype it?

It is related to this pull request from another user.

> I see that amdgpu-monitor now includes memory load %. That's nice. Are you going to add memory load graphing to amdgpu-plot?

I think plot is a bit too slow already. Until I find a way to make it faster, I will hold off on adding to it. I think there is a better way than concatenating dataframes. I will have to work on that in the future.
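
One common alternative to repeated concatenation, sketched here as an assumption and not as the plan for amdgpu-plot, is to keep samples in a fixed-length buffer and only build plot columns when they are needed. The class name `SampleBuffer` and the column names are hypothetical.

```python
from collections import deque


class SampleBuffer:
    """Fixed-length store of sample rows (hypothetical helper).

    Appending is O(1); the oldest row is dropped automatically once
    max_rows is reached, so no concatenation or trimming is needed.
    """

    def __init__(self, max_rows):
        self.rows = deque(maxlen=max_rows)

    def append(self, row):
        self.rows.append(row)

    def column(self, name):
        """Return one column as a list, oldest sample first."""
        return [row[name] for row in self.rows]


buf = SampleBuffer(max_rows=3)
for i in range(5):
    buf.append({"time": i, "load": 10 * i})

print(buf.column("load"))
```

If a DataFrame is still wanted for plotting, it could be built once per refresh from `list(buf.rows)` instead of growing one by concatenation on every sample.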

Ricks-Lab commented 4 years ago

> I just noticed that the amdgpu-monitor terminal window leaves this when the module is closed, which it didn't in prior versions:
>
> ^Cctrl_c_handler (ID: 2) has been caught. Setting quit flag...
>
> Obviously doesn't affect function, just a cosmetic thing.

I have simplified the message unless in debug mode.
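
A minimal sketch of that idea, using Python's standard `signal` module. The `DEBUG` flag, the `STATE` dict, and the simplified message text are placeholders; the exact wording used in amdgpu-monitor is not shown in this thread.

```python
import signal

DEBUG = False            # hypothetical flag; a real tool might set it from a command line option
STATE = {"quit": False}  # checked by the main loop to know when to stop


def ctrl_c_handler(signum, frame):
    """Set the quit flag; print the detailed message only in debug mode."""
    STATE["quit"] = True
    if DEBUG:
        print(f"ctrl_c_handler (ID: {int(signum)}) has been caught. Setting quit flag...")
    else:
        print("Quitting...")  # placeholder for the simplified message


if __name__ == "__main__":
    # Register the handler so Ctrl-C sets the flag instead of raising KeyboardInterrupt.
    signal.signal(signal.SIGINT, ctrl_c_handler)
```

The main loop would then poll `STATE["quit"]` and exit cleanly on the next iteration.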

Ricks-Lab commented 4 years ago

I found an issue in the previous version you tested (missing new line). I fixed this, and found resetting before going to manual is not necessary. I am only doing a reset before going back to dynamic. Is this working as expected now?
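
The ordering described here can be sketched as follows. This is not the pac source: the function `pwm_commands` is hypothetical, and the mode numbers (0 = max, 1 = manual, 2 = dynamic) and command format are taken from the pac_writer output posted above.

```python
PWM_MODES = {"max": 0, "manual": 1, "dynamic": 2}  # pwm1_enable values as used in this thread


def pwm_commands(hwmon_path, target, pwm_value=None):
    """Return the shell commands to switch the fan to `target` mode.

    A reset (write 0) is emitted only before going back to dynamic.
    """
    enable_file = f"{hwmon_path}/pwm1_enable"
    commands = []
    if target == "dynamic":
        # Pass through the 0 state first, as in the posted pac_writer output.
        commands.append(f"sudo sh -c \"echo '0' > {enable_file}\"")
    commands.append(f"sudo sh -c \"echo '{PWM_MODES[target]}' > {enable_file}\"")
    if target == "manual":
        if pwm_value is None:
            raise ValueError("manual mode needs a pwm_value")
        commands.append(f"sudo sh -c \"echo '{pwm_value}' > {hwmon_path}/pwm1\"")
    return commands


hwmon = "/sys/class/drm/card1/device/hwmon/hwmon3"
for cmd in pwm_commands(hwmon, "dynamic"):
    print(cmd)
for cmd in pwm_commands(hwmon, "manual", pwm_value=117):
    print(cmd)
```

Each command sits on its own line in the returned list, which also avoids the kind of missing-newline problem mentioned above when the commands are written to a script.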

csecht commented 4 years ago

Yes, working as expected.