Ricks-Lab / gpu-utils

A set of utilities for monitoring and customizing GPU performance
GNU General Public License v3.0

Release Candidate - Testing Requested #76

Closed Ricks-Lab closed 4 years ago

Ricks-Lab commented 4 years ago

I have prepared v3.2.0 Release Candidate 1 on master. I have tested it on my 3 systems and it looks good so far. Please share your experience here as verification/feedback before the release planned for this coming weekend. Thanks!

csecht commented 4 years ago

I will work on updating the User Guide. I have a question about the listing of p-states from my Navi 10 card (below). It is similar to what is currently in the Guide for your Vega 20 card, but I only recently noticed it. With amdgpu-ls --pstates, there are two sets of frequencies for sclk and mclk curve endpoints. I don't clearly understand what the two sets represent. In amdgpu-monitor, the highest SCLK p-state I see with the card under load is '2', which seems to correspond to the '2' in the first set of --pstates and the '2' in the SCLK mask of amdgpu-pac. Everything is fine there. Yet the highest MCLK p-state of '3' that I see in amdgpu-monitor, which also shows in the MCLK mask of amdgpu-pac, does not correspond with anything in amdgpu-ls --pstates. How should these various p-states for Type 2 cards be explained in the User Guide?

$ ./amdgpu-ls --pstates
Detected GPUs: INTEL: 1, AMD: 1
AMD: amdgpu version: 20.10-1048554
AMD: Wattman features enabled: 0xfffd7fff
2 total GPUs, 1 rw, 0 r-only, 0 w-only

Card Number: 1
   Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] (rev ca)
   Card Path: /sys/class/drm/card1/device
   GPU Frequency/Voltage Control Type: 2
   SCLK:                   MCLK:
    0:  300Mhz              0:  100Mhz  
    1:  1040Mhz             1:  500Mhz  
    2:  1780Mhz             2:  625Mhz  
   SCLK:                   MCLK:
    0:  800Mhz    -         
    1:  1780Mhz   -         1:  875MHz    -       
   VDDC_CURVE:
    0: ['800MHz', '707mV']
    1: ['1290MHz', '750mV']
    2: ['1780MHz', '959mV']

BTW, I have been able to overclock and underclock the endpoints and undervolt the curve.

Ricks-Lab commented 4 years ago

I have modified the format of amdgpu-ls --pstates to be clearer. For type 2 cards, there are no pstates defined in the pp_od_clk_voltage file, so I show the pstates from the pp_dpm_[sm]clk files. Type 2 cards do not define the curve with pstates, but instead use AVFS on a curve defined by the 3 Freq/Voltage curve points.
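As a rough illustration of reading p-states from a pp_dpm_sclk-style file, the sketch below parses text in the `index: frequency` layout shown in the listing earlier in this thread. The function name `parse_dpm_states` and the sample text are hypothetical and are not taken from amdgpu-utils. It assumes the driver marks the currently active state with a trailing `*`.

```python
import re

# Matches lines such as "1: 1040Mhz *"; a trailing '*' flags the active state.
DPM_LINE = re.compile(r'^\s*(\d+):\s*(\d+)\s*[Mm][Hh]z\s*(\*)?\s*$')


def parse_dpm_states(text):
    """Return ({index: MHz}, active_index) parsed from pp_dpm_*-style text."""
    states = {}
    active = None
    for line in text.splitlines():
        match = DPM_LINE.match(line)
        if not match:
            continue  # Skip blank or unrecognized lines.
        index, mhz = int(match.group(1)), int(match.group(2))
        states[index] = mhz
        if match.group(3):
            active = index
    return states, active


# Hypothetical sample using the SCLK values from the listing in this thread.
sample = "0: 300Mhz\n1: 1040Mhz\n2: 1780Mhz *\n"
states, active = parse_dpm_states(sample)
print(states, active)
```

On a real system the text would come from reading the file under the card path, for example /sys/class/drm/card1/device/pp_dpm_sclk.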

To overclock, I assume you would not need to change the curve, but just define an operating point at a higher frequency than the stock highest. This may be limited by the OD_Range points.

csecht commented 4 years ago

Good, got it. When might it be useful to change the curve? (I edited a typo in my previous comment from "I have been about to overclock..." to "I have been able to overclock...") Yes, I have overclocked by raising the sclk or mclk OD curve endpoints and have undervolted by lowering the mV of the 3rd vddc curve point. I haven't tried altering that curve in any other way.

Ricks-Lab commented 4 years ago

I have always been working to manage power, so I don't have much experience overclocking, though I have tried it with older cards in some benchmarking I was doing. The curve is what defines how AVFS works. The GPU is meant to operate on that curve. Perhaps the curve doesn't accurately represent operating points beyond its end points, so redefining an end point might make sense. It may be a good idea to plot the curve in Excel and see how any modified curve would compare. Another use could be instability in an aged card. Maybe you can get more life out of it by shifting the whole curve by a voltage offset.
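To make the idea of shifting the whole curve concrete, here is a minimal sketch that applies a voltage offset to the three VDDC_CURVE points and compares voltages along the curve. The helper names and the +25 mV offset are hypothetical, and the straight-line interpolation between points is only an assumption for comparison purposes; the hardware may fit the curve differently.

```python
def shift_curve(points, offset_mv):
    """Return a new curve with every voltage shifted by offset_mv."""
    return [(mhz, mv + offset_mv) for mhz, mv in points]


def voltage_at(points, mhz):
    """Linearly interpolate the curve voltage (mV) at a frequency (MHz)."""
    pts = sorted(points)
    if not pts[0][0] <= mhz <= pts[-1][0]:
        raise ValueError('frequency is outside the curve end points')
    for (f0, v0), (f1, v1) in zip(pts, pts[1:]):
        if f0 <= mhz <= f1:
            return v0 + (v1 - v0) * (mhz - f0) / (f1 - f0)


# VDDC_CURVE points reported earlier in this thread for the Navi 10 card.
stock = [(800, 707), (1290, 750), (1780, 959)]
shifted = shift_curve(stock, 25)  # hypothetical +25 mV offset

print(shifted)
print(voltage_at(stock, 1535), voltage_at(shifted, 1535))
```

Printing a few frequencies for both curves gives the same comparison a spreadsheet plot would, without leaving the terminal.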

Ricks-Lab commented 4 years ago

@csecht Please pull the latest from master. I made some code optimizations (pre-compile regex and optimize some string searches). I also made a minor change to pac interface.

Ricks-Lab commented 4 years ago

@csecht I have merged your pull request. Looks good!

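A couple of minor observations:

Applicable version should be 3.2.x
The plot example is from old version. There are minor format changes in the latest.
The pac example for Type 1 cards is not the latest.

Have you been able to test the latest on master on your systems? On my systems, it is more responsive with the optimizations.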

csecht commented 4 years ago

Okay, I’ll update the plot and pac Type1 examples tomorrow. Yes, I did notice that the monitor GUI launches very quickly. Nice. I’ll test the other modules tomorrow.

On Jun 3, 2020, at 6:52 PM, Rick notifications@github.com wrote:

@csecht https://github.com/csecht I have merged your pull request. Looks good!

A couple of minor observations:

Applicable version should be 3.2.x
The plot example is from old version. There are minor format changes in the latest.
The pac example for Type 1 cards is not the latest.

Have you been able to test the latest on master on your systems? On my systems, it is more responsive with the optimizations.


Ricks-Lab commented 4 years ago

I implemented another optimization by using an Enum object in the definition of sensors instead of using names which should be slightly faster. It was a major change, so a thorough review of amdgpu-ls parameters would be a good idea.

csecht commented 4 years ago

It all looks good. Nice and responsive too.

Ricks-Lab commented 4 years ago

@csecht I merged your pull request. Looks good!
I plan to make the release tomorrow.

csecht commented 4 years ago

I got this error this morning with PAC whenever I try to change any parameter:

$ ./amdgpu-pac --execute
Detected GPUs: INTEL: 1, AMD: 1
AMD: amdgpu version: 20.10-1048554
AMD: Wattman features enabled: 0xfffd7fff
2 total GPUs, 1 rw, 0 r-only, 0 w-only

# Write Delta mode.
Traceback (most recent call last):
  File "./amdgpu-pac", line 838, in save_card
    old_pwm = int(v.get_params_value('fan_pwm')) if v.get_params_value('fan_pwm').isnumeric() else None
AttributeError: 'float' object has no attribute 'isnumeric'

At that point it just hangs and I don't get a prompt to enter my sudo password. Yesterday I installed then uninstalled rocm and reinstalled AMDGPU for OpenCL. I am not sure whether it's a change in amdgpu-utils or something amiss in the driver package. The RX 5600 XT card is crunching fine at its default values, however.

EDIT: I was able to successfully run a startup PAC BASH script, as a service, to change the sclk endpoint for that card, so the device files can be edited.
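The traceback above comes from calling `.isnumeric()`, which is a `str` method, on a value that turned out to be a `float`. One type-tolerant way to handle this is sketched below. The helper name `to_int_or_none` is hypothetical and this is not necessarily the fix that was pushed to master.

```python
def to_int_or_none(value):
    """Return value as an int when it is numeric, otherwise None.

    Accepts ints, floats, and numeric strings such as '28' or '28.0'.
    """
    # bool is a subclass of int; treat it, and any other type, as non-numeric.
    if isinstance(value, bool) or not isinstance(value, (int, float, str)):
        return None
    try:
        return int(float(value))
    except (ValueError, OverflowError):
        # Non-numeric strings, NaN, and infinities end up here.
        return None


for raw in (28.0, '28', 'auto', None, ''):
    print(repr(raw), '->', to_int_or_none(raw))
```

With a helper like this, the fan PWM value could be passed through once, whatever type the parameter currently holds.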

Ricks-Lab commented 4 years ago

@csecht I think I fixed the problem. It looks like my original fix for this is what caused the writing of 0 to the fan. Let me know if it works now.

Ricks-Lab commented 4 years ago

I just pushed a more robust approach.

csecht commented 4 years ago

Yes, that fixed it.

Ricks-Lab commented 4 years ago

Still not happy with the robustness of the solution, so I will delay official release for a week. I did enhance critical temp reading and display value for all sensors in amdgpu-ls and implemented a more generic read of voltages which will work if multiple voltage sensors are available.

Ricks-Lab commented 4 years ago

I think I have a more robust solution for dealing with variable types of numeric values in pac and monitor. While working on this, I implemented Enum for GPU Types and Vendors, so I no longer use numeric type indicators and use enumerated names instead. Perhaps the User Guide needs to be updated with these new type names:

GPU_Type = GpuEnum('type', 'Undefined PStatesNE PStates CurvePts')

Probably only the last two are relevant to the user.
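For readers unfamiliar with this form of definition, the sketch below uses the standard library `Enum` functional API with the same member names. `GpuEnum` itself is defined in amdgpu-utils and is not reproduced here; the name `GpuType` is a stand-in, and plain `enum.Enum` is assumed to behave similarly.

```python
from enum import Enum

# Functional API: member names come from the space-separated string.
GpuType = Enum('GpuType', 'Undefined PStatesNE PStates CurvePts')

gpu_type = GpuType.Undefined  # default until the p-states have been read
gpu_type = GpuType.CurvePts   # e.g. a curve-point card such as the Navi 10

# Comparisons use names rather than bare numbers, which reads more clearly.
print(gpu_type.name)
print(gpu_type is GpuType.CurvePts)
print([member.name for member in GpuType])
```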

csecht commented 4 years ago

I edited the User Guide accordingly and issued a pull request. "Type 0" was replaced with Type Undefined, etc. Type PStatesNE was not introduced in the guide.

Ricks-Lab commented 4 years ago

Actually, Type0 was used for an older GPU that had non-editable p-states, but the re-write in 3.x seems to have eliminated that classification. I have one old card. Maybe I will work with that to re-implement the classification of PStatesNE type.

csecht commented 4 years ago

Would Undefined be used as an else condition for unforeseen cards that don't match any known state?
What about the HD series of cards that some users might still have?

Ricks-Lab commented 4 years ago

Yes, Undefined is the default type. It gets set to PStates or CurvePts when the pstates are read from the pp_od_clk_voltage file.

It looks like the code I had to set the Type for HD series is missing after the rewrite. I need to put an old card back in and work it out again with the new code base.

I have made some user guide modifications, so be sure to pull the latest if you are going to make some edits.

Ricks-Lab commented 4 years ago

I have implemented a few more Enum objects and made a major change to how sensors are read. It should be much more efficient now. I think that was the last major change for release 3.2. I will release this weekend, so let me know if you see any issues.

Ricks-Lab commented 4 years ago

It looks like I gave away the R9 290x card I had, so I installed an older HD 7870 GPU. It had only a few parameters available, but I am not sure if this is due to not having amdgpu installed. I am using Ubuntu 20.04, and there is no amdgpu install package for it yet. Anyway, here is what I get with amdgpu-ls:

Card Number: 0
   Vendor: AMD
   Readable: True
   Writable: False
   Compute: False
   GPU UID: 
   Device ID: {'vendor': '0x1002', 'device': '0x6818', 'subsystem_vendor': '0x1462', 'subsystem_device': '0x2740'}
   Decoded Device ID: Pitcairn XT [Radeon HD 7870 GHz Edition]
   Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Pitcairn XT [Radeon HD 7870 GHz Edition]
   Display Card Model: Pitcairn XT [Radeon HD 7870 GHz Edition]
   PCIe ID: 08:00.0
      Link Speed: 8 GT/s
      Link Width: 16
   ##################################################
   Driver: radeon, amdgpu
   Compute Platform: None
   GPU Frequency/Voltage Control Type: Legacy
   HWmon: /sys/class/drm/card0/device/hwmon/hwmon4
   Card Path: /sys/class/drm/card0/device
   ##################################################
   Fan PWM Mode: [2, 'Dynamic']
   Current Fan PWM (%): 28
      Fan PWM Range (%): [0, 100]
   ##################################################
   Current  Temps (C): {'unnamed': 28.0}
   Critical Temps (C): {'unnamed': 120.0}
   Power DPM Force Performance Level: auto

Also, I am now reading the device id details and decoding from pciid file for the non-readable onboard GPUs. Here is what I get for my server system:

Card Number: 0
   Vendor: ASPEED
   Readable: False
   Writable: False
   Compute: False
   Device ID: {'vendor': '0x1a03', 'device': '0x2000', 'subsystem_vendor': '0x1458', 'subsystem_device': '0x1000'}
   Decoded Device ID: ASPEED Graphics Family
   Card Model: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
   PCIe ID: c4:00.0
   Driver: ast
   HWmon: None
   Card Path: /sys/class/drm/card0/device

I ran into a problem where the missing values in Legacy cards cause problems with monitor and plot, so I exclude them when getting a list of readable GPUs.

csecht commented 4 years ago

So, for the User Guide, what is the difference between PStatesNE and Legacy cards?

csecht commented 4 years ago

A minor point in formatting output from amdgpu-ls:

   Current  Temps (C): {'mem': 88.0, 'edge': 63.0, 'junction': 69.0}
   Critical Temps (C): {'mem': 99.0, 'junction': 99.0, 'edge': 118.0}

For Current Temps, the order of 'edge' and 'junction' ought to be switched, to match the order in Critical Temps (or vice versa).

Ricks-Lab commented 4 years ago

I am concerned that the observations for HD 7870 are very different from what I observed for R9 290x. Not sure if it is a real difference, or an artifact of not having amdgpu driver package installed on my 20.04 system. Let's hold off documenting Legacy and PStatesNE until I get more clarity.

Ricks-Lab commented 4 years ago

A minor point in formatting output from amdgpu-ls:

   Current  Temps (C): {'mem': 88.0, 'edge': 63.0, 'junction': 69.0}
   Critical Temps (C): {'mem': 99.0, 'junction': 99.0, 'edge': 118.0}

For Current Temps, the order of 'edge' and 'junction' ought to be switched, to match the order in Critical Temps (or vice versa).

Implemented sorting of dictionaries for print in the latest on master.
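A minimal sketch of sorting dictionaries before printing, using the temperature values quoted above. The helper name `sorted_by_key` is hypothetical and the actual change on master may differ. It relies on dicts preserving insertion order, which is guaranteed from Python 3.7 and is an implementation detail of CPython 3.6.

```python
current = {'mem': 88.0, 'edge': 63.0, 'junction': 69.0}
critical = {'mem': 99.0, 'junction': 99.0, 'edge': 118.0}


def sorted_by_key(values):
    """Return a copy of the dict with its keys in alphabetical order."""
    return dict(sorted(values.items()))


print('Current  Temps (C):', sorted_by_key(current))
print('Critical Temps (C):', sorted_by_key(critical))
```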

csecht commented 4 years ago

I just remembered I had a Radeon HD 4650, so I installed it in my machine with Ubuntu 18.04, kernel 5.3.0, and amdgpu version 20.10-1048554, then ran amdgpu-ls from the most recent Master, and got this:

Traceback (most recent call last):
  File "./amdgpu-ls", line 147, in <module>
    main()
  File "./amdgpu-ls", line 94, in main
    gpu_list.set_gpu_list(clinfo_flag=True)
  File "/home/craig/amdgpu-utils-master/GPUmodules/GPUmodule.py", line 1379, in set_gpu_list
    hw_file_srch = glob.glob(os.path.join(card_path, env.GUT_CONST.hwmon_sub) + '?')
  File "/usr/lib/python3.6/posixpath.py", line 80, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

I get a similar error with all other amdgpu-utils commands, except amdgpu-chk. Here is information from lspci:

$ lspci -k -nn -s 01:00.0
01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] RV730 PRO [Radeon HD 4650] [1002:9498]
    Subsystem: PC Partner Limited / Sapphire Technology RV730 PRO [Radeon HD 4650] [174b:9498]
    Kernel modules: radeon

Ricks-Lab commented 4 years ago

It looks like card_path is not set. I made some changes to deal with it by setting Type to a new type, Unsupported. Could you provide debug output so that I can make sure the solution is robust?

csecht commented 4 years ago

$ ./amdgpu-ls --debug debug_gpu-utils_20200610-192949.log

Ricks-Lab commented 4 years ago

Is that log from the latest on master? I added a few more log statements in the latest.

csecht commented 4 years ago

Sorry. Here is the terminal stdout:

$ ./amdgpu-ls --debug
Ubuntu: Validated
Traceback (most recent call last):
  File "./amdgpu-ls", line 147, in <module>
    main()
  File "./amdgpu-ls", line 94, in main
    gpu_list.set_gpu_list(clinfo_flag=True)
  File "/home/craig/amdgpu-utils-master/GPUmodules/GPUmodule.py", line 1418, in set_gpu_list
    'compute_platform': opencl_device_version})
  File "/home/craig/amdgpu-utils-master/GPUmodules/GPUmodule.py", line 607, in populate_prm_from_dict
    source_value.replace('{}card'.format(env.GUT_CONST.card_root), '').replace('/device', ''))
AttributeError: 'NoneType' object has no attribute 'replace'

and here is the debug file: debug_gpu-utils_20200610-193528.log

Ricks-Lab commented 4 years ago

I think I have covered the other places where card_path is referenced. Let me know when you get a chance to try it out.

csecht commented 4 years ago

Got it. Here is the terminal output:

$ ./amdgpu-ls --debug
Ubuntu: Validated
Traceback (most recent call last):
  File "./amdgpu-ls", line 147, in <module>
    main()
  File "./amdgpu-ls", line 94, in main
    gpu_list.set_gpu_list(clinfo_flag=True)
  File "/home/craig/amdgpu-utils-master/GPUmodules/GPUmodule.py", line 1424, in set_gpu_list
    rdata = self[gpu_uuid].read_gpu_sensor('id', vendor=GpuItem.GPU_Vendor.AMD, sensor_type='DEVICE')
  File "/home/craig/amdgpu-utils-master/GPUmodules/GPUmodule.py", line 954, in read_gpu_sensor
    file_path = os.path.join(sensor_path, sensor_file)
  File "/usr/lib/python3.6/posixpath.py", line 80, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

and the debug file: debug_gpu-utils_20200610-195255.log

Ricks-Lab commented 4 years ago

Looks like readable flag was still True for unsupported GPUs. I fixed that.

csecht commented 4 years ago

Hmmmm. The terminal:

$ ./amdgpu-ls --debug
Ubuntu: Validated
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.6/logging/__init__.py", line 994, in emit
    msg = self.format(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 840, in format
    return fmt.format(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 577, in format
    record.message = record.getMessage()
  File "/usr/lib/python3.6/logging/__init__.py", line 338, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "./amdgpu-ls", line 147, in <module>
    main()
  File "./amdgpu-ls", line 94, in main
    gpu_list.set_gpu_list(clinfo_flag=True)
  File "/home/craig/amdgpu-utils-master/GPUmodules/GPUmodule.py", line 1384, in set_gpu_list
    logger.debug('GPU[{}] type set to Unsupported', gpu_uuid)
Message: 'GPU[{}] type set to Unsupported'
Arguments: ('583a7958fb3742a492abed0a9f430573',)
Traceback (most recent call last):
  File "./amdgpu-ls", line 147, in <module>
    main()
  File "./amdgpu-ls", line 94, in main
    gpu_list.set_gpu_list(clinfo_flag=True)
  File "/home/craig/amdgpu-utils-master/GPUmodules/GPUmodule.py", line 1427, in set_gpu_list
    rdata = self[gpu_uuid].read_gpu_sensor('id', vendor=GpuItem.GPU_Vendor.AMD, sensor_type='DEVICE')
  File "/home/craig/amdgpu-utils-master/GPUmodules/GPUmodule.py", line 954, in read_gpu_sensor
    file_path = os.path.join(sensor_path, sensor_file)
  File "/usr/lib/python3.6/posixpath.py", line 80, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

and debug: debug_gpu-utils_20200610-203338.log

Ricks-Lab commented 4 years ago

Oops... Used wrong string format in logger. Fixed and pushed.
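The "Logging error" in the previous comment occurs because the standard `logging` module formats a message as `msg % args`, so a `{}` placeholder combined with a positional argument fails. The sketch below reproduces the error and shows two working forms. The logger name is hypothetical and the UUID is the one from the traceback.

```python
import logging

logger = logging.getLogger('gpu-utils-example')  # hypothetical logger name
gpu_uuid = '583a7958fb3742a492abed0a9f430573'

# Reproduce the failure: '{}' is not a %-style placeholder.
try:
    'GPU[{}] type set to Unsupported' % (gpu_uuid,)
    error_text = None
except TypeError as err:
    error_text = str(err)
print(error_text)

# Working form 1: %-style placeholder, formatted lazily by logging.
logger.debug('GPU[%s] type set to Unsupported', gpu_uuid)

# Working form 2: format the string before handing it to the logger.
logger.debug('GPU[{}] type set to Unsupported'.format(gpu_uuid))

record_msg = 'GPU[%s] type set to Unsupported' % (gpu_uuid,)
print(record_msg)
```

The first form has the small advantage that the string is only built when the debug level is actually enabled.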

csecht commented 4 years ago

not quite...

$ ./amdgpu-ls --debug
Ubuntu: Validated
Traceback (most recent call last):
  File "./amdgpu-ls", line 147, in <module>
    main()
  File "./amdgpu-ls", line 94, in main
    gpu_list.set_gpu_list(clinfo_flag=True)
  File "/home/craig/amdgpu-utils-master/GPUmodules/GPUmodule.py", line 1428, in set_gpu_list
    rdata = self[gpu_uuid].read_gpu_sensor('id', vendor=GpuItem.GPU_Vendor.AMD, sensor_type='DEVICE')
  File "/home/craig/amdgpu-utils-master/GPUmodules/GPUmodule.py", line 954, in read_gpu_sensor
    file_path = os.path.join(sensor_path, sensor_file)
  File "/usr/lib/python3.6/posixpath.py", line 80, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

debug_gpu-utils_20200610-203912.log

Ricks-Lab commented 4 years ago

It looks like the readable flag is still True. Not sure why, so I have added more logger statements.

csecht commented 4 years ago

The debug says it's looking in the path /sys/devices/, but the only thing there is the CPU. Shouldn't it look in /sys/class/drm/ where the GPUs are? The HD 4650 is in the first PCI slot, so ...

$ ls /sys/class/drm/card0/device
ari_enabled               current_link_width  enable         irq             msi_bus    resource      subsystem
boot_vga                  d3cold_allowed      firmware_node  label           msi_irqs   resource0     subsystem_device
broken_parity_status      device              graphics       local_cpulist   numa_node  resource2     subsystem_vendor
class                     dma_mask_bits       i2c-0          local_cpus      power      resource2_wc  uevent
config                    driver              i2c-1          max_link_speed  remove     resource4     vendor
consistent_dma_mask_bits  driver_override     i2c-2          max_link_width  rescan     revision
current_link_speed        drm                 index          modalias        reset      rom

In the /sys/devices directory:

$ ls /sys/devices/
breakpoint   cstate_pkg  isa          msr         pnp0      system      uncore_cbox_0  uprobe
cpu          i915        kprobe       pci0000:00  power     tracepoint  uncore_cbox_1  virtual
cstate_core  intel_pt    LNXSYSTM:00  platform    software  uncore_arb  uncore_imc

Ricks-Lab commented 4 years ago

The way to associate the correct card path is by looking for the full system device path with the pcie id in the pathname. This version of the pathname is derived from the typical card_path name using resolve.

So I check full system path of each potential card path for a match to the pcie_id. If a match is found, then that card path is associated with the pcie_id. For this card, no match is found.
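The matching step described above can be sketched roughly as follows. The function name `find_card_path` is hypothetical, and matching on the last path component is an assumption about how the comparison might be done, not the code in GPUmodule.py. In practice the mapping would be built by resolving each card path, for example with `os.path.realpath`.

```python
import os


def find_card_path(pcie_id, system_paths):
    """Return the card path whose resolved system path ends with pcie_id.

    system_paths maps each card path to its resolved /sys/devices path.
    Returns None when no card path matches.
    """
    for card_path, system_path in system_paths.items():
        # The last component looks like '0000:04:00.0' (domain:bus:dev.fn).
        if os.path.basename(system_path).endswith(pcie_id):
            return card_path
    return None


# Resolved paths as reported for csecht's system in this thread.
paths = {
    '/sys/class/drm/card1/device':
        '/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/0000:03:00.0/0000:04:00.0',
    '/sys/class/drm/card0/device':
        '/sys/devices/pci0000:00/0000:00:02.0',
}

print(find_card_path('04:00.0', paths))  # the Navi 10 card
print(find_card_path('01:00.0', paths))  # the HD 4650: no card path matches
```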

Ricks-Lab commented 4 years ago

From the log file:

This card has pcie_id of: 01:00.0 [01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV730 PRO [Radeon HD 4650]

There are 2 potential card paths: 0 & 1

/sys/class/drm/card1/device = /sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/0000:03:00.0/0000:04:00.0

/sys/class/drm/card0/device = /sys/devices/pci0000:00/0000:00:02.0

Neither matches the pcie_id of 01:00.0

so, GPU type set to Unsupported

Ricks-Lab commented 4 years ago

Even if we find that there is a valid card path, we still need to fix the issue where an unsupported card is interpreted as readable. Let's get this one fixed first, then work on a potential issue of matching a pcie_id to a card path.

Ricks-Lab commented 4 years ago

To make the card path details more clear, I have added the system card path to the output of amdgpu-ls. I have also implemented the amdgpu-ls --short option to give a brief report of basic GPU properties.

Ricks-Lab commented 4 years ago

I have discovered an inconsistency in the way I was accessing the list of GPUs. Maybe this was the source of unreadable cards being read. But the real problem was that I was only checking the readability flag in GpuList.read_gpu_sensor_data and not in GpuItem.read_gpu_sensor_data.

csecht commented 4 years ago

Yes! it's working now to deal with the unsupported card:

$ ./amdgpu-ls --short
Detected GPUs: INTEL: 1, AMD: 2
AMD: amdgpu version: 20.10-1048554
AMD: Wattman features enabled: 0xfffd7fff
3 total GPUs, 1 rw, 0 r-only, 0 w-only

Card Number: 0
   Vendor: INTEL
   Readable: False
   Writable: False
   Compute: False
   Device ID: {'device': '0x3e91', 'subsystem_device': '0x8694', 'subsystem_vendor': '0x1043', 'vendor': '0x8086'}
   PCIe ID: 00:02.0
   HWmon: None
   Card Path: /sys/class/drm/card0/device
   System Card Path: /sys/devices/pci0000:00/0000:00:02.0

Card Number: 
   Vendor: AMD
   Readable: False
   Writable: False
   Compute: False
   Device ID: {'device': '', 'subsystem_device': '', 'subsystem_vendor': '', 'vendor': ''}
   PCIe ID: 01:00.0
   HWmon: None
   Card Path: None
   System Card Path: None

Card Number: 1
   Vendor: AMD
   Readable: True
   Writable: True
   Compute: True
   Device ID: {'device': '0x731f', 'subsystem_device': '0xe411', 'subsystem_vendor': '0x1da2', 'vendor': '0x1002'}
   Display Card Model: Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
   PCIe ID: 04:00.0
   HWmon: /sys/class/drm/card1/device/hwmon/hwmon3
   Card Path: /sys/class/drm/card1/device
   System Card Path: /sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/0000:03:00.0/0000:04:00.0

Now, about that card path... While amdgpu-ls lists the unsupported card's name, it doesn't list the card path or pci-ids. The undefined card, however, does have a path, and its vendor and device pci-ids are listed with $ lspci -k -nn (as previously commented). Example:

01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] RV730 PRO [Radeon HD 4650] [1002:9498]
    Subsystem: PC Partner Limited / Sapphire Technology RV730 PRO [Radeon HD 4650] [174b:9498]
    Kernel modules: radeon

A grep for those pci-ids shows that card's path is /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/ and pci-id data is in there. Examples:

$ cat /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/uevent
PCI_CLASS=30000
PCI_ID=1002:9498
PCI_SUBSYS_ID=174B:9498
PCI_SLOT_NAME=0000:01:00.0
MODALIAS=pci:v00001002d00009498sv0000174Bsd00009498bc03sc00i00

And...

$ cat /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/subsystem_vendor
0x174b
$ cat /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/subsystem_device
0x9498

$ ls /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/
ari_enabled               current_link_speed  enable          max_link_width  rescan        resource4         uevent
boot_vga                  current_link_width  firmware_node   modalias        reset         revision          vendor
broken_parity_status      d3cold_allowed      irq             msi_bus         resource      rom
class                     device              local_cpulist   numa_node       resource0     subsystem
config                    dma_mask_bits       local_cpus      power           resource0_wc  subsystem_device
consistent_dma_mask_bits  driver_override     max_link_speed  remove          resource2     subsystem_vendor

csecht commented 4 years ago

A minor point of formatting amdgpu-ls --help stdout:

$ ./amdgpu-ls --help
usage: amdgpu-ls [-h] [--about] [--short] [--table] [--pstates] [--ppm]
                 [--clinfo] [--no_fan] [-d]

optional arguments:
  -h, --help   show this help message and exit
  --about      README
  --short      Short listing basic GPU details
  --table      Output table of basic GPU details
  --pstates    Output pstate tables instead of GPU details
  --ppm        Output power/performance mode tables instead of GPU details
  --clinfo     Include openCL with card details
  --no_fan     do not include fan setting options
  -d, --debug  Debug output

To match the terminal output of the --table option, the help text should instead read: --table Current status of readable GPUs

Ricks-Lab commented 4 years ago

I am going to need to think about how to deal with cards that don't have a normal card_path. I am currently only examining the system path of card paths that exist. I will work on it over the weekend.

Hope you don't mind, but I have made significant changes across all modules to deal with the issue causing confusion in the way I access gpu's in a GPU List. The code is now much more intuitive. I have only tried on one of my systems, but it is getting late here. I will push to master. Let me know if you find any issues. It also includes the help format change.

csecht commented 4 years ago

I ran through all the commands and everything is working. Nice. The amdgpu-ls output is clear regarding how many GPUs are detected and which can be modified:

$ ./amdgpu-ls
Detected GPUs: INTEL: 1, AMD: 2
AMD: amdgpu version: 20.10-1048554
AMD: Wattman features enabled: 0xfffd7fff
3 total GPUs, 1 rw, 0 r-only, 0 w-only

Card Number: 0
   Vendor: INTEL
   Readable: False
   Writable: False
   Compute: False
   Device ID: {'device': '0x3e91', 'subsystem_device': '0x8694', 'subsystem_vendor': '0x1043', 'vendor': '0x8086'}
   Decoded Device ID: 8th Gen Core Processor Gaussian Mixture Model
   Card Model: Intel Corporation 8th Gen Core Processor Gaussian Mixture Model
   PCIe ID: 00:02.0
   Driver: i915
   HWmon: None
   Card Path: /sys/class/drm/card0/device
   System Card Path: /sys/devices/pci0000:00/0000:00:02.0

Card Number: 
   Vendor: AMD
   Readable: False
   Writable: False
   Compute: False
   Device ID: {'device': '', 'subsystem_device': '', 'subsystem_vendor': '', 'vendor': ''}
   Decoded Device ID: UNDETERMINED
   Card Model: Advanced Micro Devices, Inc. [AMD/ATI] RV730 PRO [Radeon HD 4650]
   PCIe ID: 01:00.0
   Driver: radeon
   HWmon: None
   Card Path: None
   System Card Path: None

Card Number: 1
   Vendor: AMD
   Readable: True
>and so on...

Ricks-Lab commented 4 years ago

Still researching how to get the /sys/devices path for a specific pcie ID. My first attempt is this code:

sys_pci_dirs = glob.iglob('/sys/devices/pci*:*/**/*:{}'.format(pcie_id), recursive=True)

But it maxes out cpu for a long time and hasn't returned anything useful yet. Still need to do some research.
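One alternative that avoids a recursive glob, offered as a suggestion rather than as what amdgpu-utils does: on Linux, /sys/bus/pci/devices contains one symlink per PCI device, named by its full address (domain:bus:device.function), and resolving that link gives the /sys/devices path. The function name `system_path_for` is hypothetical, and the default domain of `0000` is an assumption that holds on typical single-domain systems.

```python
import os


def system_path_for(pcie_id, bus_root='/sys/bus/pci/devices', domain='0000'):
    """Resolve a PCIe ID such as '01:00.0' to its full system device path.

    Returns None when there is no entry for that address.
    """
    entry = os.path.join(bus_root, '{}:{}'.format(domain, pcie_id))
    if not os.path.exists(entry):
        return None
    # Follow the symlink to the real location under /sys/devices.
    return os.path.realpath(entry)


# Prints a /sys/devices/... path, or None if no such device is present.
print(system_path_for('01:00.0'))
```

Because this is a single path lookup rather than a tree walk, it should not show the long CPU-bound search seen with the recursive pattern.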

csecht commented 4 years ago

Maybe the use of a naked '*' is too greedy. Would a more explicit regex work?

sys_pci_dirs = glob.iglob('/sys/devices/pci\d*:\d*/\d*:{}'.format(pcie_id), recursive=True)

or this, which is more general but uses dot '.' to give something to work on, and '?' removes the greediness of '*':

sys_pci_dirs = glob.iglob('/sys/devices/pci.*:.*?/.*?:{}'.format(pcie_id), recursive=True)

I tested these regex out on https://pythex.org/ and both seem to work for matching up to the pcie-id.
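One caveat on the patterns suggested above: `glob` uses shell-style wildcards (`*`, `?`, `[...]`), not regular expressions, so sequences such as `\d` and `.*?` do not have their regex meaning there, even though they match on a regex tester. The sketch below demonstrates this with `fnmatch`, which implements glob's matching rules, and then shows a regex applied to candidate paths instead. The helper name `make_pci_path_regex` is hypothetical, and the candidate paths are the ones quoted earlier in this thread.

```python
import fnmatch
import re

name = 'pci0000:00'

# In a glob pattern a backslash is a literal character, so '\d' is not a digit.
print(fnmatch.fnmatchcase(name, r'pci\d*:\d*'))
# A shell-style character class is what glob understands.
print(fnmatch.fnmatchcase(name, 'pci[0-9]*:[0-9]*'))


def make_pci_path_regex(pcie_id):
    """Build a regex matching a /sys/devices PCI path that ends in pcie_id."""
    return re.compile(
        r'^/sys/devices/pci[0-9a-f]+:[0-9a-f]+/'  # PCI root, e.g. pci0000:00
        r'(?:[^/]+/)*?'                           # any intermediate bridges
        r'[0-9a-f]{4}:' + re.escape(pcie_id) + r'$'
    )


candidates = [
    '/sys/devices/pci0000:00/0000:00:02.0',
    '/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0',
]
pattern = make_pci_path_regex('01:00.0')
matches = [path for path in candidates if pattern.match(path)]
print(matches)
```

A regex like this still needs a list of candidate paths to test, so it complements a directory walk rather than replacing it.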