Ricks-Lab / gpu-utils

A set of utilities for monitoring and customizing GPU performance
GNU General Public License v3.0

Linux Distribution Dependent Behavior #69

Closed Ricks-Lab closed 3 years ago

Ricks-Lab commented 4 years ago

I would like to build in distribution-dependent behavior and need help determining distribution-specific commands. This includes how to determine which distribution is in use and which command reports whether a package is installed.

  1. Which distribution: Potentially use lsb_release, /etc/*-release, /proc/version, or hostnamectl
  2. Which tool to determine if a package is installed: dpkg for Debian
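
For the first item, one possible approach is to read /etc/os-release and look at its ID and ID_LIKE fields. The following is only a minimal sketch under that assumption; the helper names parse_os_release and distro_ids are hypothetical and not part of gpu-utils.

```python
import shlex


def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # Values use shell-style quoting, so let shlex remove the quotes.
        parts = shlex.split(raw)
        fields[key] = parts[0] if parts else ""
    return fields


def distro_ids(fields):
    """Return ID followed by any ID_LIKE entries, lower-cased."""
    ids = [fields.get("ID", "")] + fields.get("ID_LIKE", "").split()
    return [name.lower() for name in ids if name]
```

The text passed to parse_os_release would typically be the contents of /etc/os-release. Checking ID_LIKE as well as ID lets a derivative distribution fall back to the behavior of its parent.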

Here is the list of distributions that I am aware are being used:

  1. Debian - @Ricks-Lab @smoe
  2. Gentoo - @CH3CN
  3. Arch - @berturion

Looking for feedback on distro behavior. Thanks!

Ricks-Lab commented 4 years ago

Here are details for Ubuntu:

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.4 LTS
Release:    18.04
Codename:   bionic
cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.4 LTS"
cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.4 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
cat /proc/version
Linux version 5.3.0-46-generic (buildd@lcy01-amd64-013) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #38~18.04.1-Ubuntu SMP Tue Mar 31 04:17:56 UTC 2020
hostnamectl
   Static hostname: nexon
   Pretty hostname: Nexon
         Icon name: computer-server
           Chassis: server
        Machine ID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
           Boot ID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  Operating System: Ubuntu 18.04.4 LTS
            Kernel: Linux 5.3.0-46-generic
      Architecture: x86-64
berturion commented 4 years ago

Hello, here are the details for Arch Linux:

$ lsb_release -a
LSB Version:    1.4
Distributor ID: Arch
Description:    Arch Linux
Release:    rolling
Codename:   n/a

$ cat /etc/lsb-release
LSB_VERSION=1.4
DISTRIB_ID=Arch
DISTRIB_RELEASE=rolling
DISTRIB_DESCRIPTION="Arch Linux"

$ cat /etc/os-release
NAME="Arch Linux"
PRETTY_NAME="Arch Linux"
ID=arch
BUILD_ID=rolling
ANSI_COLOR="0;36"
HOME_URL="https://www.archlinux.org/"
DOCUMENTATION_URL="https://wiki.archlinux.org/"
SUPPORT_URL="https://bbs.archlinux.org/"
BUG_REPORT_URL="https://bugs.archlinux.org/"
LOGO=archlinux

$ cat /proc/version
Linux version 5.6.5-arch3-1 (linux@archlinux) (gcc version 9.3.0 (Arch Linux 9.3.0-1)) #1 SMP PREEMPT Sun, 19 Apr 2020 13:14:25 +0000

$ hostnamectl
   Static hostname: arnold
         Icon name: computer-laptop
           Chassis: laptop
        Machine ID: *********
           Boot ID: *********
  Operating System: Arch Linux
            Kernel: Linux 5.6.5-arch3-1
      Architecture: x86-64

Also, the Arch Linux package manager is called pacman. The Arch wiki is known to be complete and precise, and you will find all the information you need there to query the package database. I think you can group the Arch Linux specific commands together with those for Arch-based distros.
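
The grouping idea can be sketched as a lookup from distribution IDs to a package query command. This is only an illustration: the table QUERY_COMMANDS and the function query_command are hypothetical names, the list of IDs is assumed to come from the ID and ID_LIKE fields of /etc/os-release, and the exact command options a real implementation needs may differ.

```python
# Assumed mapping from os-release IDs to a package query command.
# dpkg, pacman and equery are the tools named in this thread.
QUERY_COMMANDS = {
    "debian": ["dpkg", "-l"],
    "arch": ["pacman", "-Qs"],
    "gentoo": ["equery", "list"],
}


def query_command(ids, package):
    """Return the argv list for the first recognized distro ID, else None."""
    for distro in ids:
        if distro in QUERY_COMMANDS:
            return QUERY_COMMANDS[distro] + [package]
    return None
```

Because the IDs are checked in order, a derivative that lists arch in ID_LIKE would be handled by the pacman entry without needing its own row.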

CH3CN commented 4 years ago

Output for Gentoo. Gentoo uses Portage as its package manager.

lsb_release -a
LSB Version:    n/a
Distributor ID: Gentoo
Description:    Gentoo Base System release 2.6
Release:        2.6
Codename:       n/a
cat /etc/lsb-release 
DISTRIB_ID="Gentoo"
cat /etc/os-release 
NAME=Gentoo
ID=gentoo
PRETTY_NAME="Gentoo/Linux"
ANSI_COLOR="1;32"
HOME_URL="https://www.gentoo.org/"
SUPPORT_URL="https://www.gentoo.org/support/"
BUG_REPORT_URL="https://bugs.gentoo.org/"
cat /proc/version 
Linux version 5.6.2-gentoo (root@gentoo) (gcc version 9.2.0 (Gentoo 9.2.0-r2 p3)) #1 SMP Mon Apr 6 10:33:34 EDT 2020
Ricks-Lab commented 4 years ago

@smoe Are you using a Debian distribution? Can you provide lsb_release -a output? I am updating amdgpu-chk to indicate verified distros.
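
For reference, the lsb_release -a output posted in this thread is a series of "Key: value" lines, so it could be parsed along these lines. This is a sketch only; parse_lsb_release is a hypothetical helper and not the code used in amdgpu-chk.

```python
def parse_lsb_release(output):
    """Turn `lsb_release -a` style output into a dict (hypothetical helper)."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        # Lines without a colon, such as the "No LSB modules" notice, are skipped.
        if sep:
            fields[key.strip()] = value.strip()
    return fields
```

The Distributor ID and Description fields are the ones that differ most clearly between the distributions reported here.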

smoe commented 4 years ago
$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux bullseye/sid
Release:    unstable
Codename:   sid
$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 10 (buster)
Release:    10
Codename:   buster
$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux bullseye/sid
Release:    testing
Codename:   bullseye
Ricks-Lab commented 4 years ago

I would now like to implement an amdgpu driver confirmation function for Arch and Gentoo. I need the output of the appropriate command for checking whether the package is installed, so that I can write a parser. Please post that output here:

CH3CN commented 4 years ago

For Gentoo, there are many ways to find out if a package is installed. One is equery, which is part of the gentoolkit package:

Example of package not installed

$ equery list dev-libs/amdgpu-pro-opencl
!!! No installed packages matching 'dev-libs/amdgpu-pro-opencl'
 * Searching for amdgpu-pro-opencl in dev-libs ...

Example of package installed

$ equery list dev-libs/amdgpu-pro-opencl
 * Searching for amdgpu-pro-opencl ...
[IP-] [  ] dev-libs/amdgpu-pro-opencl-19.30.838629:0

The I in the brackets indicates the package is currently installed and P indicates the package is available in the Portage tree.
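
A parser for this output could look for result lines whose first bracket contains I. The sketch below is based only on the two sample outputs above; the regular expression and the name installed_from_equery are assumptions, and other equery versions or options may format lines differently.

```python
import re

# Matches result lines such as "[IP-] [  ] dev-libs/amdgpu-pro-opencl-19.30.838629:0".
EQUERY_LINE = re.compile(r"^\[(?P<status>[^\]]+)\]\s+\[[^\]]*\]\s+(?P<atom>\S+)")


def installed_from_equery(output):
    """Return the package atoms that equery marks as installed (hypothetical helper)."""
    found = []
    for line in output.splitlines():
        match = EQUERY_LINE.match(line.strip())
        # "I" in the first bracket means the package is currently installed.
        if match and "I" in match.group("status"):
            found.append(match.group("atom"))
    return found
```

An empty list then corresponds to the "No installed packages matching" case.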

qlist, which is part of the portage-utils package, can also be used. When the package is installed, qlist prints the name of the queried package. When the package is not installed, nothing is displayed.

$ qlist -I dev-libs/amdgpu-pro-opencl
dev-libs/amdgpu-pro-opencl

$ qlist -I dev-libs/amdgpu-pro-opencl
$
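
Since qlist -I prints nothing for a missing package, a check only needs to see whether any non-blank line came back. A minimal sketch, with qlist_installed as a hypothetical name:

```python
def qlist_installed(output):
    """Return the package names printed by `qlist -I`; an empty list means not installed."""
    return [line.strip() for line in output.splitlines() if line.strip()]
```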

Lastly, if neither of those packages is installed, emerge can be used to check if the package is installed.

Installed

$ emerge -p dev-libs/amdgpu-pro-opencl

These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild   Rf  ~] dev-libs/amdgpu-pro-opencl-19.30.838629

Not installed

$ emerge -p dev-libs/amdgpu-pro-opencl

These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild  N     ] dev-util/patchelf-0.10 
[ebuild  N F   ] dev-libs/amdgpu-pro-opencl-19.30.838629  ABI_X86="32 (64)"

The letters contained in the brackets could be any of the following:

N   new (not yet installed)
S   new SLOT installation (side-by-side versions)
U   updating (to another version)
D   downgrading (best version seems lower)
r   reinstall (forced for some reason, possibly due to slot or sub-slot)
R   replacing (remerging same version)
F   fetch restricted (must be manually downloaded)
f   fetch restricted (already downloaded)
I   interactive (requires user input)
B   blocked by another package (unresolved conflict)
b   blocked by another package (automatically resolved conflict)
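
Based on the two emerge -p samples above, one way to interpret the output is to find the [ebuild ...] line for the package of interest and treat an N flag as "not installed". This is a sketch under that assumption only; emerge_flags and looks_installed are hypothetical names, and flags such as S or U may need separate handling in a real check.

```python
import re

# Matches lines such as "[ebuild  N F   ] dev-libs/amdgpu-pro-opencl-19.30.838629".
EBUILD_LINE = re.compile(r"^\[ebuild\s+(?P<flags>[^\]]*)\]\s+(?P<atom>\S+)")


def emerge_flags(output, package):
    """Return the flag string for `package` from `emerge -p` output, or None."""
    for line in output.splitlines():
        match = EBUILD_LINE.match(line.strip())
        # The atom is the package name followed by "-<version>".
        if match and match.group("atom").startswith(package + "-"):
            return match.group("flags").strip()
    return None


def looks_installed(flags):
    """Assume a package is installed when it is listed without the N (new) flag."""
    return flags is not None and "N" not in flags
```

Matching on the package name matters because dependencies, such as dev-util/patchelf in the second sample, are listed with their own flags.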

I hope these examples give you an idea of how to query for the package. I did not have to install rocm or similar packages. I haven't tried using all of the utilities built, as I am having an issue with vext in the virtual environment when running amdgpu-monitor. Vext is installed but I ran into the issue described here: https://github.com/stuaxo/vext/issues/61. I haven't looked into this further though, as the Gentoo system with the AMD GPU is headless and runs BOINC in the background.

$ ./amdgpu-monitor 
Vext disabled:  There was an issue getting the system site packages.
Vext disabled:  There was an issue getting the system site packages.
gi import error: No module named 'gi'
gi is required for amdgpu-monitor
   In a venv, first install vext:  pip install --no-cache-dir vext
   Then install vext.gi:  pip install --no-cache-dir vext.gi
berturion commented 4 years ago
> * **Arch** @berturion - I think the command will be `pacman -Qs amdgpu rocm`

Specifying amdgpu and rocm on the same command line results in empty output. Also, on my machine pacman -Qs rocm outputs nothing.

$ pacman -Qs amdgpu
local/xf86-video-amdgpu 19.1.0-1 (xorg-drivers)
    X.org amdgpu video driver
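
The empty result for two terms is consistent with pacman returning only packages that match all of the given search terms, so a check would likely need one query per term. A parser for the single-term output above might look like the following sketch; the regular expression and the name parse_pacman_qs are assumptions based on this one sample.

```python
import re

# Matches header lines such as "local/xf86-video-amdgpu 19.1.0-1 (xorg-drivers)".
PACMAN_LINE = re.compile(r"^local/(?P<name>\S+)\s+(?P<version>\S+)")


def parse_pacman_qs(output):
    """Return (name, version) pairs from `pacman -Qs` output (hypothetical helper)."""
    packages = []
    for line in output.splitlines():
        # Indented description lines do not start with "local/" and are skipped.
        match = PACMAN_LINE.match(line)
        if match:
            packages.append((match.group("name"), match.group("version")))
    return packages
```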
Ricks-Lab commented 4 years ago

> I haven't tried using all of the utilities built, as I am having an issue with vext in the virtual environment when running amdgpu-monitor. Vext is installed but I ran into the issue described here: stuaxo/vext#61. I haven't looked into this further though, as the Gentoo system with the AMD GPU is headless and runs BOINC in the background.
>
> $ ./amdgpu-monitor
> Vext disabled:  There was an issue getting the system site packages.
> Vext disabled:  There was an issue getting the system site packages.
> gi import error: No module named 'gi'
> gi is required for amdgpu-monitor
>    In a venv, first install vext:  pip install --no-cache-dir vext
>    Then install vext.gi:  pip install --no-cache-dir vext.gi

Running in a venv is not required. You can run the following to meet the package requirements without venv: sudo -H pip3 install --no-cache-dir -r requirements.txt

Ricks-Lab commented 4 years ago

@berturion I have made the change. Let me know when you have a chance to try it out.

Ricks-Lab commented 4 years ago

@CH3CN I have made the change. Let me know when you have a chance to try it out.

If it doesn't work, please run amdgpu-ls --debug for more details.

berturion commented 4 years ago

> @berturion I have made the change. Let me know when you have a chance to try it out.

OK, I will as soon as possible. It's a pleasure to help. That said, I think you could save time by installing Arch Linux in a VM and trying the commands directly, without depending on my feedback. Arch is VERY easy to install with https://www.anarchylinux.org/ in VirtualBox, or from a USB stick written with Balena Etcher or the dd command. It is just a suggestion. I am happy to help.

stuaxo commented 4 years ago

Hi, author of vext here - I had a quick go at installing amdgpu and can reproduce the issue with vext. I've got a little bit of free time coming up next week, so hopefully I should be able to have a look at this then.

Cheers S

CH3CN commented 4 years ago

> @CH3CN I have made the change. Let me know when you have a chance to try it out.
>
> If it doesn't work, please run amdgpu-ls --debug for more details.

@Ricks-Lab, sorry for the delay in responding. I just had time yesterday to check out the latest changes. I did not run the utils in a venv, so I didn't have to worry about the vext issue. The only issue I have is that the amdgpu-monitor display is not wide enough to show the full name of the "Model". Also, what's the graceful way of terminating amdgpu-monitor? I have been using ctrl-c.

I also tried using ROCm instead of the amdgpu-pro-opencl package from Gentoo but my processor and hardware are too old to meet the requirements (PCIe v3 and atomics).

ch3cn@gentoo ~/amdgpu-utils-master $ ./amdgpu-chk 
Using python 3.7.7
           Python version OK. 
Using Linux Kernel 5.7.4-gentoo
           OS kernel OK. 
Using Linux distribution: Gentoo Base System release 2.6
           Distro has been Validated. 
Command dpkg not found. Can not determine amdgpu version.
           gpu-utils can still be used. 
python3 venv is installed
           python3-venv OK. 
amdgpu-utils-env is NOT available
           amdgpu-utils-env should be configured per User Guide. 
Environment not configured. WARNING
Not in amdgpu-utils-env (Only needed if you want to duplicate dev env)
           amdgpu-utils-env can be activated per User Guide.
ch3cn@gentoo ~/amdgpu-utils-master $ ./amdgpu-ls 
Detected GPUs: INTEL: 1, AMD: 1
AMD: amdgpu version: dev-libs/amdgpu-pro-opencl-19.50.967956
AMD: Wattman features enabled: 0xfffd7fff
2 total GPUs, 1 rw, 0 r-only, 0 w-only

Card Number: 0
   Vendor: INTEL
   Readable: False
   Writable: False
   Compute: False
   Device ID: {'device': '0x0112', 'subsystem_device': '0x0112', 'subsystem_vendor': '0x1849', 'vendor': '0x8086'}
   Decoded Device ID: 2nd Generation Core Processor Family Integrated Graphics Controller
   Card Model: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)
   PCIe ID: 00:02.0
   Driver: i915
   GPU Type: Unsupported
   HWmon: None
   Card Path: /sys/class/drm/card0/device
   System Card Path: /sys/devices/pci0000:00/0000:00:02.0

Card Number: 1
   Vendor: AMD
   Readable: True
   Writable: True
   Compute: True
   GPU UID: None
   Device ID: {'device': '0x67ef', 'subsystem_device': '0x22de', 'subsystem_vendor': '0x1458', 'vendor': '0x1002'}
   Decoded Device ID: Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X]
   Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X] (rev cf)
   Display Card Model: Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X]
   PCIe ID: 01:00.0
      Link Speed: 5.0 GT/s PCIe
      Link Width: 8
   ##################################################
   Driver: amdgpu
   vBIOS Version: xxx-xxx-xxx
   Compute Platform: OpenCL 1.2 AMD-APP (3004.6)
   GPU Type: PStates
   HWmon: /sys/class/drm/card1/device/hwmon/hwmon2
   Card Path: /sys/class/drm/card1/device
   System Card Path: /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0
   ##################################################
   Current Power (W): 37.179
   Power Cap (W): 48.000
      Power Cap Range (W): [0, 72]
   Fan Enable: 0
   Fan PWM Mode: [2, 'Dynamic']
   Fan Target Speed (rpm): 973
   Current Fan Speed (rpm): 973
   Current Fan PWM (%): 31
      Fan Speed Range (rpm): [0, 4600]
      Fan PWM Range (%): [0, 100]
   ##################################################
   Current GPU Loading (%): 77
   Current Memory Loading (%): 64
   Current GTT Memory Usage (%): 45.289
      Current GTT Memory Used (GB): 1.359
      Total GTT Memory (GB): 3.000
   Current VRAM Usage (%): 97.992
      Current VRAM Used (GB): 1.960
      Total VRAM (GB): 2.000
   Current  Temps (C): {'edge': 61.0}
   Critical Temps (C): {'edge': 94.0}
   Current Voltages (V): {'vddgfx': 1031}
      Vddc Range: ['800mV', '1150mV']
   Current Clk Frequencies (MHz): {'mclk': 1750.0, 'sclk': 1212.0}
   Current SCLK P-State: [7, '1212Mhz']
      SCLK Range: ['214MHz', '1800MHz']
   Current MCLK P-State: [1, '1750Mhz']
      MCLK Range: ['300MHz', '2000MHz']
   Power Profile Mode: 1-3D_FULL_SCREEN
   Power DPM Force Performance Level: auto
┌─────────────┬────────────────┐
│Card #       │card1           │
├─────────────┼────────────────┤
│Model        │Baffin [Radeon R│
│GPU Load %   │100             │
│Mem Load %   │0               │
│VRAM Usage % │98.889          │
│GTT Usage %  │44.691          │
│Power (W)    │29.248          │
│Power Cap (W)│48.0            │
│Energy (kWh) │0.09            │
│T (C)        │55.0            │
│VddGFX (mV)  │1031            │
│Fan Spd (%)  │31              │
│Sclk (MHz)   │1212            │
│Sclk Pstate  │7               │
│Mclk (MHz)   │1750            │
│Mclk Pstate  │1               │
│Perf Mode    │1-3D_FULL_SCREEN│
└─────────────┴────────────────┘
Ricks-Lab commented 4 years ago

Yes, ctrl-c is the expected way to terminate when not running with the --gui option. Can you try that, just to make sure it works in your distro? Also, it would be good to check out amdgpu-plot. I have also posted a PyPI package and have started another issue thread to discuss issues with it, if you want to give it a try.

The model names are purposely truncated, as they can be too long for a useful display of multiple GPUs.

KeithMyers commented 4 years ago

In case you don't see my post in the other thread about your python command, the correct invocation is: pip3 install ricks-amdgpu-utils

CH3CN commented 4 years ago

> Yes, ctrl-c is the expected way to terminate when not running with the --gui option. Can you try that, just to make sure it works in your distro? Also, it would be good to check out amdgpu-plot. I have also posted a PyPI package and have started another issue thread to discuss issues with it, if you want to give it a try.
>
> The model names are purposely truncated, as they can be too long for a useful display of multiple GPUs.

Great. The computer is headless, but I used X forwarding over SSH to run amdgpu-monitor --gui and amdgpu-plot. It worked fine. The time shown at the top of the plot window seems to be in UTC. Is there a way to have it displayed in the local time zone?

[Screenshots: Screenshot_20200623_195350, Screenshot_20200623_195804]

I will try the PyPI package tomorrow.

Ricks-Lab commented 4 years ago

Thanks for checking it out! Looks like everything works. To use the local time zone instead of UTC, just use the --ltz option.

stuaxo commented 4 years ago

A little more progress on the vext front.

Under virtualenv and current setuptools everything is installed correctly, except the .pth file that enables it.

You can fix this by running $ vext -e