Ricks-Lab / gpu-utils

A set of utilities for monitoring and customizing GPU performance
GNU General Public License v3.0

amdgpu-plot erasing plot #39

Closed csecht closed 5 years ago

csecht commented 5 years ago

I haven't used amdgpu-plot in a while, but just noticed today that, after a few minutes running, it begins to erase the oldest readings. With every data tick added, displayed data points are removed from the left side.

[Screenshot from 2019-08-13 18-14-31]

csecht commented 5 years ago

Just another view of amdgpu-plot after it ran for a couple hours...

[Image: amdgpu-plot_shrinkage]

Ricks-Lab commented 5 years ago

I have never noticed this behavior in the past. I wonder if there is an old package dependency. I have a new project I am working on where I am leveraging a requirements file to make sure all dependencies are met. I will dig into this one over the weekend.

Ricks-Lab commented 5 years ago

Can you verify which version of matplotlib you are using with the following command:

./amdgpu-plot --about
csecht commented 5 years ago

Version: v2.5.2
Maintainer: RueiKe
Status: Stable Release
matplotlib version: 2.1.1
pandas version: 0.24.2
numpy version: 1.16.2

csecht commented 5 years ago

I just ran amdgpu-plot and it's graphing fine now. Over the past few days, I've done a system update and a few restarts, but am not sure what fixed it or why it was buggy. Next time I'll record the --about data when it, or any other module, starts acting up.
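For recording version data when a problem shows up, a minimal sketch along the following lines may help. It is not how --about is implemented; the function name `package_versions` is hypothetical, and it assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library.

```python
from importlib import metadata  # standard library in Python 3.8+


def package_versions(names):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed are reported as 'not installed'
    instead of raising an exception.
    """
    report = {}
    for name in names:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    # The three packages listed in the --about output above.
    for pkg, ver in package_versions(["matplotlib", "pandas", "numpy"]).items():
        print(f"{pkg} version: {ver}")
```

Saving this output next to a screenshot makes it easier to compare a working and a misbehaving environment later.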

Ricks-Lab commented 5 years ago

There is a known issue where the way I am using matplotlib eventually stops working correctly. It gets corrupted after about an hour. My approach is probably flawed, as I am redrawing the entire plot every update. There is a way to add to the plot, but I have not figured out how to implement it yet. There is a comment in the README indicating this.

I have started another private project, and I have found that alignment of python environments is critical among collaborators. I am using python venv to accomplish it. I am not sure of the best way to make it available to casual users, but here is how I implement it in the other project. First, install venv:

sudo apt install -y python3-venv

Then create and activate the environment while in the project root directory:

python3 -m venv amdgpu-env
source amdgpu-env/bin/activate

The first time, or anytime the requirements file changes, you will need to execute this:

pip install --no-cache-dir -r requirements.txt

To exit the venv, execute the deactivate command.

Maybe there is a way to make all of this happen without the user having to be aware of all of the details. I will continue to research. Let me know your thoughts.

Ricks-Lab commented 5 years ago

I was thinking about modifying amdgpu-chk to check for the existence of the virtual env, create it if needed, and then run the pip install command.
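The check-and-create idea could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that went into amdgpu-chk: the function name `ensure_venv` and its parameters are hypothetical, while `venv.create` and `python -m pip install -r` are the standard library call and pip invocation.

```python
import subprocess
import sys
import venv
from pathlib import Path


def ensure_venv(env_dir, requirements=None, with_pip=True):
    """Create the virtual environment at env_dir if it is missing.

    If a requirements file is given, install it with the environment's
    own interpreter. Returns the path to that interpreter.
    """
    env_dir = Path(env_dir)
    if sys.platform == "win32":
        python = env_dir / "Scripts" / "python.exe"
    else:
        python = env_dir / "bin" / "python"

    # Only build the environment when its interpreter is not there yet.
    if not python.exists():
        venv.create(env_dir, with_pip=with_pip)

    if requirements is not None:
        # Same options as the manual pip command shown earlier in the thread.
        subprocess.run(
            [str(python), "-m", "pip", "install", "--no-cache-dir",
             "-r", str(requirements)],
            check=True,
        )
    return python
```

A call such as `ensure_venv("amdgpu-env", "requirements.txt")` from the project root would mirror the manual steps. The caller would still need to run the utilities with the returned interpreter, since a script cannot activate an environment for the user's shell.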

Ricks-Lab commented 5 years ago

@csecht I have made significant code updates: one minor bug fix, no new features, and lots of PEP8 style updates. I have also added a requirements.txt file for pip install and enabled the use of venv, which I find quite useful. It is the latest on master. Let me know if you find any issues.

csecht commented 5 years ago

All seems fine on my local Linux host, but amdgpu-plot doesn't work on my remote host that has the same system, Ubuntu 18.04.3. This is what I get:

~/Desktop/amdgpu-utils-master$ ./amdgpu-plot
Traceback (most recent call last):
  File "./amdgpu-plot", line 61, in <module>
    import pandas as pd
  File "/usr/lib/python3/dist-packages/pandas/__init__.py", line 58, in <module>
    from pandas.io.api import *
  File "/usr/lib/python3/dist-packages/pandas/io/api.py", line 19, in <module>
    from pandas.io.packers import read_msgpack, to_msgpack
  File "/usr/lib/python3/dist-packages/pandas/io/packers.py", line 68, in <module>
    from pandas.util._move import (
ValueError: module functions cannot set METH_CLASS or METH_STATIC

(But -plot hadn't been working on the remote host prior to the recent master either, because I was never able to get pandas installed.) All other amdgpu-utils seem okay on the remote host. I tried installing the requirements.txt file on the remote host and got an error:

~/Desktop/amdgpu-utils-master$ sudo -H pip3 install --no-cache-dir -r requirements.txt
[sudo] password for craig: 
Requirement already satisfied: cycler==0.10.0 in /usr/lib/python3/dist-packages (from -r requirements.txt (line 1))
Collecting kiwisolver==1.1.0 (from -r requirements.txt (line 2))
  Downloading https://files.pythonhosted.org/packages/f8/a1/5742b56282449b1c0968197f63eae486eca2c35dcd334bab75ad524e0de1/kiwisolver-1.1.0-cp36-cp36m-manylinux1_x86_64.whl (90kB)
    100% |████████████████████████████████| 92kB 606kB/s 
Collecting matplotlib==3.1.1 (from -r requirements.txt (line 3))
  Downloading https://files.pythonhosted.org/packages/57/4f/dd381ecf6c6ab9bcdaa8ea912e866dedc6e696756156d8ecc087e20817e2/matplotlib-3.1.1-cp36-cp36m-manylinux1_x86_64.whl (13.1MB)
    100% |████████████████████████████████| 13.1MB 32.0MB/s 
Collecting numpy==1.17.1 (from -r requirements.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/75/92/57179ed45307ec6179e344231c47da7f3f3da9e2eee5c8ab506bd279ce4e/numpy-1.17.1-cp36-cp36m-manylinux1_x86_64.whl (20.4MB)
    100% |████████████████████████████████| 20.4MB 1.1MB/s 
Collecting pandas==0.25.1 (from -r requirements.txt (line 5))
  Downloading https://files.pythonhosted.org/packages/73/9b/52e228545d14f14bb2a1622e225f38463c8726645165e1cb7dde95bfe6d4/pandas-0.25.1-cp36-cp36m-manylinux1_x86_64.whl (10.5MB)
    100% |████████████████████████████████| 10.5MB 1.3MB/s 
Collecting pkg-resources==0.0.0 (from -r requirements.txt (line 6))
  Could not find a version that satisfies the requirement pkg-resources==0.0.0 (from -r requirements.txt (line 6)) (from versions: )
No matching distribution found for pkg-resources==0.0.0 (from -r requirements.txt (line 6))
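A note on the failure above: a `pkg-resources==0.0.0` line is commonly reported when a requirements file is generated with pip freeze on Debian or Ubuntu. Whether that is what happened here is an assumption. If it is, the entry does not correspond to an installable package and can be filtered out, as in this minimal sketch (the function name is hypothetical):

```python
def drop_placeholder_requirements(lines):
    """Return the requirement lines without any pkg-resources entry.

    Each line is expected in the 'name==version' form used by the
    requirements.txt shown in this thread.
    """
    cleaned = []
    for line in lines:
        # Normalize the name part so pkg_resources and pkg-resources match.
        name = line.split("==")[0].strip().lower().replace("_", "-")
        if name == "pkg-resources":
            continue
        cleaned.append(line)
    return cleaned
```

Applied to the lines of requirements.txt before writing the file back, this leaves all real pins untouched.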

The basics check out, however:

 ~/Desktop/amdgpu-utils-master$ ./amdgpu-chk
Using python 3.6.8
           Python version OK. 
Using Linux Kernel 5.0.0-27-generic
           OS kernel OK. 
AMD GPU driver is driver=amdgpu latency=0
           AMD driver OK. 

The requirements installation worked fine on my local host.

~/Desktop/amdgpu-utils$ sudo -H pip3 install --no-cache-dir -r requirements.txt
Requirement already satisfied: cycler==0.10.0 in /usr/lib/python3/dist-packages (from -r requirements.txt (line 1))
Collecting kiwisolver==1.1.0 (from -r requirements.txt (line 2))
  Downloading https://files.pythonhosted.org/packages/f8/a1/5742b56282449b1c0968197f63eae486eca2c35dcd334bab75ad524e0de1/kiwisolver-1.1.0-cp36-cp36m-manylinux1_x86_64.whl (90kB)
    100% |████████████████████████████████| 92kB 1.9MB/s 
Collecting matplotlib==3.0.3 (from -r requirements.txt (line 3))
  Downloading https://files.pythonhosted.org/packages/e9/69/f5e05f578585ed9935247be3788b374f90701296a70c8871bcd6d21edb00/matplotlib-3.0.3-cp36-cp36m-manylinux1_x86_64.whl (13.0MB)
    100% |████████████████████████████████| 13.0MB 7.4MB/s 
Collecting numpy==1.16.3 (from -r requirements.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/c1/e2/4db8df8f6cddc98e7d7c537245ef2f4e41a1ed17bf0c3177ab3cc6beac7f/numpy-1.16.3-cp36-cp36m-manylinux1_x86_64.whl (17.3MB)
    100% |████████████████████████████████| 17.3MB 2.7MB/s 
Collecting pandas==0.24.2 (from -r requirements.txt (line 5))
  Downloading https://files.pythonhosted.org/packages/19/74/e50234bc82c553fecdbd566d8650801e3fe2d6d8c8d940638e3d8a7c5522/pandas-0.24.2-cp36-cp36m-manylinux1_x86_64.whl (10.1MB)
    100% |████████████████████████████████| 10.1MB 2.3MB/s 
Collecting pyparsing==2.4.0 (from -r requirements.txt (line 6))
  Downloading https://files.pythonhosted.org/packages/dd/d9/3ec19e966301a6e25769976999bd7bbe552016f0d32b577dc9d63d2e0c49/pyparsing-2.4.0-py2.py3-none-any.whl (62kB)
    100% |████████████████████████████████| 71kB 11.1MB/s 
Collecting python-dateutil==2.8.0 (from -r requirements.txt (line 7))
  Downloading https://files.pythonhosted.org/packages/41/17/c62faccbfbd163c7f57f3844689e3a78bae1f403648a6afb1d0866d87fbb/python_dateutil-2.8.0-py2.py3-none-any.whl (226kB)
    100% |████████████████████████████████| 235kB 8.5MB/s 
Collecting pytz==2019.1 (from -r requirements.txt (line 8))
  Downloading https://files.pythonhosted.org/packages/3d/73/fe30c2daaaa0713420d0382b16fbb761409f532c56bdcc514bf7b6262bb6/pytz-2019.1-py2.py3-none-any.whl (510kB)
    100% |████████████████████████████████| 512kB 88.0MB/s 
Collecting ruamel.yaml==0.16.5 (from -r requirements.txt (line 9))
  Downloading https://files.pythonhosted.org/packages/fa/90/ecff85a2e9c497e2fa7142496e10233556b5137db5bd46f3f3b006935ca8/ruamel.yaml-0.16.5-py2.py3-none-any.whl (123kB)
    100% |████████████████████████████████| 133kB 7.8MB/s 
Collecting ruamel.yaml.clib==0.1.2 (from -r requirements.txt (line 10))
  Downloading https://files.pythonhosted.org/packages/96/62/ed93cb8ae7e2ad8c5fe874e8027306aeee0c6a02c04fa015b5f99d14b3db/ruamel.yaml.clib-0.1.2-cp36-cp36m-manylinux1_x86_64.whl (549kB)
    100% |████████████████████████████████| 552kB 18.7MB/s 
Collecting six==1.12.0 (from -r requirements.txt (line 11))
  Downloading https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from kiwisolver==1.1.0->-r requirements.txt (line 2))
Installing collected packages: kiwisolver, six, python-dateutil, numpy, pyparsing, matplotlib, pytz, pandas, ruamel.yaml.clib, ruamel.yaml
  Found existing installation: six 1.11.0
    Not uninstalling six at /usr/lib/python3/dist-packages, outside environment /usr
  Found existing installation: python-dateutil 2.6.1
    Not uninstalling python-dateutil at /usr/lib/python3/dist-packages, outside environment /usr
  Found existing installation: numpy 1.13.3
    Not uninstalling numpy at /usr/lib/python3/dist-packages, outside environment /usr
  Found existing installation: pyparsing 2.2.0
    Not uninstalling pyparsing at /usr/lib/python3/dist-packages, outside environment /usr
  Found existing installation: matplotlib 2.1.1
    Not uninstalling matplotlib at /usr/lib/python3/dist-packages, outside environment /usr
  Found existing installation: pytz 2018.3
    Not uninstalling pytz at /usr/lib/python3/dist-packages, outside environment /usr
Successfully installed kiwisolver-1.1.0 matplotlib-3.0.3 numpy-1.16.3 pandas-0.24.2 pyparsing-2.4.0 python-dateutil-2.8.0 pytz-2019.1 ruamel.yaml-0.16.5 ruamel.yaml.clib-0.1.2 six-1.12.0
Ricks-Lab commented 5 years ago

From the output of amdgpu-chk, it looks like you are not running the latest. You should get warnings concerning venv. Also, have you tried to run in a venv? The latest users guide has details on how to set it up.

csecht commented 5 years ago

Okay, thanks. I did an apt dist-upgrade, reloaded the master, and installed and initialized venv, and now everything in amdgpu-utils is working.

On Sep 9, 2019, at 8:14 PM, Rick notifications@github.com wrote:

From the output of amdgpu-chk, it looks like you are not running the latest. You should get warnings concerning venv. Also, have you tried to run in a venv? The latest users guide has details on how to set it up.


csecht commented 5 years ago

Even with venv running, amdgpu-plot still compresses plots on long runs. This screenshot was after about 1.5 hr. Is it an X-axis scaling issue?

[Image: amdgpu-plot_longrun]

Ricks-Lab commented 5 years ago

This is a known issue that I don't know how to solve yet. I was hoping to come back to it after I gained more matplotlib experience in my next project, but unfortunately, the finance module of matplotlib has been deprecated, so I cannot use it in my new project.

The current implementation redraws the entire plot with every new update: I truncate the dataframe and then re-plot it each time. The preferred approach is to add to the current plot. I need to spend some time researching to figure out how to do that.
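One way the "add to the current plot" idea is often approached is to keep a bounded window of samples and hand it to an existing line object, instead of re-plotting. The sketch below is an assumption about how that could look, not the fix that was later committed. `RollingSeries` is a hypothetical name, and `line` is assumed to be a matplotlib `Line2D`, which provides `set_data(x, y)`.

```python
from collections import deque


class RollingSeries:
    """Fixed-length window of (x, y) samples.

    Once max_points samples are stored, each new sample pushes out
    the oldest one, so the window never grows or shrinks afterwards.
    """

    def __init__(self, max_points):
        self.xs = deque(maxlen=max_points)
        self.ys = deque(maxlen=max_points)

    def append(self, x, y):
        self.xs.append(x)
        self.ys.append(y)

    def push_to(self, line):
        """Update an existing line in place rather than creating a new plot."""
        line.set_data(list(self.xs), list(self.ys))
```

With matplotlib, the line would be created once, for example with `line, = ax.plot([], [])`. After each `push_to(line)`, calling `ax.relim()`, `ax.autoscale_view()`, and `fig.canvas.draw_idle()` rescales the axes and schedules a redraw.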

Ricks-Lab commented 5 years ago

@csecht I spent a full day digging into the amdgpu-plot code and found several issues and opportunities for improvement. I was able to run the new version overnight without issues on one of my systems. Please give it a try and let me know of any issues.

csecht commented 5 years ago

I downloaded the most recent commit, have amdgpu-plot running now, and will let you know. But the self-check at execution of -plot (in amdgpu-utils-env) and of -ls (while not in amdgpu-utils-env) can no longer report the amdgpu version:

AMD Wattman features enabled: 0xffff7fff
amdgpu version: UNKNOWN
2 AMD GPUs detected, 2 may be compatible, checking...
2 are confirmed compatible.
Ricks-Lab commented 5 years ago

I had some debug mods still in place. I removed them and uploaded a fixed version. Thanks!

Ricks-Lab commented 5 years ago

I am planning a release soon. Can you review the utility descriptions in the latest README.md?

csecht commented 5 years ago

Yes, after an overnight run, amdgpu-plot is working perfectly.

csecht commented 5 years ago

Here are README.md edits for consideration:

amdgpu-monitor: May want to add at the end of the section, or in the User Guide, that monitor is shut down with ^C. Is the --plot option necessary, given amdgpu-plot? Also, the --plot option opens both --gui and --plot windows, yet the amdgpu-plot section of the User Guide recommends not running a monitor and a plot function at the same time because of excess system overhead, or something to that effect.

amdgpu-plot: It says, "The --stdin option causes amdgpu-plot to read GPU data from stdin. This is how amdgpu-monitor produces the plot. The benefit of using it in this mode is that both the table and plots are updated with a single read from the driver files." Does this mean that both amdgpu-monitor and amdgpu-plot can be run simultaneously, but from different terminal windows? See above.

In any event, amdgpu-plot --stdin isn't working because it stalls on

amdgpu-plot waiting for initial data.........

unless I'm misunderstanding something about that option.
amdgpu-plot --stdin also executes without displaying the initial system check, as seen with -monitor, -plot, -ls, etc.

amdgpu-pac: Edit "If you have confidence, the --execute_pac option can be used to execute the bash file when saved and then delete it." to, "If you have confidence, the --execute_pac option can be used to execute the bash file when saved; once executed the file is automatically deleted."

amdgpu-pciid: All looks good here. I just want to crow that I added a PCI ID database entry for a "RX 560D OEM OC 2 GB" card, which is in one of my hosts. The name that the PCI ID moderator decided on is longer than what I proposed (and the "GB" part is truncated in the amdgpu-monitor window), but the important bits are displayed.

Ricks-Lab commented 5 years ago

> Here are README.md edits for consideration:
>
> amdgpu-monitor: May want to add at the end of the section, or in the User Guide, that monitor is shut down with ^C. Is the --plot option necessary, given amdgpu-plot? Also, the --plot option opens both --gui and --plot windows, yet the amdgpu-plot section of the User Guide recommends not running a monitor and a plot function at the same time because of excess system overhead, or something to that effect.

If you run amdgpu-monitor with the --plot option, a single read of the GPU status is used to update both the plot and monitor. If you run them separately, then both tools will query the GPU resulting in twice as many reads.

> amdgpu-plot: It says, "The --stdin option causes amdgpu-plot to read GPU data from stdin. This is how amdgpu-monitor produces the plot. The benefit of using it in this mode is that both the table and plots are updated with a single read from the driver files." Does this mean that both amdgpu-monitor and amdgpu-plot can be run simultaneously, but from different terminal windows? See above.
>
> In any event, amdgpu-plot --stdin isn't working because it stalls on
>
> amdgpu-plot waiting for initial data.........
>
> unless I'm misunderstanding something about that option. amdgpu-plot --stdin also executes without displaying the initial system check, as seen with -monitor, -plot, -ls, etc.

When using the --stdin option, you must pipe data into the process:

cat logfile | ./amdgpu-plot --stdin --simlog

I have modified both the plot and monitor tools to make things more clear.
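The stall csecht saw is what a stdin reader does when nothing is piped in: it blocks until data arrives or the stream closes. A minimal sketch of such a reader follows. The function name and the one-record-per-line format are assumptions for illustration, not the actual amdgpu-plot input format.

```python
def read_records(stream):
    """Yield one non-empty, stripped line at a time until the stream closes.

    When stream is sys.stdin and nothing is piped in, iteration simply
    waits, which looks like a stall to the user.
    """
    for raw in stream:
        line = raw.strip()
        if line:
            yield line
```

A consumer would call `read_records(sys.stdin)` and receive records as the upstream process, or a `cat logfile` pipe, writes them.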

> amdgpu-pac: Edit "If you have confidence, the --execute_pac option can be used to execute the bash file when saved and then delete it." to, "If you have confidence, the --execute_pac option can be used to execute the bash file when saved; once executed the file is automatically deleted."

Good catch. I have modified.

> amdgpu-pciid: All looks good here. I just want to crow that I added a PCI ID database entry for a "RX 560D OEM OC 2 GB" card, which is in one of my hosts. The name that the PCI ID moderator decided on is longer than what I proposed (and the "GB" part is truncated in the amdgpu-monitor window), but the important bits are displayed.

When I suggested a change, it was accepted as is. I guess it depends on which moderator checks your input.

I have modified the README.md and the docstrings of all utilities.

Ricks-Lab commented 5 years ago

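I do have two other major changes in the plot and monitor utilities:

- I found that on my 5 GPU system, there was a significant chance that the GPUs were being read when the close window was selected, which would cause an error. I found and implemented an easy fix for this.
- I also found that the monitor window would update sporadically when --plot was used on my 5 GPU system. I fixed this by buffering data writes to the plot process and using flush after writing all GPUs. This should also improve performance of systems with fewer GPUs.

Let me know if you see any issues with the latest on master.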

csecht commented 5 years ago

It looks good. Yes, plot loads much faster now. Should I add a systemd approach to the Setting GPU Automatically at Startup section of the User Guide? I had posted earlier somewhere that the cron approach can be a little flaky and I’ve been having good success using a systemd service to run PAC bash scripts at startup. For either approach, the Card # instability is an issue, so that needs to be addressed in this section also, sometime.

On Sep 14, 2019, at 8:54 AM, Rick notifications@github.com wrote:

I do have two other major changes in the plot and monitor utilities:

- I found that on my 5 GPU system, there was a significant chance that the GPUs were being read when the close window was selected, which would cause an error. I found and implemented an easy fix for this.
- I also found that the monitor window would update sporadically when --plot was used on my 5 GPU system. I fixed this by buffering data writes to the plot process and using flush after writing all GPUs. This should also improve performance of systems with fewer GPUs.

Let me know if you see any issues with the latest on master.
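The buffering change described above can be pictured with a short sketch: write one line per GPU to the pipe, then flush once so the plot process receives a complete set of readings at the same time. The function name and the record strings are hypothetical; this is not the code from amdgpu-monitor.

```python
def send_readings(pipe, readings):
    """Write one line per GPU reading, then flush a single time.

    Flushing once after the last GPU, rather than after each write,
    delivers the whole update to the reader in one step.
    """
    for reading in readings:
        pipe.write(reading + "\n")
    pipe.flush()
```

Here `pipe` would typically be the stdin of a `subprocess.Popen` object opened in text mode with `stdin=subprocess.PIPE`.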


Ricks-Lab commented 5 years ago

Thanks for checking it out.

I think the systemd approach would be a good addition to the user guide. I have started my travels to the US, so I won’t do the release for at least a week.