coraxx / netdata_nv_plugin

NetData plugin for Nvidia GPU stats

Cannot enable plugin #15

Closed MaxMatti closed 3 years ago

MaxMatti commented 3 years ago

I'm not sure whether I'm doing something wrong, whether this is an issue with netdata or with this plugin, or whether my card (2070S) is not supported. I installed this plugin, but no new chart section shows up in netdata. I expected a new section containing the GPU temperature, fan speed, etc.

My system:

Ryzen 3700X on B450 and an RTX 2070 SUPER
CPU: AMD Ryzen 3700X
RAM: 32 GB DDR4-3600
Motherboard: ASRock Fatal1ty B450 Gaming-ITX/AC AMD B450
GPU: 8GB Zotac Gaming GeForce RTX 2070 SUPER AMP, GDDR6, HDMI, 3x DP (ZT-T20710D-10P) (in case that's relevant)

Archlinux (last update and reboot about 2h before opening this issue)
Netdata v1.29.3

What I did:

git clone https://github.com/Splo0sh/netdata_nv_plugin --depth 1
cd netdata_nv_plugin
sudo cp nv.chart.py /usr/lib/netdata/python.d
sudo cp python_modules/pynvml.py /usr/lib/netdata/python.d/python_modules
sudo cp nv.conf /etc/netdata/python.d
sudo systemctl restart netdata.service

Chart list after running above commands and refreshing netdata:

image

I then thought I had to enable plugins manually:

cd /etc/netdata
sudo ./edit-config netdata.conf
sudo systemctl restart netdata.service

I appended this snippet to the config:

[plugins]
    python.d = yes

Then I reloaded again but still didn't find any new chart section.

Searching for "nvidia" or "temperature" also did not lead me to any sections that weren't previously there.
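To double-check that the `[plugins]` snippet actually ended up in the right section of netdata.conf, something like the following could be used. This is only a rough sketch: the helper name `plugin_setting` is made up for illustration, the parsing is a simplified reading of the INI-style layout shown above, and netdata itself may treat duplicates or comments differently.

```python
def plugin_setting(conf_text, name, section="plugins"):
    """Return the value of `name` inside `[section]`, or None if it is not set."""
    current = None
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and commented-out options.
        if not line or line.startswith("#"):
            continue
        # A line like "[plugins]" starts a new section.
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()
            continue
        if current == section and "=" in line:
            key, _, val = line.partition("=")
            if key.strip() == name:
                # Assumption of this sketch: a later duplicate overrides an
                # earlier one; netdata may resolve duplicates differently.
                value = val.strip()
    return value


if __name__ == "__main__":
    # Sample text only; on a real system read /etc/netdata/netdata.conf instead.
    sample = "[plugins]\n    python.d = yes\n"
    print(plugin_setting(sample, "python.d"))
```

A result of `None` would mean the option was appended outside the `[plugins]` section or is commented out.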

MaxMatti commented 3 years ago

I should've mentioned running nvidia-smi works:

Sat Feb 27 19:28:45 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 207...  Off  | 00000000:0A:00.0  On |                  N/A |
| 57%   30C    P8     3W / 215W |    377MiB /  7979MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       971      G   /usr/lib/Xorg                     231MiB |
|    0   N/A  N/A      1704      G   /usr/bin/kwin_x11                   2MiB |
|    0   N/A  N/A      1773      G   /usr/bin/plasmashell               90MiB |
|    0   N/A  N/A      1953      G   /usr/bin/nextcloud                  8MiB |
|    0   N/A  N/A      2088      G   ...akonadi_archivemail_agent        2MiB |
|    0   N/A  N/A      2106      G   .../akonadi_mailfilter_agent        2MiB |
|    0   N/A  N/A      2111      G   ...n/akonadi_sendlater_agent        2MiB |
|    0   N/A  N/A      2112      G   ...nadi_unifiedmailbox_agent        2MiB |
|    0   N/A  N/A     16959      G   ...AAAAAAAA== --shared-files       25MiB |
+-----------------------------------------------------------------------------+

coraxx commented 3 years ago

Hi, sadly I currently do not have an NVIDIA card in a Linux system. But I just noticed something: I checked the NetData installation on one of my Debian machines, and the paths now differ from the README here. For example, python.d is now under '/usr/lib/netdata/conf.d/python.d' (conf.d added).

Also it seems that python modules live somewhere else now: /usr/libexec/netdata/python.d/python_modules/

If that makes it work, I have to update the README. It would be a big help if you could copy the files to the appropriate places and try again :)

coraxx commented 3 years ago

I think I'm confusing myself with the paths now. These are the differences between your installation and the usual paths:

Readme:
/usr/libexec/netdata/python.d/
/usr/libexec/netdata/python.d/python_modules/

Yours:
/usr/lib/netdata/python.d
/usr/lib/netdata/python.d/python_modules

Can you recheck? Maybe the scripts just ended up in the wrong folders :)

MaxMatti commented 3 years ago

Yes, I did modify the paths, because there's no /usr/libexec/netdata in my filesystem. Should I create that folder instead?

I checked by running this:

$ sudo find / -name "python.d" 2>/dev/null
/etc/netdata/python.d
/usr/lib/netdata/conf.d/python.d
/usr/lib/netdata/python.d
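Since the plugin directory differs between installs (/usr/libexec/netdata in the README, /usr/lib/netdata here), a small script could report which candidate actually exists before copying files. This is a minimal sketch, not part of the plugin: the candidate list only contains the two paths mentioned in this thread, and the helper name `find_plugin_dirs` is hypothetical.

```python
import os

# Candidate locations taken from the paths mentioned in this thread;
# other distributions may use different ones.
CANDIDATES = [
    "/usr/libexec/netdata/python.d",
    "/usr/lib/netdata/python.d",
]


def find_plugin_dirs(candidates=CANDIDATES):
    """Return (plugin_dir, python_modules_dir) for every candidate that exists."""
    found = []
    for path in candidates:
        if os.path.isdir(path):
            found.append((path, os.path.join(path, "python_modules")))
    return found


if __name__ == "__main__":
    dirs = find_plugin_dirs()
    if not dirs:
        print("no known python.d plugin directory found")
    for plugin_dir, modules_dir in dirs:
        print("copy nv.chart.py to:", plugin_dir)
        print("copy pynvml.py to:  ", modules_dir)
```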
coraxx commented 3 years ago

Maybe a debug run can tell us more. Try: /usr/lib/netdata/plugins.d/python.d.plugin nv debug trace

MaxMatti commented 3 years ago

Seems to me like the 2070 SUPER is somewhere marked as not supported?

$ /usr/lib/netdata/plugins.d/python.d.plugin nv debug trace
2021-03-02 00:01:14: python.d INFO: plugin[main] : using python v3
2021-03-02 00:01:14: python.d DEBUG: plugin[main] : looking for 'python.d.conf' in ['/etc/netdata', '/usr/lib/netdata/conf.d']
2021-03-02 00:01:14: python.d DEBUG: plugin[main] : loading '/usr/lib/netdata/conf.d/python.d.conf'
2021-03-02 00:01:14: python.d DEBUG: plugin[main] : '/usr/lib/netdata/conf.d/python.d.conf' is loaded
2021-03-02 00:01:14: python.d DEBUG: plugin[main] : looking for 'pythond-jobs-statuses.json' in /var/lib/netdata
2021-03-02 00:01:14: python.d DEBUG: plugin[main] : loading '/var/lib/netdata/pythond-jobs-statuses.json'
2021-03-02 00:01:14: python.d WARNING: plugin[main] : error on loading '/var/lib/netdata/pythond-jobs-statuses.json' : PermissionError(13, 'Permission denied')
2021-03-02 00:01:14: python.d DEBUG: plugin[main] : [nv] looking for 'nv.conf' in ['/etc/netdata/python.d', '/usr/lib/netdata/conf.d/python.d']
2021-03-02 00:01:14: python.d DEBUG: plugin[main] : [nv] loading '/etc/netdata/python.d/nv.conf'
2021-03-02 00:01:14: python.d DEBUG: plugin[main] : [nv] '/etc/netdata/python.d/nv.conf' is loaded
2021-03-02 00:01:14: python.d INFO: plugin[main] : [nv] built 1 job(s) configs
2021-03-02 00:01:14: python.d INFO: nv[nv] : 'nvMemFactor' set to: 1
2021-03-02 00:01:14: python.d INFO: nv[nv] : Nvidia Driver Version: b'460.56'
2021-03-02 00:01:14: python.d DEBUG: nv[nv] : Unit count: 0
2021-03-02 00:01:14: python.d DEBUG: nv[nv] : Device count 1
2021-03-02 00:01:14: python.d DEBUG: nv[nv] : Not Supported
2021-03-02 00:01:14: python.d DEBUG: nv[nv] : Device 0 : b'GeForce RTX 2070 SUPER'
2021-03-02 00:01:14: python.d WARNING: plugin[main] : nv[nv] : unhandled exception on check : IndexError('list index out of range'), skipping the job
2021-03-02 00:01:14: python.d INFO: plugin[main] : no jobs to serve
2021-03-02 00:01:14: python.d INFO: plugin[main] : exiting from main...
$ ls -hal /var/lib/netdata/pythond-jobs-statuses.json
-rw-rw---- 1 netdata netdata 46  2. Mär 00:00 /var/lib/netdata/pythond-jobs-statuses.json
$ sudo cat /var/lib/netdata/pythond-jobs-statuses.json
{
  "sensors": {
    "sensors": "active"
  }
}

coraxx commented 3 years ago

Can you please comment out or delete line 342, self.debug("Brand:", str(brands[brand])), in nv.chart.py and try again?
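Instead of removing the line entirely, the lookup could also be made tolerant of brand ids the list does not know about. The sketch below assumes the IndexError in the log comes from indexing a fixed-length brand list with an id beyond its end. The `BRANDS` list and the `brand_name` helper are hypothetical stand-ins, not the actual contents of nv.chart.py.

```python
# Hypothetical brand table for illustration; the real list in nv.chart.py
# may have different entries and a different length.
BRANDS = ["Unknown", "Quadro", "Tesla", "NVS", "GRID", "GeForce"]


def brand_name(brand_index, brands=BRANDS):
    """Return a brand label, falling back to a placeholder instead of raising."""
    if isinstance(brand_index, int) and 0 <= brand_index < len(brands):
        return brands[brand_index]
    # Ids outside the table (for example from newer drivers) end up here.
    return "Unknown (brand id {})".format(brand_index)


if __name__ == "__main__":
    print("Brand:", brand_name(5))
    print("Brand:", brand_name(12))
```

With a fallback like this, an unexpected id would only produce an odd debug line rather than making the whole job fail its check.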

MaxMatti commented 3 years ago

Seems like I should've waited a few minutes for a response before going to bed...

$ /usr/lib/netdata/plugins.d/python.d.plugin nv debug trace
2021-03-02 15:58:02: python.d INFO: plugin[main] : using python v3
2021-03-02 15:58:02: python.d DEBUG: plugin[main] : looking for 'python.d.conf' in ['/etc/netdata', '/usr/lib/netdata/conf.d']
2021-03-02 15:58:02: python.d DEBUG: plugin[main] : loading '/usr/lib/netdata/conf.d/python.d.conf'
2021-03-02 15:58:02: python.d DEBUG: plugin[main] : '/usr/lib/netdata/conf.d/python.d.conf' is loaded
2021-03-02 15:58:02: python.d DEBUG: plugin[main] : looking for 'pythond-jobs-statuses.json' in /var/lib/netdata
2021-03-02 15:58:02: python.d DEBUG: plugin[main] : loading '/var/lib/netdata/pythond-jobs-statuses.json'
2021-03-02 15:58:02: python.d WARNING: plugin[main] : error on loading '/var/lib/netdata/pythond-jobs-statuses.json' : PermissionError(13, 'Permission denied')
2021-03-02 15:58:02: python.d DEBUG: plugin[main] : [nv] looking for 'nv.conf' in ['/etc/netdata/python.d', '/usr/lib/netdata/conf.d/python.d']
2021-03-02 15:58:02: python.d DEBUG: plugin[main] : [nv] loading '/etc/netdata/python.d/nv.conf'
2021-03-02 15:58:02: python.d DEBUG: plugin[main] : [nv] '/etc/netdata/python.d/nv.conf' is loaded
2021-03-02 15:58:02: python.d INFO: plugin[main] : [nv] built 1 job(s) configs
2021-03-02 15:58:02: python.d INFO: nv[nv] : 'nvMemFactor' set to: 1
2021-03-02 15:58:02: python.d INFO: nv[nv] : Nvidia Driver Version: b'460.56'
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : Unit count: 0
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : Device count 1
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : Not Supported
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : Device 0 : b'GeForce RTX 2070 SUPER'
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : b'GeForce RTX 2070 SUPER' Temp      : 37
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : b'GeForce RTX 2070 SUPER' Mem total : 8366784512 bytes
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : b'GeForce RTX 2070 SUPER' Mem used  : 669581312 bytes
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : b'GeForce RTX 2070 SUPER' Mem free  : 7697203200 bytes
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : b'GeForce RTX 2070 SUPER' Load GPU  : 1 %
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : b'GeForce RTX 2070 SUPER' Load MEM  : 10 %
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : b'GeForce RTX 2070 SUPER' Load ENC  : 0 %
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : b'GeForce RTX 2070 SUPER' Load DEC  : 0 %
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : b'GeForce RTX 2070 SUPER' Core clock: 300 MHz
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : b'GeForce RTX 2070 SUPER' SM clock  : 300 MHz
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : b'GeForce RTX 2070 SUPER' Mem clock : 405 MHz
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : b'GeForce RTX 2070 SUPER' Fan speed : 58 %
2021-03-02 15:58:02: python.d DEBUG: nv[nv] : b'GeForce RTX 2070 SUPER' ECC errors: None
2021-03-02 15:58:02: python.d INFO: nv[nv] : Graphics Card(s) found: b'GeForce RTX 2070 SUPER' [0]
2021-03-02 15:58:02: python.d INFO: plugin[main] : nv[nv] : check success
2021-03-02 15:58:02: python.d WARNING: plugin[main] : nv[nv] : registration failed: [Errno 13] Permission denied: '/var/lib/netdata/lock/nv.collector.lock', skipping the job
2021-03-02 15:58:02: python.d INFO: plugin[main] : no jobs to serve
2021-03-02 15:58:02: python.d INFO: plugin[main] : exiting from main...

coraxx commented 3 years ago

Okay, looks good :D I guess the permission error comes from the debug run being executed as a different user than the one netdata runs as. Have you tried restarting netdata to see whether it shows up now?
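To confirm that the permission errors are only an artifact of the debug run's user, one could check what the current user is allowed to do with the files named in the log. This is a rough sketch: the helper name `access_report` is made up, and the two paths are simply copied from the log output above.

```python
import os


def access_report(paths):
    """Map each path to whether the current user can see, read, and write it."""
    report = {}
    for path in paths:
        report[path] = {
            # Note: this can be False for an existing file if a parent
            # directory is not searchable by the current user.
            "exists": os.path.exists(path),
            "readable": os.access(path, os.R_OK),
            "writable": os.access(path, os.W_OK),
        }
    return report


if __name__ == "__main__":
    print("running as uid:", os.geteuid())
    # Paths copied from the debug log in this thread.
    checked = access_report([
        "/var/lib/netdata/pythond-jobs-statuses.json",
        "/var/lib/netdata/lock",
    ])
    for path, flags in checked.items():
        print(path, flags)
```

Running it once as the normal user and once as the netdata user should show whether the two differ.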

MaxMatti commented 3 years ago

It does show up, sorry for the delayed reply. A cups section that wasn't there before showed up as well:

image

Thank you very much for your help!

Edit: It also seems to deliver the correct values: image

coraxx commented 3 years ago

Wonderful news. I will update the repo with the fix soon. I have no idea about the cups thing yet. What metrics does it show? ^^

MaxMatti commented 3 years ago

It just shows some metrics related to printers, so not really relevant to me. Knowing nothing about netdata, I suspect it runs after the nvidia module and thus wasn't executed previously because that thread never got that far.

screenshot image