aristocratos / bpytop

Linux/OSX/FreeBSD resource monitor
Apache License 2.0

[REQUEST] Add GPU monitoring #26

Open tom-doerr opened 4 years ago

tom-doerr commented 4 years ago

Is your feature request related to a problem? Please describe. I'm always frustrated when I have to open another window to monitor the usage of my GPUs.

Describe the solution you'd like I would love to see the GPU utilization and memory usage in bpytop when bpytop detects dedicated GPUs. Users that made the effort to install a dedicated GPU or bought a computer with a dedicated GPU likely care a lot about the GPU usage.

Describe alternatives you've considered An alternative would be to open nvtop or run watch nvidia-smi.
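
For reference, the two figures requested here (utilization and memory usage) are ones `nvidia-smi` can print in machine-readable form. A minimal Python sketch, assuming an NVIDIA card with the proprietary driver and `nvidia-smi` on the PATH; the helper names `parse_gpu_csv` and `query_gpus` are made up for illustration and are not part of bpytop:

```python
import subprocess
from typing import Dict, List, Optional

QUERY = "utilization.gpu,memory.used,memory.total"

def _to_int(field: str) -> Optional[int]:
    """Convert one CSV field; non-numeric values (e.g. unsupported) become None."""
    try:
        return int(field.strip())
    except ValueError:
        return None

def parse_gpu_csv(text: str) -> List[Dict[str, Optional[int]]]:
    """Parse `--format=csv,noheader,nounits` output, one GPU per line."""
    gpus = []
    for line in text.strip().splitlines():
        util, used, total = (_to_int(f) for f in line.split(","))
        gpus.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return gpus

def query_gpus() -> List[Dict[str, Optional[int]]]:
    """Run nvidia-smi once; raises if the tool is missing or exits non-zero."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```

Spawning a process on every refresh is relatively expensive, so this is only meant to show which data is available, not how bpytop should collect it.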

aristocratos commented 3 years ago

@jiwidi Still haven't gotten the GPU I ordered in September last year, but will start work on this when I get it.

BenBE commented 3 years ago

Any viable options to display that information with the nouveau drivers? Got two systems available for testing …

jiwidi commented 3 years ago

@jiwidi Still haven't gotten the GPU I ordered in September last year, but will start work on this when I get it.

oh man, gpu shortage sucks. I know the feeling

aristocratos commented 3 years ago

@BenBE

Any viable options to display that information with the nouveau drivers? Got two systems available for testing …

Probably not unless nvidia-smi is provided by the nouveau driver. And if it isn't, I'm not sure I see much point in writing a separate collection function for it (if possible). The thinking being that if you have a need to monitor your gpu performance, you're probably doing work that would benefit from the performance of the official Nvidia drivers and would be very unlikely to run the nouveau drivers.

BenBE commented 3 years ago

@BenBE

Any viable options to display that information with the nouveau drivers? Got two systems available for testing …

Probably not unless nvidia-smi is provided by the nouveau driver. And if it isn't, I'm not sure I see much point in writing a separate collection function for it (if possible).

As nouveau is built-in to the vanilla kernel, it doesn't make much sense to use utilities that are written to access the proprietary driver's features. One reason why people use the nouveau driver is exactly to not be dependent on closed-source software.

The thinking being that if you have a need to monitor your gpu performance, you're probably doing work that would benefit from the performance of the official Nvidia drivers and would be very unlikely to run the nouveau drivers.

Not everybody is trying to use every last bit of performance in their hardware, and a system monitoring tool should be able to give a good grasp of the system resource usage for as many common situations as possible. And people running nouveau are IMHO quite common (given that nv is more often than not a hassle to set up and maintain).

NB: There seems to be hwmon stuff in sysfs for nouveau … Couldn't find some attributes mentioned to exist for amdgpu there though …

aristocratos commented 3 years ago

@BenBE

As nouveau is built-in to the vanilla kernel, it doesn't make much sense to use utilities that are written to access the proprietary driver's features.

You're saying that it doesn't make sense to use the drivers and utilities provided by the company that made the graphics card? That seems like weird reasoning when the nouveau drivers perform terribly and don't seem to even give the same level of usage statistics.

I have nothing against supporting the nouveau drivers, but I'm pretty sure there is a pretty small minority of people who use bpytop while using these drivers and also care about monitoring the performance of their cards. So it's gonna be pretty low priority.

BenBE commented 3 years ago

@BenBE

As nouveau is built-in to the vanilla kernel, it doesn't make much sense to use utilities that are written to access the proprietary driver's features.

You're saying that it doesn't make sense to use the drivers and utilities provided by the company that made the graphics card?

No, I'm saying that some people prefer to use OpenSource software on their systems.

That seems like weird reasoning when the nouveau drivers perform terribly and don't seem to even give the same level of usage statistics.

They work well enough. In fact I had more trouble with the nv drivers compared to nouveau …

I have nothing against supporting the nouveau drivers, but I'm pretty sure there is a pretty small minority of people who use bpytop while using these drivers and also care about monitoring the performance of their cards. So it's gonna be pretty low priority.

Can live with that as long as there's reasonable support for it …

ccasadei commented 3 years ago

I use GPUs on a daily basis at work to train neural networks on corporate servers. I already use monitoring tools like "nvtop" or "nvidia-smi" to check temperatures and resource occupancy. If I found everything included in "bpytop", it would be ... the TOP ... :-)

ayan-iiitd commented 3 years ago

I would like to know if there is any update on this, maybe a test build?

webbp commented 3 years ago

@ccasadei possibly interesting alternative: gotop --nvidia

antonio258 commented 3 years ago

Any implementation for AMD GPUs?

ccasadei commented 3 years ago

@ccasadei possibly interesting alternative: gotop --nvidia

I do not know... I'm used to using "bpytop" by now. I have customized its interface and am used to using it also to kill processes or change the nice value. I'd rather have everything available as a bpytop plugin.

HarlemSquirrel commented 3 years ago

This sounds cool and I'm happy to help if I can. I have a little python experience and created this Ruby gem https://github.com/HarlemSquirrel/amdgpu-fan-rb

GaetanLepage commented 3 years ago

Hi! Any news on this issue? It could be a really good feature :)

zampierilucas commented 3 years ago

@jorge-barreto haven't found your initial implementation on your fork, any chance you still have it?

jorge-barreto commented 3 years ago

Hey @zampierilucas, I'm unsure what you mean. My fork is many commits behind the main branch, but it currently has the WIP pictured above and below. Steps to repro:

git clone https://github.com/jorge-barreto/bpytop.git
cd bpytop
./bpytop.py

Screenshot from 2021-08-25 12-05-42

Note: this build is not currently stable.

zampierilucas commented 3 years ago

Got it working for Nvidia cards with py3nvml 🥳.

2021-08-27_16-51

@jorge-barreto Do you plan to merge your AMD implementation with upstream? If so, we might be able to work together on creating a common interface for NVIDIA and AMD, or at least on simplifying the codebase. Otherwise, I might scrap the AMD code, as I don't have an AMD card to test it. I've rebased your fork onto aristocratos' master; the rebased code as well as the Nvidia WIP code can be found at https://github.com/zampierilucas/bpytop/tree/nvidia_gpu_monitor.

jorge-barreto commented 3 years ago

Hey @zampierilucas, if this approach (using py3nvml) seems fine to @aristocratos, then I'm more than happy to hop aboard and finish the AMD side of this!

GaetanLepage commented 3 years ago

@zampierilucas I tested your fork and it works really well! Thanks to all of the contributors for this feature. A live graph (like in nvtop) would make it perfect :)

aristocratos commented 3 years ago

@jorge-barreto @zampierilucas Nice work!

If you guys get all the collection functionality working for both amd and nvidia I can open up a new branch to push it to. Then when you guys feel it's working properly (error handling, etc.) I can fix any UI issues present, add graphs, autosizing, size constraint fixes and so on and then merge it in to the main branch.

Does that sound good to you?

Also need to make sure to make py3nvml an optional external dependency since it has nvidia copyright.
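
One common way to keep such a dependency optional in Python is to attempt the import at startup and fall back gracefully. A minimal sketch; the helper `optional_import` and the flag `NVIDIA_SUPPORT` are illustrative names, not code from the branch:

```python
import importlib
from types import ModuleType
from typing import Optional

def optional_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None

# py3nvml is only needed for NVIDIA cards, so a missing install is not an error.
nvml = optional_import("py3nvml.py3nvml")
NVIDIA_SUPPORT = nvml is not None
```

Code paths that touch NVIDIA data can then check `NVIDIA_SUPPORT` first, so users without the package (or without an NVIDIA card) see no change in behaviour.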

zampierilucas commented 3 years ago

@GerbenWelter Great :D just pushed a few fixes, keep on testing. I'll let aristocratos make the final call on the UI, but I would definitely vouch for a graph :)

zampierilucas commented 3 years ago

@aristocratos np, I've done some UI improvements in my fork, so now all boxes should at least interact correctly, but hey, if you want to scramble all that, I don't mind :)

@jorge-barreto I've pushed a new commit that should correctly differentiate between AMD and NVIDIA cards, at least on Linux, so when you have time, please try to undo the mess that I did on the AMD side :p
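
The commit itself is not shown in this thread, so the following is not how the fork does it. It is only one possible approach on Linux, assuming each card appears under `/sys/class/drm` with a `device/vendor` file holding the PCI vendor ID. The function name `detect_gpu_vendors` is hypothetical:

```python
import glob
import os
from typing import Dict

# PCI vendor IDs as exposed by the kernel in sysfs.
PCI_VENDORS = {"0x10de": "nvidia", "0x1002": "amd", "0x8086": "intel"}

def detect_gpu_vendors(drm_root: str = "/sys/class/drm") -> Dict[str, str]:
    """Map each DRM card (card0, card1, ...) to a vendor name."""
    vendors = {}
    for card in sorted(glob.glob(os.path.join(drm_root, "card[0-9]*"))):
        name = os.path.basename(card)
        if "-" in name:  # skip connector entries such as card0-DP-1
            continue
        try:
            with open(os.path.join(card, "device", "vendor")) as f:
                vendor_id = f.read().strip().lower()
        except OSError:
            continue  # no vendor file: not a PCI device we can classify
        vendors[name] = PCI_VENDORS.get(vendor_id, "unknown")
    return vendors
```

The result can then be used to pick the matching collector per card instead of assuming a single vendor for the whole system.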

aristocratos commented 3 years ago

@zampierilucas I was thinking of doing something like this:

image

CPU in top half and GPU in bottom half, then gpu clock, vram, temps, etc. in the smaller box to the right.

Which could be toggled like how disks can be toggled in the mem box. That would avoid obscuring the proc box (which would break on lower terminal sizes when info box is also shown for a process) and also give a better historical view of the gpu usage graph.

I've opened a branch called gpu that can be used until everything is stable.

zampierilucas commented 3 years ago

@aristocratos I think that's a great idea, but IMHO I would like to have a graph of both GPU and gpu_mem usage in that bottom half, would that be doable?

What do you think if I merge my changes, and Jorge's rebased code, to your gpu branch? So the three of us can focus on merging everything on that branch until the code is ready to go to master :)

aristocratos commented 3 years ago

@zampierilucas

I think that's a great idea, but IMHO I would like to have a graph of both GPU and gpu_mem usage in that bottom half, would that be doable?

Can have smaller graphs that expand to fill empty space in the smaller box for any gpu stat that is suitable. Having multiple graphs in the bottom half of the main graph would mean more bloat and special cases and toggles that I'm not sure are worth it. Considering a gpu memory graph will likely be static 90% of the time it might be better suited for a small graph (like the graphs for the cores).

What do you think if I merge my changes, and Jorge's rebased code, to your gpu branch? So the three of us can focus on merging everything on that branch until the code is ready to go to master :)

That was my intention for creating it :) Will add you and @jorge-barreto as collaborators so you can push to the branch without having to wait for my permission.

zampierilucas commented 3 years ago

@aristocratos

Having multiple graphs in the bottom half of the main graph would mean more bloat and special cases and toggles that I'm not sure are worth it.

You're right, I think it's just my used-to-nvtop brain speaking lol

Great, merged my branch into yours :D. I'll focus on stabilizing the Nvidia side, as well as adding multi-gpu support.

yochananmarqos commented 3 years ago

I tried out the gpu branch, but it appears it doesn't like that my GeForce GTX 1660 Ti Mobile doesn't have a fan speed sensor.

error.log

zampierilucas commented 3 years ago

@yochananmarqos That's weird, can you confirm that you're running the Nvidia driver? Also, depending on your distro, you might need the CUDA libraries; I know that I had to install them on my Fedora machine.

yochananmarqos commented 3 years ago

@zampierilucas Yes, I have 470.63.01 installed. Installing Cuda 11.4.1 made no difference.

zampierilucas commented 3 years ago

@yochananmarqos I completely glossed over the fact that it doesn't have a fan sensor. I need to account in the code for GPUs that might not have one.

aristocratos commented 3 years ago

@zampierilucas It might be a good idea to wrap any values that aren't guaranteed to exist in try blocks and return NotImplemented or similar if it fails. That way the draw function also knows what to ignore.
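
A minimal sketch of that idea. The names `safe_read` and `drawable` and the two reader functions are hypothetical stand-ins for the real driver calls:

```python
from typing import Any, Callable, Dict

def safe_read(reader: Callable[..., Any], *args: Any) -> Any:
    """Call a sensor reader; return NotImplemented if the value is unavailable."""
    try:
        return reader(*args)
    except Exception:  # real code would catch the driver's own error type
        return NotImplemented

def drawable(stats: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the stats the draw function should render."""
    return {key: value for key, value in stats.items() if value is not NotImplemented}

# Hypothetical readers standing in for real driver calls.
def read_temp() -> int:
    return 54

def read_fan_speed() -> int:
    raise RuntimeError("no fan sensor on this card")

stats = {"temp": safe_read(read_temp), "fan": safe_read(read_fan_speed)}
```

Here `drawable(stats)` keeps `temp` and drops `fan`, so a card without a fan sensor simply shows one field fewer instead of crashing the collector. Catching a narrower exception than `Exception` would avoid hiding unrelated bugs.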

Supporterino commented 3 years ago

Hey guys, just wanted to give you some feedback as well. I am running an XPS 15 9500 with Ubuntu 20.04 and no GPU stats are showing up for my GTX 1650 Ti mobile. The weird thing is bpytop crashes with this error when I close the menu:

16/09/21 (08:20:36) | ERROR: Uninitialized
Traceback (most recent call last):
  File "./bpytop.py", line 6110, in main
    run()
  File "./bpytop.py", line 6104, in run
    process_keys()
  File "./bpytop.py", line 5855, in process_keys
    Menu.main()
  File "./bpytop.py", line 4542, in main
    nvml.nvmlShutdown()
  File "/home/lars/.local/lib/python3.8/site-packages/py3nvml/py3nvml.py", line 1163, in nvmlShutdown
    fn = _nvmlGetFunctionPointer("nvmlShutdown")
  File "/home/lars/.local/lib/python3.8/site-packages/py3nvml/py3nvml.py", line 736, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_UNINITIALIZED)
py3nvml.py3nvml.NVMLError_Uninitialized: Uninitialized
16/09/21 (08:20:36) | WARNING: Exiting with errorcode (1). Runtime 0:00:24 
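
The traceback above suggests `nvmlShutdown()` was reached without NVML having been initialised successfully. One possible guard, sketched here with a hypothetical `NvmlSession` class that takes the init and shutdown functions as parameters so it can be tried without an NVIDIA card:

```python
from typing import Callable

class NvmlSession:
    """Track init state so shutdown is only called after a successful init."""

    def __init__(self, init: Callable[[], None], shutdown: Callable[[], None]) -> None:
        self._init = init
        self._shutdown = shutdown
        self.active = False

    def start(self) -> bool:
        """Try to initialise; return True on success, False otherwise."""
        try:
            self._init()
            self.active = True
        except Exception:  # e.g. driver or library not available
            self.active = False
        return self.active

    def stop(self) -> None:
        """Shut down once; a no-op if never initialised or already stopped."""
        if not self.active:
            return
        self.active = False
        self._shutdown()
```

With py3nvml this would presumably be constructed as `NvmlSession(nvml.nvmlInit, nvml.nvmlShutdown)`, and the menu code would call `stop()` rather than `nvml.nvmlShutdown()` directly.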

ghost commented 3 years ago

I'm getting an error on the GPU branch, here's my error log: error.log

zampierilucas commented 3 years ago

@yochananmarqos The latest commit should deal with the "No Fan issue". @Supporterino @pjalsGit Can you guys retest with the latest commit?!

yochananmarqos commented 3 years ago

@zampierilucas

error.log

ghost commented 3 years ago

Updated error log: error.log

(By the way, I'm using AMDGPU, should have mentioned that.)

HarlemSquirrel commented 3 years ago

For me the gpu branch fails because it's expecting to find hwmon0 but I only have hwmon1.

18/09/21 (23:42:38) | ERROR: Data collection thread failed with exception: [Errno 2] No such file or directory: '/sys/class/drm/card0/device/hwmon/hwmon0/'
Traceback (most recent call last):                                                                   
  File "/home/hs/code/bpytop/./bpytop.py", line 3069, in _runner                                     
    collector._collect()                                                                             
  File "/home/hs/code/bpytop/./bpytop.py", line 4380, in _collect                                    
    cls._get_gpus()                                                                                  
  File "/home/hs/code/bpytop/./bpytop.py", line 4242, in _get_gpus                                   
    cls._get_stat_nums(gpu[0])                                                                       
  File "/home/hs/code/bpytop/./bpytop.py", line 4190, in _get_stat_nums                              
    with os.scandir(cls._get_hwmon(card)) as files:                                                  
FileNotFoundError: [Errno 2] No such file or directory: '/sys/class/drm/card0/device/hwmon/hwmon0/'

I changed bpytop.py:4422 to

cls.hwmon: str = "/device/hwmon/hwmon1/"

That gets a little farther, but then this error is thrown:

18/09/21 (23:53:37) | ERROR: Data collection thread failed with exception: 'card0'                   
Traceback (most recent call last):                                                                   
  File "/home/hs/code/bpytop/./bpytop.py", line 3069, in _runner                                     
    collector._collect()                                                                             
  File "/home/hs/code/bpytop/./bpytop.py", line 4380, in _collect                                    
    cls._get_gpus()                                                                                  
  File "/home/hs/code/bpytop/./bpytop.py", line 4242, in _get_gpus                                   
    cls._get_stat_nums(gpu[0])                                                                       
  File "/home/hs/code/bpytop/./bpytop.py", line 4196, in _get_stat_nums                              
    arr = cls.stat_nums[card][stat_keys[i]]                                                          
KeyError: 'card0'

I have a Sapphire RX 5700 XT
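
Since the hwmon index is assigned by the kernel and is not guaranteed to be 0, a more robust lookup would search for the directory instead of hardcoding it. A minimal sketch; `find_hwmon` is a hypothetical helper, not the branch's `_get_hwmon`:

```python
import glob
import os
from typing import Optional

def find_hwmon(card: str, drm_root: str = "/sys/class/drm") -> Optional[str]:
    """Return the card's hwmon directory, or None if the driver exposes none."""
    pattern = os.path.join(drm_root, card, "device", "hwmon", "hwmon[0-9]*")
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None
```

Returning `None` rather than raising lets the caller skip cards whose driver provides no hwmon directory at all.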

HarlemSquirrel commented 3 years ago

I forked the repo and made a few changes to the gpu branch to get it loaded up.

I'm not seeing much in the display but most of the stats appear to be loaded based on the logs I added.

image

19/09/21 (15:54:34) | DEBUG: GPU card: card0 name:
19/09/21 (15:54:34) | DEBUG: stat: {'fans': {'fan1': (929, '3300')}, 'freqs': {'freq1': ('306', 'sclk'), 'freq2': ('500', 'mclk')}, 'power': {'power1': <function GpuCollector._get_power.<locals>.<lambda> at 0x7f1d13128ee0>}, 'volts': {'volt0': (900, 'vddgfx')}, 'vitals': {'vram_used': 673341440, 'vram_total': 4294967296, 'vram': 15.677452087402344, 'temp1': ('29', 'edge')}, 'load': {'mem': 0, 'gpu': 0}}
19/09/21 (15:54:34) | DEBUG: stat_nums {'fans': [('1', '3300')], 'freqs': [('1', 'sclk'), ('2', 'mclk')], 'temps': [('1', 'edge')], 'power': ['1'], 'volts': [('0', 'vddgfx')]}
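
The `vram` value in the debug output is consistent with `vram_used / vram_total * 100`. A sketch of how such a figure could be read on amdgpu, assuming the driver's `mem_info_vram_used` and `mem_info_vram_total` sysfs files; whether the branch reads these exact files is not stated in this thread, and both function names are illustrative:

```python
import os
from typing import Optional

def vram_percent(used_bytes: int, total_bytes: int) -> float:
    """Percentage of VRAM in use; 0.0 if the total is unknown."""
    if total_bytes <= 0:
        return 0.0
    return used_bytes / total_bytes * 100

def read_amdgpu_vram(card: str, drm_root: str = "/sys/class/drm") -> Optional[float]:
    """Read VRAM usage from amdgpu's sysfs files; None if they are missing."""
    device = os.path.join(drm_root, card, "device")
    try:
        with open(os.path.join(device, "mem_info_vram_used")) as f:
            used = int(f.read())
        with open(os.path.join(device, "mem_info_vram_total")) as f:
            total = int(f.read())
    except (OSError, ValueError):
        return None
    return vram_percent(used, total)
```

With the numbers from the log, `vram_percent(673341440, 4294967296)` comes out at roughly 15.68.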

Supporterino commented 3 years ago

@zampierilucas Sadly it still shows nothing for me. And when I go into options it crashes again with the same error. Now I also get the following error:

Traceback (most recent call last):
  File "./bpytop.py", line 2667, in _runner
    collector._collect()
  File "./bpytop.py", line 3730, in _collect
    if not cls.populated: cls._get_gpus()
  File "./bpytop.py", line 3675, in _get_gpus
    cls._get_stat_nums(gpu[0])
  File "./bpytop.py", line 3634, in _get_stat_nums
    with os.scandir(cls._get_hwmon(card)) as files:
FileNotFoundError: [Errno 2] No such file or directory: '/sys/class/drm/card0/device/hwmon/hwmon0/'
16/09/21 (08:13:48) | WARNING: Exiting with errorcode (1). Runtime 0:00:01 

16/09/21 (08:14:03) | ERROR: Data collection thread failed with exception: [Errno 2] No such file or directory: '/sys/class/drm/card0/device/hwmon/hwmon0/'
Traceback (most recent call last):
  File "./bpytop.py", line 2667, in _runner
    collector._collect()
  File "./bpytop.py", line 3730, in _collect
    if not cls.populated: cls._get_gpus()
  File "./bpytop.py", line 3675, in _get_gpus
    cls._get_stat_nums(gpu[0])
  File "./bpytop.py", line 3634, in _get_stat_nums
    with os.scandir(cls._get_hwmon(card)) as files:
FileNotFoundError: [Errno 2] No such file or directory: '/sys/class/drm/card0/device/hwmon/hwmon0/'

I checked the path which couldn't be loaded, and it turns out that /sys/class/drm/card0/device doesn't have a hwmon subdirectory.

zampierilucas commented 3 years ago

@HarlemSquirrel Press "5" to open the GPU tab ;)(and '4' twice after to fix the window proportions :p)

HarlemSquirrel commented 3 years ago

@zampierilucas I got the gpu box now! Draw is not working though. image

HarlemSquirrel commented 3 years ago

I made a few more adjustments on my fork to fix the above issues. https://github.com/HarlemSquirrel/bpytop/commit/de7959be6a95b7ad5faec45fa1339127c3939ea7 image

I would like to see a meter of the fan output but the data is different than clocks and mem so it seems to need a different meter format and it's not clear to me how to make that change.
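
One way to fit fan data into a percentage-style meter is to scale the current RPM against the fan's maximum. This sketch assumes the pair `(929, '3300')` in the earlier debug output is (current RPM, maximum RPM), which the thread does not confirm; `fan_percent` and `meter` are illustrative names and the meter is plain ASCII rather than bpytop's own meter class:

```python
def fan_percent(rpm: int, rpm_max: int) -> int:
    """Scale fan speed to 0..100; 0 if the maximum is unknown."""
    if rpm_max <= 0:
        return 0
    return max(0, min(100, round(rpm / rpm_max * 100)))

def meter(percent: int, width: int = 10) -> str:
    """Render a fixed-width text meter for a 0..100 value."""
    filled = round(percent / 100 * width)
    return "#" * filled + "-" * (width - filled)

# The second value arrives as a string in the debug output, hence int().
percent = fan_percent(929, int("3300"))
bar = meter(percent)
```

Clamping keeps the meter valid even if a fan briefly reports more than its nominal maximum.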

zampierilucas commented 3 years ago

@HarlemSquirrel I'm not sure either :thinking:, but don't worry too much about it, some of the data differs between NVIDIA and AMD anyway (e.g. core voltage is only reported on enterprise cards).

Hypoon commented 3 years ago

Just leaving my two-cents here. I look forward to having this capability in bpytop. For my use-cases, it would be ideal to display the load on multiple GPUs over time (much like the CPU charts and meters). I wonder if video-memory utilization could be worked into the existing memory module.

HarlemSquirrel commented 3 years ago

I was really surprised to see this application seems to be one giant file :grimacing: and no dependencies. However, given the complexity of reading GPU metrics it's worth considering moving all that to separate libraries for AMD and Nvidia. I would be happy to help out on the AMD side and could port over much of what I have in https://github.com/HarlemSquirrel/amdgpu-fan-rb/ to Python if this project is interested.

Hypoon commented 3 years ago

However, given the complexity of reading GPU metrics it's worth considering moving all that to separate libraries for AMD and Nvidia.

I agree. Putting the GPU-related code in a separate file may offer practical advantages by helping to isolate optional dependencies. I wouldn't want the introduction of GPU-related features to create headaches for people on headless systems.

ayan-iiitd commented 2 years ago

Is there any update on this? The branch shows it was last updated in Sep 2021.

Patola commented 2 years ago

Yeah, give us GPU monitoring, pleeeeaaase?

jfaldanam commented 2 years ago

Afaik, this is being worked on in the btop project, which from my understanding is where all new development will be focused (maybe someone more involved can confirm this).

It is currently being worked on, as confirmed here.

sanxchep commented 5 months ago

this still on?