Closed. luisalvarado closed this issue 8 months ago.
The wiki says "It comes down to whether sensor information [for the GPU] is available in /proc or /sys", and as far as I know, the proprietary nvidia driver allows querying the temperature sensor through the nvidia-smi command only, it's not in /proc or /sys.
The wiki also says "Vitals uses asynchronous file reading to help prevent your system from lagging on every refresh. Executing a command line utility would cause unresponsiveness and is not something I want Vitals to exhibit." But I think it might be possible to prevent this unresponsiveness by calling the nvidia-smi tool through the async gio SubProcess or SubProcessLauncher.
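As a rough illustration of that idea, a persistent nvidia-smi process could be read asynchronously from GJS along these lines. This is only a sketch with illustrative names, not Vitals' actual code; it assumes the `-l` loop option so one long-lived process emits readings instead of spawning a new process per refresh:

```javascript
// Sketch only: illustrative GJS (GNOME JavaScript), not Vitals' actual code.
// Spawns nvidia-smi once and reads its stdout asynchronously, so the
// shell's main loop never blocks waiting on the subprocess.
const { Gio } = imports.gi;

function startGpuPolling(onReading) {
    // -l 5 makes one long-lived nvidia-smi process print a reading every
    // 5 seconds, avoiding a new subprocess on each refresh.
    const proc = new Gio.Subprocess({
        argv: ['nvidia-smi',
               '--query-gpu=temperature.gpu,fan.speed',
               '--format=csv,noheader,nounits', '-l', '5'],
        flags: Gio.SubprocessFlags.STDOUT_PIPE,
    });
    proc.init(null);

    const stdout = new Gio.DataInputStream({
        base_stream: proc.get_stdout_pipe(),
    });

    const readNext = () => {
        // Asynchronous line read: the callback fires when a line arrives.
        stdout.read_line_async(0, null, (stream, res) => {
            const [line] = stream.read_line_finish_utf8(res);
            if (line === null)
                return; // EOF: nvidia-smi exited
            onReading(line); // e.g. "45, 30"
            readNext();
        });
    };
    readNext();
    return proc;
}
```

The key point is that `read_line_async` hands control back to the main loop immediately, so a slow or stuck nvidia-smi cannot freeze the shell the way a synchronous spawn-and-wait would.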
I might have some time to give that a try and submit a pull request if I get it to work. Would that be acceptable to the Vitals maintainers?
@w-flo Thank you for quoting the Wiki. I get this request one or two times per month. At this point, I am definitely open to a pull request.
My previous reason for saying no was based on bug reports about system freezes; in my opinion, shelling out to external commands can sometimes cause this. Users of Vitals are very sensitive to this, but if we add a checkbox in settings (that part I can help with), that allows users to decide whether they want to poll their video card stats.
Thanks for the feedback. The idea of adding this is good, but after reading your comments I am all against it now. Super bad idea if it needs to query nvidia-smi.
Maybe an idea would be to query the GPU-related information at longer intervals, instead of in near real time like the rest of the data in Vitals. Would this also mean Freon is lagging the system or making too many calls?
I attempted to make nvidia-smi queries work without launching a new subprocess every couple of seconds, and it seems to work for me in preliminary testing. See #315
Maybe there is some issue with this approach that I'm missing though.
I will review this in detail tonight. The technique described sounds like a perfect compromise!
I have merged @w-flo's work into the develop branch. I was unable to test myself because my Linux laptop is using Nouveau and I don't see an easy way to switch it to nvidia drivers under Fedora.
@luisalvarado can you test the develop branch and share your findings?
Yes sir. Will test today. Thank you.
Here are my findings:
TEMPERATURE:
FAN SPEED:
FINAL LOOK:
The ones I could see that could be worth it are:
- Current power usage
- Current memory usage
- Current memory speed
- Current clock speed
- Total GPU utilization
If it helps, the line that I use to get the information I normally gather from the video card is:
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,power.draw,clocks.sm,clocks.mem,clocks.gr --format=csv -l 1
And that shows the following (Look at the last options which are the most useful):
Also, icons for the graphics card would be needed. You can see Freon in the images above showing a PCI card as the icon, so confusion between different temperature devices does not arise, like here:
NOTE: For some reason, when I disable/enable Freon, the whole Gnome environment freezes and I need to wait about 5 seconds for it to be functional again. It could be only Freon, but I am mentioning it in case adding graphics card support would create a similar issue. I do not want Vitals to lag the way Freon does.
LASTLY, after testing the Fan Speed and Temperature here:
I found that the Fan Speed was not being updated. It did grab the speed the first time, then went to 0 and stayed there. Even though the fans can turn off when not in use, if they turn back on, Vitals would still show 0.
The GPU Temperature, on the other hand, was changing properly, although a couple of seconds behind, but I fully understand the reasons why it can NOT be in real time.
For the fan speed, you can add the fan.speed parameter to the query, so it looks like this:
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,power.draw,clocks.sm,clocks.mem,clocks.gr,fan.speed --format=csv -l 1
and the outcome would look something like this (I am changing the fan speed so you can see the difference in the image):
Did this help?
Thanks for testing, @luisalvarado! So this is the command I currently use for the Vitals implementation because fan speed and temperature (and voltage, but that's not available in nvidia-smi) are "general purpose" sensor groups in Vitals:
nvidia-smi --query-gpu=name,temperature.gpu,fan.speed --format=csv,noheader,nounits -l [seconds]
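For reference, with `csv,noheader,nounits` each output line of that command is a plain comma-separated record, so turning it into named fields is straightforward. This is a minimal sketch, not Vitals' actual parser, and the sample line is hypothetical:

```javascript
// Minimal sketch: parse one line of
// `nvidia-smi --query-gpu=name,temperature.gpu,fan.speed --format=csv,noheader,nounits`
// output into named fields. Not Vitals' actual code.
function parseGpuLine(line) {
    const [name, temp, fan] = line.split(',').map(s => s.trim());
    return {
        name,                       // GPU model string
        temperatureC: Number(temp), // temperature.gpu is reported in °C
        fanPercent: Number(fan),    // fan.speed is a percentage, not RPM
    };
}

// Hypothetical sample line for illustration:
const reading = parseGpuLine('NVIDIA GeForce GTX 1080, 45, 30');
console.log(reading.name, reading.temperatureC, reading.fanPercent);
```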
I agree that the other sensors, like power usage / GPU memory usage / clock speeds / GPU utilization would be interesting. But we would probably need a new "GPU" main level group in Vitals, next to "Temperature", "Fan", "Memory" and so on, because these sensors don't really fit into any existing group, I believe. For example, the existing "Memory" group focuses on system main memory, so I'm not sure if VRAM could be added to the same group. So the UI could be pretty confusing in the end (I'm afraid I'm a terrible UI designer).
About the fan speed not updating properly and being grayed out in the one screenshot: That means there was an error, maybe a bug in my code? For example, it will happen if you kill the nvidia-smi subprocess spawned by Vitals or if the nvidia-smi process exits for other reasons. Like when reading the process output fails. I haven't implemented an "auto-restart" feature because I was worried about spamming the system with repeated attempts to re-start nvidia-smi, and I guess the process should just stay alive if everything is working correctly.
If this happens without you manually killing the nvidia-smi process, I need to look into it. Was there anything "special" that you did before you noticed this issue, like maybe suspending the system for some time, or changing some nvidia driver settings, or anything that I could do to reproduce the issue? It has never happened to me (only when I "killall nvidia-smi" on purpose for testing). You should be able to restart the background process if this happens again by changing the update interval in Vitals settings, or disabling/re-enabling the experimental Nvidia feature in Vitals settings.
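If an auto-restart ever were added, the "spamming the system" concern could be addressed with a bounded backoff between restart attempts. A sketch of the idea, with illustrative names and limits (this is not what Vitals does today):

```javascript
// Sketch of a bounded-retry policy for restarting a crashed helper process,
// as one possible alternative to "never restart". Names and limits are
// illustrative, not Vitals' actual code.
function nextRestartDelay(attempt, baseMs = 1000, maxMs = 60000) {
    // Exponential backoff: 1 s, 2 s, 4 s, ... capped at maxMs, so a
    // persistently failing nvidia-smi cannot be respawned in a tight loop.
    return Math.min(baseMs * 2 ** attempt, maxMs);
}
```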
Oh my, my apologies, the Fan was not grayed out. It was working; I happened to have the mouse over the option when I took the picture, but at that moment it was showing the proper value. It was only afterwards (a couple of seconds later) that it stopped getting the proper value. Sorry for the confusion.
In regards to suspending or anything, no. Just started taking pictures and showing you the results. Actually while I type it shows like this:
For restarting, actually yes, if I reboot it works again for a couple of seconds. Are you by chance the UI Designer of the app?
I'm not the UI designer; this piece of code that fetches data from nvidia-smi is the first time I've worked with this app's code.
Alright, it's surprising that it just stays at 0%! I will look into this next week. Another issue is that "min" and "avg" fan speeds are now calculated incorrectly, for example 25% is counted as "0.25 RPM", and I'm not really sure what to do about it.
Also, I'm worried about memory use. The gnome-extensions "gjs" process is at 83.4 MB for me, according to gnome-system-monitor, and I don't remember it taking up significant amounts of memory. Maybe that's normal for Vitals though; I had only used it for one day before starting the nvidia-smi related development. But I do wonder if I've introduced some kind of memory leak somewhere, maybe one that only shows up after using it for a longer time.
Normally on video cards (well, Nvidia in this case), the driver does not tell you the actual RPM but the percentage at which the fan is supposed to be running. Fan RPM will vary by model and even within the same card (like here, I have 3 fans but 2 of them run a bit faster than the third). So I would stick with the percentage and not worry about the RPM.
For the gjs, it shows 16MB on my end, but I just started using the computer like an hour ago.
Well, maybe the way fan speed average / minimum / maximum values are calculated should be changed when percentage values are involved, at least. Maybe just ignore percentages in those calculations. Let's see what @corecoding thinks about this :-)
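One way to avoid mixing a 25% fan reading into RPM statistics as "0.25 RPM" would be to aggregate per unit. A minimal sketch of the idea, with illustrative names (not Vitals' actual code):

```javascript
// Sketch: compute min/avg/max separately per unit, so percentage fan
// readings are never folded into RPM statistics. Illustrative only.
function summarize(readings) {
    // readings: e.g. [{ value: 25, unit: '%' }, { value: 1200, unit: 'rpm' }]
    const byUnit = {};
    for (const { value, unit } of readings) {
        (byUnit[unit] = byUnit[unit] || []).push(value);
    }
    const stats = {};
    for (const [unit, values] of Object.entries(byUnit)) {
        stats[unit] = {
            min: Math.min(...values),
            max: Math.max(...values),
            avg: values.reduce((a, b) => a + b, 0) / values.length,
        };
    }
    return stats;
}
```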
Another possible issue: Recently, gnome-shell likes to crash just before my system is going into suspend-to-RAM mode (literally in the last second before suspending): kernel: gnome-shell[1480]: segfault at 20 ip 00007fc845f3c990 sp 00007fff49b8c9a0 error 4 in libmutter-11.so.0.0.0[7fc845e4a000+149000]
That's annoying because all applications crash due to a broken pipe once suspend-to-RAM ends (on Wayland) and I have to re-login. I'm not sure if that could be related to my changes in Vitals? I don't think this happened before, but now it has happened 3 times in 2 days. Maybe something is not handling "async read in gjs is active while going into suspend" so well and it somehow affects the gnome-shell process, even though Vitals should run in the gjs process? Or maybe it's because I'm actively working on the extension files while the extension is loaded and enabled.
Anyway, I'll try to see if I can reproduce the "fan speed doesn't update after it hit 0" issue today.
I just sent a pull request #317 that might fix the issue with nvidia-smi values sometimes not updating. It was pretty rare in my testing and happened only when changing the update interval, so maybe there is some other condition that could trigger it (maybe related to changing settings in the nvidia driver?).
Thank you for the update @w-flo
That was a lot of testing.
Thank you @luisalvarado for noticing that "fails to update" bug! :-) We really tested a lot.
The "gjs uses more memory" thing seems to be fine. Apparently, the extension code actually runs in the gnome-shell process itself, not the gjs "org.gnome.Shell.Extensions" process. I'm not 100% sure what the gjs process does; maybe it runs settings windows. Anyway, I'm pretty sure I hadn't noticed that process before because I didn't interact with extension settings etc. So that should be okay.
So I believe the only remaining known issues are:
Awesome. Well, for the questions, here you go:
What to do about percentage fan speed? Just ignore percentages while calculating the average, min and max values?
Actually, show only the percentage and ignore any other values. For a video card's fans it is rare, difficult, and not very useful to measure the RPM.
Should we add more GPU data like GPU utilization, memory data, and so on, and what's a good UI design for that? Maybe a new sub-menu (group) for each Nvidia GPU installed on the system could work.
Give me a moment for this one. I can hire a designer for it and provide an idea for it with her.
Is there anyone else who has an Nvidia card who can test this dev build?
@corecoding Hi, I tested with an Nvidia 4090 and 1080 last time. Should I test again?
@luisalvarado I haven't made any changes to this feature, I was hoping someone else from the community could give it a try... May take a moment for someone to chime in though, as users have to proactively go to this page
You mean I can go in and submit my pull request?
@corecoding I have a laptop running a 1050ti and a (new) laptop running a 3070ti, both using the Nvidia driver. I'm happy to test it on both, but I'm afraid I've not done any Gnome extension development before, so I am fairly new to this.
If you can give me some hints on how to get the dev branch running and what info you need back from me, I'll spin it up on both this week and see what comes back
Edit: Apologies, I've just spotted instructions in the README, will follow up shortly
@corecoding
1050ti
3070ti
Tested on the 3070ti by running the OpenGL benchmarking test __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glmark2
which showed a jump of 16 °C within a few seconds, so it appears to be updating correctly.
Only thing I'd say, and this is probably outside the scope of this, is that it might be nice to have different icons for different temperature types. I imagine that if you wanted temperatures visible on the top bar, CPU and GPU temp would be a common combination, but with them both using the same thermometer icon it's not immediately clear which is which.
Is there anything else you want me to test?
I was talking to the visual designer I work with, she did some of the icons like this:
These would be the base for the GPU section, and then a smaller SVG next to it showing the temp, fan speed, or clock. The sub-icons would be tied to the GPU icons here so anyone looking at them knows "Ah okay, this is the GPU temp" and does not confuse it with the CPU temp. Same for clock, fan speed, etc.
Moved the code to the https://github.com/corecoding/Vitals/tree/nvidia branch, I released support for gnome 44 today and needed to use that branch because it had other changes.
Hey, no worries @corecoding, thank you for that. Do note, if I update Vitals while I am using Xorg, it will crash.
I am guessing this is still being tested. I will check out the nvidia branch. Thank you.
https://github.com/corecoding/Vitals/wiki/Broken-after-update should help.
I don't know if tests are still needed, but I tried the nvidia branch now and it looks good; it updates immediately when running a game.
@luisalvarado @w-flo I pushed some changes to the nvidia branch. I think the best way to handle GPU metrics is by adding a dedicated GPU section. This is rough draft code, but I wanted to share it to encourage contribution. I will make the menu dropdown say GPU instead of Gpu.
Can you see if my changes are working? I'd love to see a screenshot with the GPU sensor values being in the GPU section.
@corecoding sure thing, boss. Can you give me a day to fix an issue here, then I will do a video for you so you can check everything on it. If there is anything else, let me know; I will be happy to help and test. The tests will be done on an RTX 2060 laptop, a GTX 1080 PC, and an RTX 4090 PC.
@corecoding I can see the menu item, but nothing's in it. GPU temp is also still showing in the Temperature section. Is that expected at this stage?
I'm also getting a NaN for GPU fan, but not sure if that's an unrelated issue.
All data displayed seems to be correct. I'm running a strange fanless RTX A2000; the temp on it is 37 °C, and "Fan 0 RPM" is also correct, as at the time of the screenshot the fan was off (I'm running a semi-passive PC with a single fan that kicks in when the CPU temp is above 50 °C). I don't know what "Gpu: No Data" means.
Tested on Gnome 44 / Fedora 38
I'm not sure what's happened since the last update, but I switched to the nvidia branch and now the GPU has disappeared from the temperature menu, and I have a new GPU menu showing "No Data".
This was working on the old branch https://github.com/corecoding/Vitals/issues/313#issuecomment-1450180725
Sorry for being so absent; I was a beta tester for Ubuntu 23.04. Okay, so on Ubuntu 23.04, what I am seeing is that the option for enabling Nvidia will automatically turn itself off even if you enable it:
If you enable it, even though it shows enabled, it is actually off. Because of this, it does not show the Nvidia option in the menu.
@luisalvarado, I don't have this issue in Fedora 38.
I had to reboot (The option as mentioned above would auto disable itself), but afterwards it shows like this:
I am available to share my screen and test with anyone if they like and if that would help. My email is luisalvarado@ubuntu.com
There are 2 cards that can be used to test. A GTX 1080 and a RTX 4090.
In case it helps, I am on Ubuntu 23.04 with Nvidia driver 530.41.03, so maybe this has to do with the GPU saying "No Data". Does the extension have an error log somewhere that I could look at or play around with? This is a fresh install of Ubuntu again, but "No Data" still shows.
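For what it's worth, GNOME Shell extensions log to the systemd journal rather than to a file of their own, so something like the following should surface any errors from Vitals (exact behavior can vary by distro and session type):

```shell
# Follow the gnome-shell journal, where extension log()/logError() output
# and JavaScript errors from extensions appear:
journalctl -f -o cat /usr/bin/gnome-shell

# Extension preferences dialogs run in a separate gjs process;
# their logs show up under:
journalctl -f -o cat /usr/bin/gjs
```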
@corecoding any plans to merge this to main?
@trofosila It is not code complete yet. Not sure if the others want to put more time into this feature?
@xulres sorry, didn't know you were waiting on me! I think your approach sounds perfect.
Btw, I did make a recent change that moved the core Gnome files into a subdirectory in the repo; if you don't want to deal with the (likely) code conflicts, I can clean things up.
Making note of these two GPU related tickets, I am guessing they can be closed when this feature is released.
https://github.com/corecoding/Vitals/issues/366 https://github.com/corecoding/Vitals/issues/359
any update on this?
Guess that's dead also.
This is a community driven feature. I don't have the hardware to test it myself, plus (as mentioned above) using command line utilities to grab sensors can create lags which, at the core, is why I created Vitals. At the time, some other system monitoring tools created lag.
What I am looking for is to have a group of individuals (could be as many as 2 or 3) come in and say yes, this works, or no it doesn't and here is a patch to make it work. Once we get to a point where it works for everyone, I will merge it in.
Hi @corecoding I have in the past tested it on Gnome 44 / Fedora 38 and everything was fine for me (see comment above). I will retest the nvidia branch in the coming days on Debian 12 and Fedora 39. I'll report back. Hope others can confirm it works or report if there are any issues.
I've tested this previously. On March 1st last year it worked great with no visible lag.
But then, after you said you'd moved the branch, I updated and tested again (April 23rd) and it had stopped working, though it wasn't clear why. Both times I posted comments on this issue.
I'm not sure if anything's changed between April and now, but I'll reinstall and check again next week.
I need to redo my computer from scratch. Too much garbage; that's why I have not reported yet.
Just tested; I do not see it anywhere. Is there a special parameter or something I have to do? It was working fine last year, if that helps.
Can you try the develop-nvidia branch? This is the code from Arastais. I tried it and it appeared to be the best yet. I had one issue with N/A values and he has fixed it.
@corecoding Yes now it shows on the settings
But there is no option to select the Nvidia 4090 or anything Nvidia-related. I am using the build from today.
Has this issue been covered in the Wiki?
Is there an existing issue reported already?
Describe the new feature you would like
This is a GREAT TOOL to say the least. But I have been using it together with Freon which is also another great tool. The issue is, I use both because Freon offers the overall CPU temp and the Nvidia card temp as seen here:
This is how it looks right now:
So a great option would be to detect a video card temp, and maybe rename the average CPU temp option, because right now it shows as "Processor 0", which confused me a bit for a 13900k, which has 32 threads.