Closed Venemo closed 1 year ago
Check cat /sys/class/drm/card*/device/gpu_busy_percent
I'm observing similar issues on my 5600 XT. /sys/class/drm/card*/device/gpu_busy_percent itself is showing wrong values, so it is not a MangoHud issue. That said, other utilities like radeontop or even umr give more reliable results by averaging multiple samples. According to https://www.kernel.org/doc/html/latest/gpu/amdgpu.html, it seems to be a firmware issue, so there is nothing the kernel can do about it. For now I'm using my own workaround with the following code:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int *values;
static int nsamples = 0;
static double timeout = 1.0; // seconds between two samples
static const char *fname, *foutname;

// Read one sample into slot i, then average the first `count` slots.
static void loop(int i, int count) {
    FILE *fp = fopen(fname, "r");
    long sum = 0;
    int j;
    if (fp) {
        // on a failed read the slot keeps its previous value
        if (fscanf(fp, "%d", &values[i]) != 1) fprintf(stderr, "cannot parse %s\n", fname);
        fclose(fp);
    }
    for (j = 0; j < count; j++) sum += values[j];
    double avg = sum / (double) count;
    printf("%0.2f\n", avg); // print continuously
    // ...but dump to a file every time we start to fill a new window:
    if (i == 0) {
        FILE *fpout = fopen(foutname, "w");
        if (fpout) {
            fprintf(fpout, "%f", avg); // writing data into file
            fclose(fpout);
        }
    }
    usleep((useconds_t) (1000000 * timeout));
}

int main(int argc, const char *argv[]) {
    if (argc != 5 || access(argv[3], R_OK) != 0) {
        fprintf(stderr, "Samples values from \"filename-input\" at rate \"samples-per-second\" and smooth them over a moving window sized \"window-size\"\n");
        fprintf(stderr, "dump continuously on screen and less frequently to \"filename-output\"\n\n");
        fprintf(stderr, "Usage: %s <samples-per-second> <window-size> <filename-input> <filename-output>\n", argv[0]);
        return 1;
    }
    double rate = atof(argv[1]);
    nsamples = atoi(argv[2]);
    fname = argv[3];
    foutname = argv[4];
    // fall back to defaults on zero, negative or unparsable arguments
    timeout = rate > 0 ? 1.0 / rate : 1.0;
    if (nsamples <= 0) nsamples = 5;
    values = calloc(nsamples, sizeof(int));
    if (!values) return 1;
    int i;
    // first pass: the window is still filling, so average only what we have
    for (i = 0; i < nsamples; i++) loop(i, i + 1);
    // afterwards: overwrite the oldest slot and average the full window
    while (1) for (i = 0; i < nsamples; i++) loop(i, nsamples);
    return 0;
}
./average_and_mirror 50 50 /sys/class/drm/card0/device/gpu_busy_percent /tmp/averaged >/dev/null
It samples the value 50 times per second, builds a sliding average out of it, and periodically dumps the result to /tmp/averaged. MangoHud picks it up via:
gpu_stats
exec=cat /tmp/averaged
so that on the left I have the old value and on the right the averaged one.
Could this please be implemented in mangohud directly?
Try to build the develop branch yourself for now.
I haven't tested it, because I don't have an AMD GPU right now. However, what about the following in gpu.cpp:
[..]
if (amdgpu.busy) {
    /*
    int value = 0;
    if (fscanf(amdgpu.busy, "%d", &value) != 1)
        value = 0;
    gpu_info.load = value;
    */
    int i, sample, sum = 0, count = 0, samples_max = 50;
    int sleep_time = int(1000000 / samples_max / 2); // <-- spend at most 1/2 second sampling
    for (i = 0; i < samples_max; i++) {
        // the file has to be re-read from the start for every sample
        rewind(amdgpu.busy);
        fflush(amdgpu.busy);
        if (fscanf(amdgpu.busy, "%d", &sample) == 1) {
            sum += sample;
            count++;
        }
        usleep(sleep_time);
    }
    // average only the samples that were actually read
    gpu_info.load = count ? sum / count : 0;
}
[..]
Try to build the develop branch yourself for now.
Whoops, maybe I misunderstood. Is there already some work done to address this issue?
Indeed, it seems to work flawlessly in the develop branch, thanks.
I tested a bit more and it still varies too much. In Dying Light, I'm facing the ground and GPU usage still swings between 50% and 70%, whereas radeontop is between 65% and 70%. Are more samples needed?
MangoHud by default samples for 500 ms with 60 ticks; radeontop samples for 1 s with 120 ticks. That is probably why.
Yeah, I was thinking: why not ignore bogus values and build an average from the other ones? Even with 120 samples per second, the average does not seem trustworthy. Dying Light, facing the ground, and the menu screen, everything static:
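To make the "ticks" idea concrete, here is a minimal sketch in C of tick-based sampling. It assumes each tick is a single busy/idle reading and that the load is simply the share of busy ticks in the window. The names `busy_probe_fn`, `sample_busy_percent` and `demo_probe` are made up for this illustration and are not taken from MangoHud or radeontop.

```c
#include <unistd.h>

/* Hypothetical callback: returns nonzero if the GPU is busy right now. */
typedef int (*busy_probe_fn)(void *ctx);

/* Poll `probe` `ticks` times, spread evenly over `window_us` microseconds,
 * and return the share of busy readings as a percentage (0..100). */
static int sample_busy_percent(busy_probe_fn probe, void *ctx,
                               int ticks, unsigned window_us) {
    if (ticks <= 0) return 0;
    unsigned step = window_us / (unsigned) ticks;
    int busy = 0;
    for (int i = 0; i < ticks; i++) {
        if (probe(ctx)) busy++;
        if (step) usleep(step);
    }
    return busy * 100 / ticks;
}

/* Stand-in probe for demonstration only: busy on 3 of every 4 calls. */
static int demo_probe(void *ctx) {
    int *calls = ctx;
    return ((*calls)++ % 4) != 0;
}
```

With this scheme, a call such as `sample_busy_percent(demo_probe, &calls, 60, 500000)` would correspond to the 500 ms / 60 ticks setting mentioned above. Fewer ticks mean a coarser result, and a shorter window makes the result more sensitive to brief idle periods.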
koko@slimer# time for i in $(seq 1 120) ; do cat /sys/class/drm/card1/device/gpu_busy_percent ; sleep 0.006 ; done | tr \\n ","
99,1,1,3,98,6,97,6,74,99,77,92,66,99,1,1,3,98,24,83,24,87,49,80,99,67,99,1,99,12,99,3,3,12,93,12,58,99,84,99,75,99,16,97,7,97,6,96,24,89,49,65,97,3,98,3,98,2,98,3,98,12,95,24,84,99,1,99,1,99,3,97,6,96,24,89,99,1,99,1,99,3,3,3,98,12,93,99,18,99,35,99,1,99,3,97,10,67,99,1,98,55,99,33,99,1,1,6,97,6,83,49,75,98,60,99,1,99,1,1
real 0m0,924s
user 0m0,134s
sys 0m0,063s
You see, in roughly one second and 120 iterations, there are about 30 values that are wrong, IMHO. I propose to fill an array with samples, build an average out of it, then strip the values that differ too much from that average and build another average without them. Or maybe we have to trust those values and the GPU is effectively sleeping here and there?
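The two-pass average proposed here could look roughly like the following C sketch. The function names and the `max_dev` threshold (in percentage points) are assumptions made for illustration; nothing like this is confirmed to exist in MangoHud.

```c
#include <stddef.h>

/* Plain arithmetic mean of n samples; returns 0 for an empty array. */
static double mean(const int *samples, size_t n) {
    if (n == 0) return 0.0;
    long sum = 0;
    for (size_t i = 0; i < n; i++) sum += samples[i];
    return (double) sum / (double) n;
}

/* Two-pass average: compute the mean, drop samples that are further
 * than max_dev (hypothetical threshold) away from it, then average
 * what is left. Falls back to the plain mean if everything is dropped. */
static double filtered_mean(const int *samples, size_t n, double max_dev) {
    double first = mean(samples, n);
    long sum = 0;
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        double dev = samples[i] - first;
        if (dev < 0) dev = -dev;
        if (dev <= max_dev) {
            sum += samples[i];
            kept++;
        }
    }
    return kept ? (double) sum / (double) kept : first;
}
```

One caveat with this approach: when the readings fall into two clusters, such as values near 1 and values near 99, the first-pass mean lands between them, so a tight threshold can reject both clusters. Filtering around the median instead of the mean might behave better in that case.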
Or maybe we have to trust those values and the gpu is effectively sleeping here and there?
If the register says it is active/idle then it is or it's a hw/fw bug :shrug:
But gpu_busy_percent itself seems to be a best-effort guess by the SMU.
So basically we are left out in the cold. Partially (un)related: I was thinking about adding a feature to MangoHud to read a value from a specified file, so that one can avoid the useless cat calls currently needed with the exec function. Do you think it would be useful, with a chance of being merged? Also, I have never made a PR, so bear with me :)
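As a rough idea of what such a "read a value from a file" helper might do, here is a small C sketch. The function name `read_value_from_file` is hypothetical, and how it would be wired into MangoHud's configuration is left open.

```c
#include <stdio.h>

/* Read a single number from `path` into *out.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
static int read_value_from_file(const char *path, double *out) {
    FILE *fp = fopen(path, "r");
    if (!fp) return -1;
    int ok = fscanf(fp, "%lf", out) == 1;
    fclose(fp);
    return ok ? 0 : -1;
}
```

Parsing as a double covers both plain integer files such as gpu_busy_percent and the floating-point output written to /tmp/averaged by the workaround above, and it avoids spawning a cat process for every update.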
So what exactly is preventing us from using the method radeontop uses to determine GPU usage instead of gpu_busy_percent?
It's clear that the current usage reading is so wildly inaccurate that it's nearly useless. Radeontop's isn't, however. Getting somewhere close to what it does would be an immense improvement on the status quo of the GPU usage indicator.
It already is though? (v0.6.6, thought it was already on v0.6.5, oh well.) Define "wildly inaccurate"? Seems pretty samey with Vega64.
Oh yeah, it's actually gotten a lot better it seems!
However, I'm currently sitting in GW2's Lion's Arch where Mangohud shows usage in the upper 90s while radeontop shows a more reasonable 70-80%. Turning down resolution doesn't improve FPS, so the GPU is definitely not maxed out as Mangohud suggests.
I'm not sure radeontop is 100% correct here but Mangohud is wrong.
Aha! Didn't see your edit in time.
0.6.6 is a lot more accurate and pretty much on-par with radeontop (if a little jumpy). The issue is resolved in my eyes.
I'm running Horizon Zero Dawn from Steam with latest mesa on an AMD RX 6900XT GPU. I use MangoHud to see the frame rate and some stats, because the game doesn't have this feature built-in.
In this game, MangoHud always reports a very low GPU load, which looks very suspicious. So I also looked at GPU load with umr --top and it strongly disagrees with MangoHud. Consider this screenshot: https://drive.google.com/file/d/1WZZj9JQSKQ7zlcC8m922OJpQydmXg5y7/view?usp=sharing
On the left side you can see umr --top, which reports that GPU_LOAD is 99%, and on the right you can see the game with MangoHud, which reports GPU load at 0%. (Note that even though the screenshot shows 0%, MangoHud doesn't always report 0; it varies between 0% and about 40%.)