These NVML calls don't get throttled in the driver and are pretty serializing, since they need to talk to the PMU and/or GSP. Normally this isn't a big deal for a game in steady state, but any texture churn or similar is likely to hit the KMD and serialize on the same locks. This is unlikely to affect fps much (but see below), but it can result in microstutters.
This PR just makes it so we don't query stuff that would be discarded anyway. I'm prototyping another change that spreads out the queries such that they have less impact. In parallel we have some solutions for this on the driver level, but even then (or especially then) it would be wasteful to request data that's not needed.
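To illustrate the idea of skipping queries whose results would be discarded, here is a minimal Python sketch. It is not the code from this PR: `FakeGpu`, `parse_config`, `poll` and `ALL_METRICS` are hypothetical names, and the fake backend only counts calls instead of talking to NVML.

```python
from dataclasses import dataclass, field

# Hypothetical metric list, mirroring the MANGOHUD_CONFIG keys used in the test below.
ALL_METRICS = ("gpu_power", "gpu_temp", "gpu_core_clock",
               "gpu_mem_clock", "gpu_fan", "vram")


@dataclass
class FakeGpu:
    """Stand-in for the driver (assumption): records every query that reaches it."""
    calls: list = field(default_factory=list)

    def query(self, metric: str) -> int:
        self.calls.append(metric)
        return 0  # placeholder; a real backend would return sensor data


def parse_config(config: str) -> set:
    """Split a comma-separated option string into the set of enabled metrics."""
    return {opt.strip() for opt in config.split(",") if opt.strip()}


def poll(gpu: FakeGpu, enabled: set) -> dict:
    """Query only the metrics that will be displayed; skip the rest entirely."""
    return {m: gpu.query(m) for m in ALL_METRICS if m in enabled}


gpu = FakeGpu()
sample = poll(gpu, parse_config("gpu_core_clock,gpu_mem_clock"))
print(sorted(sample))   # ['gpu_core_clock', 'gpu_mem_clock']
print(len(gpu.calls))   # 2
```

With all six options enabled, the same loop would issue six queries per poll; with only the two clock options, it issues two.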
Testing with __GL_SYNC_TO_VBLANK=0 glxgears and MANGOHUD_CONFIG=gpu_power,gpu_temp,gpu_core_clock,gpu_mem_clock,gpu_fan,vram:
31323 frames in 5.0 seconds = 6264.499 FPS
31040 frames in 5.0 seconds = 6207.880 FPS
32090 frames in 5.0 seconds = 6417.896 FPS
but changing to just export MANGOHUD_CONFIG=gpu_core_clock,gpu_mem_clock:
33845 frames in 5.0 seconds = 6768.915 FPS
34114 frames in 5.0 seconds = 6822.672 FPS
33406 frames in 5.0 seconds = 6681.061 FPS
already shows an fps improvement. However, the bigger improvement can be seen with this bpftrace script:
// Should only run after profiled app is already fully initialized. e.g.:
// 1. start `mangohud glxgears` in one terminal
// 2. Run `bpftrace mangoctrl.bt` in second terminal
// 3. Wait up to 10 seconds
// 4. ctrl+c kill bpftrace
// 5. Exit glxgears
BEGIN {
    @starttime = nsecs;
}

kprobe:nvidia_ioctl {
    if ((arg2 & 0xff) == 0x2A) { // NvRmControl
        if (comm=="glxgears") {
            @nvioctl_nsec[tid] = nsecs;
        }
    }
}

kretprobe:nvidia_ioctl / @nvioctl_nsec[tid] / {
    $elapsed = (nsecs - @nvioctl_nsec[tid]) / 1000;
    @ctrl_stats = stats($elapsed);
    @ctrl_max = max($elapsed);
    @ctrl_hist = hist($elapsed);
    @total_ctrls = @total_ctrls + 1;
    printf("[%lu][%s] NvRmControl took %d us\n", nsecs, comm, $elapsed);
    delete(@nvioctl_nsec[tid]);
}

uprobe:/lib/x86_64-linux-gnu/libGLX.so.0:glXSwapBuffers {
    if (@lastframe != 0) {
        $frametime_us = (nsecs - @lastframe) / 1000;
        @framehist = hist($frametime_us);
        @framestats = stats($frametime_us);
    }
    @lastframe = nsecs;
}

END {
    $measured_time_ms = (nsecs - @starttime) / 1000000;
    delete(@starttime);
    printf("Total measured time: %d ms\n", $measured_time_ms);
    printf("Controls per second: %d\n", (@total_ctrls * 1000) / $measured_time_ms);
}
In the first case I get "Controls per second: 22" and in the second "Controls per second: 8", with less microstutter overall across the apps.
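For a rough sense of scale, the numbers quoted in this description can be summarized with a few lines of Python. This is only arithmetic on the three glxgears samples per configuration and the two control rates reported here; the variable and function names are illustrative, and three 5-second samples are too few to treat the result as more than indicative.

```python
from statistics import mean

# glxgears FPS samples copied from the runs in this description.
fps_all_metrics = [6264.499, 6207.880, 6417.896]   # six GPU options enabled
fps_clocks_only = [6768.915, 6822.672, 6681.061]   # only the two clock options


def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (positive means an increase)."""
    return (after - before) / before


fps_gain = relative_change(mean(fps_all_metrics), mean(fps_clocks_only))
ctrl_drop = relative_change(22, 8)  # controls per second reported by the bpftrace script

print(f"mean FPS: {mean(fps_all_metrics):.1f} -> {mean(fps_clocks_only):.1f} ({fps_gain:+.1%})")
print(f"controls per second: 22 -> 8 ({ctrl_drop:+.1%})")
```

On these samples the mean FPS rises by roughly 7%, while the control rate drops by roughly 64%.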