flightlessmango / MangoHud

A Vulkan and OpenGL overlay for monitoring FPS, temperatures, CPU/GPU load and more. Discord: https://discordapp.com/invite/Gj5YmBb
MIT License

nvml: only query params that will actually be used #1307

Closed mtijanic closed 5 months ago

mtijanic commented 5 months ago

These NVML calls aren't throttled in the driver and are heavily serializing, since they need to talk to the PMU and/or GSP. Normally this isn't a big deal for a game in steady state, but any texture churn or similar is likely to hit the KMD and serialize on the same locks. This is unlikely to affect FPS much (but see below), but it can cause microstutters.

This PR just makes it so we don't query stuff that would be discarded anyway. I'm prototyping another change that spreads out the queries such that they have less impact. In parallel we have some solutions for this on the driver level, but even then (or especially then) it would be wasteful to request data that's not needed.
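To illustrate the idea of skipping queries whose results would be discarded, here is a minimal sketch in Python. It is not MangoHud's code: `FakeGpu`, `OPTION_TO_METRIC`, `parse_config` and `poll` are hypothetical names, and the fake device only counts calls instead of talking to NVML. The option names are the ones used in the MANGOHUD_CONFIG examples in this description.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class FakeGpu:
    """Hypothetical stand-in for an NVML device; it only counts queries."""
    calls: Dict[str, int] = field(default_factory=dict)

    def query(self, metric: str) -> float:
        # A real implementation would issue a driver call here.
        self.calls[metric] = self.calls.get(metric, 0) + 1
        return 0.0


# Assumed mapping from overlay option to the metric it needs.
OPTION_TO_METRIC = {
    "gpu_power": "power",
    "gpu_temp": "temperature",
    "gpu_core_clock": "core_clock",
    "gpu_mem_clock": "mem_clock",
    "gpu_fan": "fan_speed",
    "vram": "memory_info",
}


def parse_config(config: str) -> Set[str]:
    """Split a comma-separated option string into a set of option names."""
    return {opt.strip().split("=")[0] for opt in config.split(",") if opt.strip()}


def poll(gpu: FakeGpu, enabled: Set[str]) -> Dict[str, float]:
    """Query only the metrics whose option is enabled; skip everything else."""
    return {
        opt: gpu.query(metric)
        for opt, metric in OPTION_TO_METRIC.items()
        if opt in enabled
    }


if __name__ == "__main__":
    gpu = FakeGpu()
    poll(gpu, parse_config("gpu_core_clock,gpu_mem_clock"))
    print(sorted(gpu.calls))  # only the two clock metrics were queried
```

With the clock-only configuration, the fake device sees two queries per poll instead of six, which is the kind of reduction this change is after.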

Testing with __GL_SYNC_TO_VBLANK=0 glxgears and MANGOHUD_CONFIG=gpu_power,gpu_temp,gpu_core_clock,gpu_mem_clock,gpu_fan,vram:

31323 frames in 5.0 seconds = 6264.499 FPS
31040 frames in 5.0 seconds = 6207.880 FPS
32090 frames in 5.0 seconds = 6417.896 FPS

but changing to just export MANGOHUD_CONFIG=gpu_core_clock,gpu_mem_clock:

33845 frames in 5.0 seconds = 6768.915 FPS
34114 frames in 5.0 seconds = 6822.672 FPS
33406 frames in 5.0 seconds = 6681.061 FPS

already shows an fps improvement. However, the bigger improvement can be seen with this bpftrace script:

// Should only run after profiled app is already fully initialized. e.g.:
//     1. start `mangohud glxgears` in one terminal
//     2. Run `bpftrace mangoctrl.bt` in second terminal
//     3. Wait up to 10 seconds
//     4. ctrl+c kill bpftrace
//     5. Exit glxgears
BEGIN {
    @starttime = nsecs;
}

kprobe:nvidia_ioctl {
    if ((arg2 & 0xff) == 0x2A) { // NvRmControl
        if (comm=="glxgears") {
            @nvioctl_nsec[tid] = nsecs;
        }
    }
}

kretprobe:nvidia_ioctl / @nvioctl_nsec[tid] / {
    $elapsed = (nsecs - @nvioctl_nsec[tid]) / 1000;
    @ctrl_stats = stats($elapsed);
    @ctrl_max = max($elapsed);
    @ctrl_hist = hist($elapsed);
    @total_ctrls = @total_ctrls + 1;

    printf("[%lu][%s] NvRmControl took %d us\n", nsecs, comm, $elapsed);
    delete(@nvioctl_nsec[tid]);
}

uprobe:/lib/x86_64-linux-gnu/libGLX.so.0:glXSwapBuffers {
    if (@lastframe != 0) {
        $frametime_us = (nsecs - @lastframe) / 1000;
        @framehist = hist($frametime_us);
        @framestats = stats($frametime_us);
    }
    @lastframe = nsecs;
}

END {
    $measured_time_ms = (nsecs - @starttime) / 1000000;
    delete(@starttime);
    printf("Total measured time: %d ms\n", $measured_time_ms);
    printf("Controls per second: %d\n", (@total_ctrls * 1000) / $measured_time_ms);
}

In the first case I get Controls per second: 22, and in the second Controls per second: 8, with less microstutter overall across the apps.
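For a rough sense of scale, the numbers quoted above can be summarized with a short Python calculation. This only restates the three glxgears samples per configuration and the two control rates reported here. With so few samples it is an indication, not a benchmark.

```python
from statistics import mean

# glxgears samples quoted above (FPS over 5-second windows).
fps_all_params = [6264.499, 6207.880, 6417.896]   # six GPU options enabled
fps_clocks_only = [6768.915, 6822.672, 6681.061]  # only the two clock options

# NvRmControl rates reported by the bpftrace script for the two cases.
ctrls_all_params = 22
ctrls_clocks_only = 8


def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


fps_gain = pct_change(mean(fps_all_params), mean(fps_clocks_only))
ctrl_drop = pct_change(ctrls_all_params, ctrls_clocks_only)

print(f"mean FPS: {mean(fps_all_params):.1f} -> {mean(fps_clocks_only):.1f} ({fps_gain:+.1f}%)")
print(f"controls per second: {ctrls_all_params} -> {ctrls_clocks_only} ({ctrl_drop:+.1f}%)")
```

That works out to roughly a 7% gain in mean FPS and about 64% fewer controls per second.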

flightlessmango commented 5 months ago

looks good! thanks :+1: