DaemonEngine / Daemon

The Dæmon game engine. With some bits of ioq3 and XreaL.
https://unvanquished.net
BSD 3-Clause "New" or "Revised" License
294 stars 60 forks source link

Optimizing trap calls and giving extra 200 fps to anyone #846

Open illwieckz opened 1 year ago

illwieckz commented 1 year ago

So, I noticed that when I switch a local game with lowest graphics preset from prediction off and on (cg_nopredict off on client or g_synchronousClients 0 on server), performance drops from 1000fps to 500fps.

On an online game using a public server, and ultra graphics preset, disabling prediction gave me 200fps more. I actually reached 500fps on a public game.

It happens that when client-side prediction is enabled, most of the CPU time is spent in CG_PredictPlayerState, and in that function, most of the time is spent in trap_GetUserCmd.

There is such code in CG_PredictPlayerState:

    for ( cmdNum = current - CMD_BACKUP + 1; cmdNum <= current; cmdNum++ )
    {
        // get the command
        trap_GetUserCmd( cmdNum, &cg_pmove.cmd );

This code is calling trap_GetUserCmd 63 times per frame… 😱️

In engine, CMD_BACKUP is 64, this value was already 64 at Quake3 source code release time.. Some comment that were already there at the time also said:

allow a lot of command backups for very fast systems

So, despite the code always using the max, the comment says such max is not for everyone.

My first attempt to save performance without entirely disabling the prediction was to add a cvar that would only fetch a given amount of command backups. And it works. We may still use such cvar in graphics preset to do less prediction on lower one. The good thing with such patch is that it doesn't break engine compatibility.

But, but, but. I assume doing IPC is slow, very slow, compared to just running code directly in CPU cache.

On engine side, the function behind trap_GetUserCmd just does that:

bool CL_GetUserCmd( int cmdNumber, usercmd_t *ucmd )
{
    // cmds[cmdNumber] is the last properly generated command

    // can't return anything that we haven't created yet
    if ( cmdNumber > cl.cmdNumber )
    {
        Sys::Drop( "CL_GetUserCmd: %i >= %i", cmdNumber, cl.cmdNumber );
    }

    // the usercmd has been overwritten in the wrapping
    // buffer because it is too far out of date
    if ( cmdNumber <= cl.cmdNumber - CMD_BACKUP )
    {
        return false;
    }

    *ucmd = cl.cmds[ cmdNumber & CMD_MASK ];

    return true;
}

I see nothing in that function that can eat 200 or 500 fps. But well, IPC is always slower than what can do a code in CPU cache

So, I thought… What if we do a trap_GetUserCmds function that would fetch packets in one go? We would query the whole array of 64 commands in one go (or the amount we would only want if using a cvar to customize this), in a single IPC, and then, iterate over all the commands?

I tried to implement a CL_GetUserCmds function that does just that but my code failed to build because it missed some dedicated Write function.

What do you think about it? To implement the "single trap" I would need some help…

illwieckz commented 1 year ago

See this game-side PR for a cvar allowing the user to choose intermediate values between no prediction and heavy prediction:

See this PR for a WIP not-working not-build attempt to implement a function that would allow the client to fetch all the command backups in one trap call:

illwieckz commented 1 year ago

Here is a screenshot of Orbit profiling the Unvanquished game:

prediction

One can see that CG_DrawActiveFrame spends half its time in CG_PredictPlayerState, then all the time of CG_PredictPlayerState is spent in trap_GetUserCmd.

illwieckz commented 1 year ago

Thanks to @DolceTriade fixing my engine patch, here is how behaves the patch to fetch all backup commands in one call:

Before: you can clearly see on fps graph when I disable (huge fps boost) and enable prediction (huge fps drop):

prediction

After: you cannot notice when I disable and enable prediction (same high fps in both case):

prediction

illwieckz commented 1 year ago

New code running on Orbit:

prediction

CG_DrawActiveFrame now spends only 3% of its time in CG_PredictPlayerState (instead of 51%), then only 52% of CG_PredictPlayerState is spent in trap_GetUserCmds (instead of 96% in trap_GetUserCmd)…

illwieckz commented 1 year ago

So, now when cg_lagometer is off, only one trap_GetUserCmdArray call is done, once for all, which returns both the current cmd number and the whole cmd backup array, I totally removed trap_GetUserCurrentCmd and trap_GetUserCmd. The array fetched by trap_GetUserCmdArray is then used by all functions needing such data, without doing any extra call.

It happens that the code of the lagometer also does one trap_GetUserCmdArray call when enabled, but I wasn't confident at making both code reusing the data because I don't know when each code is run compared to each other.

prediction

So, if I hack the engine to unlock fps above 1000, the default plat23 scene in lowest graphics preset never go under 1000fps on my side. If I want to be crazy and look at the sky, I get 1600fps with some peaks at 1800fps… and if I disable cg_draw2D , the game just runs at more than 3000 fps…

prediction

prediction

prediction

illwieckz commented 1 year ago

Here is running a Release-like engine with Release-like nexe cgame/sgame, still reaching 1000fps:

prediction

illwieckz commented 1 year ago

I've done a similar optimization for trap_R_in_PVS, and when I enable r_smp I can reach 2000fps:

Edit: now pushed in illwieckz/entitypvs/sync.

prediction

illwieckz commented 1 year ago

We could also do a similar optimization to do both trap_S_UpdateEntityPosition and trap_S_UpdateEntityVelocity in one call, this would double the performance of CG_AddPacketEntities:

slow trap

DolceTriade commented 1 year ago

free fps for everyone!

illwieckz commented 1 year ago

So, summary of what's done:

The next performance glutton is trap_CM_MarkFragments, and then after that, trap_R_AddPolyToScene. They're both called by CG_ImpactMark. The cgame calls CG_ImpactMark each time it want's to add a mark. I have a patch that turns CG_ImpactMark into CG_RegisterImpactMark (it just stores the data to be processed later), then at the end of CG_DrawActiveFrame I call a new CG_ProcessImpactMarks function that processes all those registered impact marks. For now this can't bring any performance improvement since it just iterates the registered impact marks and calls trap_CM_MarkFragments the same way as before, but this will allow us to rewrite trap_CM_MarkFragments to send a whole array of impact marks at once. This last part (sending and receiving arrays of such data) isn't an easy task unfortunately, so I make a pause in my development for now.

ghost commented 9 months ago

If I want to be crazy and look at the sky,

I never see any sky on those screenshots?

Otherwise, the point is to prepare the code for an API improvement, right? What's the status of all this? I got curious about it while inspecting the config files to migrate them toward json, the marking system made me thought that it would be possible to have visible footsteps (snow, dirt...) but I did remembered about performance issues in marks, hence I found those discussions.

illwieckz commented 9 months ago

I never see any sky on those screenshots?

Here the idea to look at the sky is to look at a place where there are no entities. Sky textures being rendered or not doesn't count, and especially we better want to disable sky texture rendering to compare framerate right now because sky rendering suffers from a strong performance issue (see #849), and here we're not comparing trap call performance against sky rendering performance, but trap call performance against no trap call performance.

What's the status of all this?

My previous comment is the current status, it is up to date.

illwieckz commented 3 months ago

So, the current implementations for batch calls are:

¹ The first one is less needed since we already have an alternative working implementation, but we may still want it to make the code less convoluted, see this comment and following:

illwieckz commented 3 months ago

On game side, as far as I know every of those cvar checks are doing one trap call per frame:

pkg/unvanquished_src.dpkdir/ui/hud_basics.rml
184:        <if cvar="cg_minimapActive" condition="==" value="1">
191:            <if cvar="cg_drawTimer" condition="==" value="1">
194:            <if cvar="cg_drawClock" condition="!=" value="0">
197:            <if cvar="cg_drawFPS" condition="==" value="1">
203:            <if cvar="cg_lagometer" condition="==" value="1">
224:    <if cvar="cg_drawSpeed" condition="!=" value="0">
illwieckz commented 3 months ago

I wouldn't be surprised if the compiler optimizes by calling the trap call in all cases… 🤦‍♀️️

20240506-090309-000 unvanquished-orbit-trap

illwieckz commented 3 months ago

With all those batched calls plus this:

In stupid “_I look at ATCSHD outside floor with lowest preset and 640×480 resolution with cg_draw2D disabled and rsmp enabled” test the performance went from 3500 fps 10 days ago to 4000 fps today.

April 27:

unvanquished_2024-04-27_023034_000

May 7:

unvanquished_2024-05-07_091823_000

And the 3000 fps reached 10 days ago was mostly lucky, the framerate curve was far less flat, it was more averaging at 3000 fps, while today the game is really flirting with 4000 fps in a stable manner.

Also at the time my computer was almost running nothing but Unvanquished, while right now my system is full of other software messing with the resources (the CPU load in the top-left graph is the one of the whole system).

GPU is AMD Radeon PRO W7600.