fabiangreffrath / woof

Woof! is a continuation of the Boom/MBF bloodline of Doom source ports.
GNU General Public License v2.0
186 stars 32 forks source link

Remove 64-bit integer divides from performance sensitive code #1769

Open gendlin opened 2 weeks ago

gendlin commented 2 weeks ago

These are pretty bad from a performance standpoint, see table below (from Agner Fog's instruction latency tables for Nehalem). Just use doubles for high-precision timekeeping instead, and use multiplication instead of division where we can.

operation instruction cycles uops
uint64 divide div 28-90 40
int64 divide idiv 37-100 60

compare to:

operation instruction cycles uops
fp64 divide divsd 7-22 1
int32 divide idiv 17-28 7

and:

operation instruction cycles uops
int64 to double cvtsi2sd 6 2
double to int64 cvtsd2si 5 1
int32 multiply imul 3 1
fp64 multiply mulsd 5 1
rfomin commented 2 weeks ago

I don't think these changes are necessary as this is not performance sensitive code. Again, we have very little calculations in these functions, it's nothing compared to the system calls in SDL_GetPerformanceCounter and Win API functions. I'm sure this improvement is not measurable. In cases like this, I prefer code clarity - it's not obvious why we need to store microseconds in double.

Doom was originally written for 386/486 processors and didn't use float point at all. I recommend looking into m_fixed.h and try using float point calculations in the renderer, you can probably get good measurable results here. Note that we can't use float point in the game simulation (p_* files) due to demo compatibility.

fabiangreffrath commented 1 week ago

I don't know where exactly you've got these numbers from. But I doubt that any recent system with a native 64-bit pointer size performs so much worse when dividing integers. And yet again, we are not "optimizing" this port for 2000-era computers.

rfomin commented 1 week ago

I don't know where exactly you've got these numbers from.

It's probably from these tables: https://www.agner.org/optimize/instruction_tables.pdf Also https://stackoverflow.com/a/55833042

However, I think that optimizing a few instructions in the timer code doesn't improve performance, it's an unnecessary complication.

gendlin commented 1 week ago

I doubt that any recent system with a native 64-bit pointer size performs so much worse when dividing integers

Unfortunately not true, i64 div is very expensive all the way up through Coffee Lake (where it is essentially the same cost as on Nehalem). Coffee Lake was made 2017-2021.

It's probably from these tables: https://www.agner.org/optimize/instruction_tables.pdf

Yes, thanks for the link.

gendlin commented 1 week ago

Some measurements on my machine based on test code here. I_GetTimeUS on master has a 50% overhead, and introduces a big pipeline bubble wherever it is called.

time-test

gendlin commented 1 week ago

it's an unnecessary complication

Check the diff, the resulting code is smaller and fits better with callers that are already doing double arithmetic on time values anyway.

rfomin commented 1 week ago

Some measurements on my machine based on test code here. I_GetTimeUS on master has a 50% overhead, and introduces a big pipeline bubble wherever it is called.

What I meant was that this improvement would not be noticeable if you were measuring game performance. Again - this code is not performance sensitive for us and we do not micro optimise all functions, code clarity is more important. Although it's interesting that QueryPerformanceCounter is so fast, usually system calls are much slower.

fabiangreffrath commented 1 week ago

Doom was originally written for 386/486 processors and didn't use float point at all.

Did some Doom spelunking, there is one instance of 64-bit division here - it's commented out and replaced with doubles.

I have seen that you removed this comment, but I feel the need to reply anyway.

First, it's called FixedDiv() (and not FloatDiv()) for a reason. The code you are referring to is the source released under the linuxdoom moniker. It does not necessasrily match the actual DOS Doom source code, because some functions were translated from ASM to C approximations - among then FixedDiv().

In fact, one of the early things that got fixed in Chocolate Doom was the reversal of the double division back to fixed point division to preserve demo compatibility:

https://github.com/chocolate-doom/chocolate-doom/commit/9815f0f2325ec973c899db766faed87cac49e64d

In the actual reverse-engineered DOS Doom source code the FixedDiv() function is indeed implemented in ASM and calls cdq (convert doubleword to quadword) first and then idiv, so that is clearly a 64-bit integer division:

https://bitbucket.org/gamesrc-ver-recreation/doom/src/bd6beabfd561613b671fe5b8842e7d373731d2c5/doomdef.h#lines-789

gendlin commented 1 week ago

so that is clearly a 64-bit integer division

It's not - ebx is a 32-bit register. There is no i64 division on 386.

preserve demo compatibility

Just to be clear there is no impact on demo compatibility with these changes. Double arithmetic is already being done on these time values elsewhere, I'm not doing anything groundbreaking here.