GlPortal / RadixEngine

A free and open game engine.
zlib License
148 stars 58 forks source link

Optimized assignment #198

Closed pcodex closed 5 years ago

pcodex commented 5 years ago

Check existing values

pcodex commented 5 years ago

@hhirsch hold off merge until I will mark it ready for review.

pcodex commented 5 years ago

@hhirsch ready for merge

pcodex commented 5 years ago

@hhirsch would this be merged?

hhirsch commented 5 years ago

@wow2006 Can you check this?

ElementW commented 5 years ago

This optimization is misguided and, in fact, will either do nothing or degrade performance. For all platforms where float is 32-bit and the memory bus width is at least 32 bit wide (i.e. all systems glPortal will ever target), 3 things will determine actual run-time overhead of this change:

On any remotely modern system, barring CPU caches, writes are as expensive as reads.

On write-back / write-allocate caches (x86, ARM, ...), a write also costs a read if the address isn't already in cache, so you're virtually not losing time by reading it for the ifs (a cache hit for the later write), but if the compiler doesn't optimize away the conditions (because x, y, z, and w aren't volatile), then you lose time doing a compare operation followed by a branch, to no benefit since the cache will already avoid writing the value back if it hasn't changed, and if it did, would do so at a later time, not slowing down the assignment.

On other cache systems (including cache-less), if the conditions don't get optimized, you'd be losing even more time by forcing a read, going from the original 1 memory transaction (a write), to possibly 2 (1 read possibly followed by 1 write)

In the event the compiler optimizes the conditions away, this change is 100% useless. Unfortunately, none of GCC, clang, ICC and MSVC do that optimization (see Compiler Explorer), so this change will just make things worse.

In fact, #197 suffers from the exact same problem.

pcodex commented 5 years ago

>>>since the cache will already avoid writing the value back if it hasn't changed, and if it did, would do so at a later time, not slowing down the assignment.

This argument doesn't address a cache miss where the block then has to be read. In such a case without any way to distinguish the dirty bits doesn't the entire cache line get written out to memory?

wow2006 commented 5 years ago

This optimization is misguided and, in fact, will either do nothing or degrade performance. For all platforms where float is 32-bit and the memory bus width is at least 32 bit wide (i.e. all systems glPortal will ever target), 3 things will determine actual run-time overhead of this change:

  • memory write time cost compared to reads
  • the nature of the CPU's cache
  • compiler optimizations

On any remotely modern system, barring CPU caches, writes are as expensive as reads.

On write-back / write-allocate caches (x86, ARM, ...), a write also costs a read if the address isn't already in cache, so you're virtually not losing time by reading it for the ifs (a cache hit for the later write), but if the compiler doesn't optimize away the conditions (because x, y, z, and w aren't volatile), then you lose time doing a compare operation followed by a branch, to no benefit since the cache will already avoid writing the value back if it hasn't changed, and if it did, would do so at a later time, not slowing down the assignment.

On other cache systems (including cache-less), if the conditions don't get optimized, you'd be losing even more time by forcing a read, going from the original 1 memory transaction (a write), to possibly 2 (1 read possibly followed by 1 write)

In the event the compiler optimizes the conditions away, this change is 100% useless. Unfortunately, none of GCC, clang, ICC and MSVC do that optimization (see Compiler Explorer), so this change will just make things worse.

In fact, #197 suffers from the exact same problem.

I totally agree with @ElementW