Closed pcodex closed 5 years ago
@hhirsch hold off merge until I will mark it ready for review.
@hhirsch ready for merge
@hhirsch would this be merged?
@wow2006 Can you check this?
This optimization is misguided and, in fact, will either do nothing or degrade performance.
For all platforms where float
is 32-bit and the memory bus width is at least 32 bit wide (i.e. all systems glPortal will ever target), 3 things will determine actual run-time overhead of this change:
On any remotely modern system, barring CPU caches, writes are as expensive as reads.
On write-back / write-allocate caches (x86, ARM, ...), a write also costs a read if the address isn't already in cache, so you're virtually not losing time by reading it for the if
s (a cache hit for the later write), but if the compiler doesn't optimize away the conditions (because x
, y
, z
, and w
aren't volatile
), then you lose time doing a compare operation followed by a branch, to no benefit since the cache will already avoid writing the value back if it hasn't changed, and if it did, would do so at a later time, not slowing down the assignment.
On other cache systems (including cache-less), if the conditions don't get optimized, you'd be losing even more time by forcing a read, going from the original 1 memory transaction (a write), to possibly 2 (1 read possibly followed by 1 write)
In the event the compiler optimizes the conditions away, this change is 100% useless. Unfortunately, none of GCC, clang, ICC and MSVC do that optimization (see Compiler Explorer), so this change will just make things worse.
In fact, #197 suffers from the exact same problem.
>>>since the cache will already avoid writing the value back if it hasn't changed, and if it did, would do so at a later time, not slowing down the assignment.
This argument doesn't address a cache miss where the block then has to be read. In such a case without any way to distinguish the dirty bits doesn't the entire cache line get written out to memory?
This optimization is misguided and, in fact, will either do nothing or degrade performance. For all platforms where
float
is 32-bit and the memory bus width is at least 32 bit wide (i.e. all systems glPortal will ever target), 3 things will determine actual run-time overhead of this change:
- memory write time cost compared to reads
- the nature of the CPU's cache
- compiler optimizations
On any remotely modern system, barring CPU caches, writes are as expensive as reads.
On write-back / write-allocate caches (x86, ARM, ...), a write also costs a read if the address isn't already in cache, so you're virtually not losing time by reading it for the
if
s (a cache hit for the later write), but if the compiler doesn't optimize away the conditions (becausex
,y
,z
, andw
aren'tvolatile
), then you lose time doing a compare operation followed by a branch, to no benefit since the cache will already avoid writing the value back if it hasn't changed, and if it did, would do so at a later time, not slowing down the assignment.On other cache systems (including cache-less), if the conditions don't get optimized, you'd be losing even more time by forcing a read, going from the original 1 memory transaction (a write), to possibly 2 (1 read possibly followed by 1 write)
In the event the compiler optimizes the conditions away, this change is 100% useless. Unfortunately, none of GCC, clang, ICC and MSVC do that optimization (see Compiler Explorer), so this change will just make things worse.
In fact, #197 suffers from the exact same problem.
I totally agree with @ElementW
Check existing values