ValveSoftware / source-sdk-2013

The 2013 edition of the Source SDK
https://developer.valvesoftware.com/wiki/SDK2013_GettingStarted
Other
3.7k stars 1.99k forks source link

Commented out false sharing debug counters #436

Open ryandmaclean opened 6 years ago

ryandmaclean commented 6 years ago

These changes fix the false sharing of a debug counter in the triangle intersection test of the ray tracer (which is heavily used, particularly for ambient lighting, in vrad.exe). Note that the debug counter doesn't work correctly regardless as the memory operations are non-atomic, thus it's completely useless.

(Note: For testing purposes, max thread count was increased to 32 on both branches, but not included in this pull request. I'd request that would also be done. MAX_THREADS and MAX_TOOLS_THREADS need to be increased from '16' to '32'.)

As threads cross CCX boundaries, cache coherency traffic from false sharing becomes a major bottleneck and overall execution times slow down. Similar effects are seen on dual socket Xeon configurations. On lower core count processors without a split cache hierarchy, only diminishing returns are seen with increased thread counts (thus poor but not negative scaling). This largely solves the scaling problem and even makes multi-socket configurations viable for vrad.

Below: Testing on a Ryzen Threadripper (16 cores, 32 threads) with map "nmo_cleopas" in No More Room in Hell. Build options: -threads 32 -both -final -StaticPropPolys

vrad_thread_scaling

Note: In all of the above tests, affinity masks are set in Process Lasso to simulate the behavior of different CPU core topologies. '8 threads' test is masked to 1 CCX (simulating a typical quad core w/ SMT), '16 threads' test is masked to 2 Ryzen CCX (simulating a Ryzen 7 1800x), 32 threads map to all 4 CCX's (representing a 16 core / 32 thread Threadripper 1950x)

Thus, even on lower core count configurations or CPU configurations with poor cache performance (like AMD FX), performance improvement can be significant, especially when using -final in vrad compile options. Ryzen CPUs benefit the most due to their unusual cache topology.

EDIT: I realized that my title is a bit of a misnomer, the threads aren't really 'false sharing', it's just 'sharing' (with a race condition), but the effect is identical and unintended (i'd imagine), so very much in the same vein.

EDIT 2: Perf improvement on fx-6300 was initially over-stated (I misread the results somehow). I expect the speedup is higher on an FX-8350, however. Note that if you really crank up -extrasky you can approach near-linear scaling as BuildFaceLights takes up a larger fraction of the total execution time, and since BuildFaceLights seems to scale almost linearly with core count now. (assuming sufficiently large scene such that the total work item count is high enough to keep all cores saturated)

SharpOB commented 6 years ago

Nicely done. Given Valve's history with the repo (year since last closed PR), I doubt it'll be added, but nothing is stopping anyone from using the code regardless.

TravisWehrman commented 6 years ago

Excellent find! I'm curious how you discovered this, was this revealed indirectly by a profiler?

ryandmaclean commented 6 years ago

TravisWehrman commented 23 minutes ago Excellent find! I'm curious how you discovered this, was this revealed indirectly by a profiler?

Yes, though I did a lot of testing to isolate it. :)

prototype5885 commented 6 years ago

could you upload the compiled vrad_dll.dll for source 2013?

ryandmaclean commented 6 years ago

could you upload the compiled vrad_dll.dll for source 2013?

Sure! This one is compiled for sdk 2013 mp (I'm not sure if there's a difference between the mp and sp vrads? Haven't really checked). I've also made patched versions for CS:Go and CS:S.

Note: Due to an issue with the sdk 2013 vrad launcher failing to load the new dll (something to do with valve's dll loader?), I've also included a launcher that points to the 'vrad_dll-optimized.dll'. You'll need to point to the new vrad_optimized.exe in your vrad build tools path in the Hammer config to use it.

www.content.tophattwaffle.com/Content/modded_vrad_dll.zip

(Download hosted by TopHattWaffle, he has a lot of good content for level design tutorials etc. as well)

ghost commented 6 years ago

Good job!

ghost commented 6 years ago

After a recent CSGO update(2018/03/29),the vrad needs to update.

ryandmaclean commented 6 years ago

Hey colorsxy, Valve included my fix in a patch about a month after I made the initial pull request, so any manually patched vrad dll's I may have released within that period are no longer relevant to cs: go map makers.

ac1996 commented 5 years ago

Hi, thank you so much for your work. Can you please compile fixed VRAD for SDK 2013 Single Player. When compile map with VRAD_OPTIMIZED in SP - VRAD log say: "Cannot load the static props... encountered a stale map version. Re-vbsp the map."

If set VBSP, VVIS from MP and VRAD_OPTIMIZED -> COMPILE SUCCESSFUL, but hl2.exe application closes, but does not show an error.

I think it would be useful not only for me. And once again, thank you for the work done - you made our life easier.

"After three hourse"

vrad,vbsp-threat-fix.zip

I change raytrace.cpp, released compile - vrad.exe,vrad_dll.dll, vbsp.exe and now it's work fine for SDK 2013 Single Player (SP).

P.S. Thank you. (AMD FX 8320, Compiling speed increased two or even three times)

ellemti commented 5 years ago

is there is just cs-go version? if not is possible to make gmod/garry's mod version

Kefta commented 5 years ago

@ellemti This change was already merged into GMod.

ellemti commented 5 years ago

@Kefta i fell like compile are still longer lol

cblpop commented 5 years ago

My friend, maybe you have information, is this update hit SFM's vrad ???

jmcpit commented 2 months ago

Is this optimized vrad file also available for HL2DM?