Closed dankamongmen closed 4 years ago
`perf` indicates that `dig_visual_cell()` dominates our runtime, so this could actually be quite useful, especially once O(1) damage maps come into play.
With O(1) damage having been merged last evening, `perf` indicates that the vast majority of our time is now being spent in the rectilinear sweep of `dig_visible_cell()`. This is embarrassingly parallel, and throwing another thread at it could probably cut our rendering latency down by anywhere from ~20% to ~40%, I'd think. Throw two at the render, one at the top and one at the middle, transition the `blocking_write()` from a single buffer to a `struct iovec` scatter-gather, gate initiation of the `writev()` on the first render completing... HOLY SHIT, we can relax that false constraint, too -- if the bottom render finished significantly prior to the top render, just throw a cursor move in that sumbitch, boom motherfuckers! w00000t this seems very promising indeed, and probably especially useful on low-frequency multicores (think Raspberry Pi and similar ARM pootwahcores).
I've broken up rasterizing and rendering, and we actually appear to have picked up a small win from it, perhaps due to better use of cache, not sure. But excellent. I can now toss in a thread and split up at least the rendering step. As noted above, an iovec would then provide an easy path to parallel rasterizing.
Keep a stat on thread-assisted renders if we do this.
I got a version working in the `dankamongmen/threaded-render` branch. Doing so required--something I didn't realize initially--locking `nc->pool` to work properly. Once this (contested) lock was added, the optimal case of the threaded version--running with no delay, with threads--saw a 12s runtime grow to 17s. Ugh. This was a 104x78 geometry.
I enlarged the geometry to a full screen (382x78), and 43s went to 70s.
We'd need to eliminate the lock on `nc->pool` to possibly get a win off this, and I just don't care to put more time into this until we're actually slow enough to not exceed all known frame rates by a factor of 10. Sigh.
We could start writing the buffer pretty much immediately if we were to have a second thread picking up at explicit fflush() points, ideally following BUFSIZ-size chunkouts. This ought only be done if there are at least two schedulable units -- the primary render process is entirely computation- and memory-access-bound (so hypercores are fine).