Closed: yakra closed this issue 6 months ago
**Update:** TMBitsets for edges are implemented in d218c4d, quite different from this original conceptualization. Only 2 boxes can be checked off (even then, with some edits); the rest are just no longer relevant.
This may or may not amount to much, but it's an idea. Long-term / low priority?
- `HGVertex` membership has used a bit field since #521, greatly improving speed.
- `TravelerList` membership, as noted in the last post.
- `HGEdge`s. The technique relies on pointer arithmetic to index into the bit field, which requires the objects be stored in contiguous memory. Not so easily done when edges blink into & out of existence during the compression process. This leaves us with each `HGEdge` having a `written` array, `numthreads` bytes here & `numthreads` bytes there, scattered across the heap, with 5/8 of the bits wasted. With some upfront cost, we can implement a bit field after all, and get a lot more info into cache a lot quicker.
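For reference, the pointer-arithmetic technique works roughly like this (a minimal sketch with invented `Vertex`/`MembershipBits` names, not the real siteupdate types): membership for object `v` lives at bit `v - base` of a packed array, so lookup is two shifts and a mask, with no hashing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vertex { double lat, lng; };   // stand-in for HGVertex

// Per-thread membership bit field indexed by pointer arithmetic.
// Only works if every Vertex lives in one contiguous array.
struct MembershipBits {
    const Vertex* base;               // start of the contiguous array
    std::vector<uint8_t> bits;        // 1 bit per vertex

    MembershipBits(const Vertex* b, size_t count)
        : base(b), bits((count + 7) / 8, 0) {}

    void add(const Vertex* v)
    {   size_t i = v - base;          // index = pointer difference
        bits[i >> 3] |= uint8_t(1) << (i & 7);
    }
    bool contains(const Vertex* v) const
    {   size_t i = v - base;
        return bits[i >> 3] >> (i & 7) & 1;
    }
};
```

This is why edges need contiguous storage before they can get the same treatment: `v - base` is only meaningful within a single allocation.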
- A `static unordered_set`. A set so we can erase via the memory address itself and not have to muck about with iterators. `HGEdge::detach` can erase from this set when deleting temporary edges.
- ~`HGEdge::detach` could go bye-bye completely.~ With `HGEdge` objects stored in a `forward_list` or set or whatever, no need to attach anything to keep track of our memory addresses in the first place. `erase` the object when `format` reaches 0, essentially refactoring that out of `detach`.
- A `vector` to store `HGEdge` objects. Reserve necessary space to avoid resizing & the moves that entails. **Update:** Make that a `TMArray`. And `alloc`ing it, 'cuz that's how TMArrays work.
- `delete` the originals.
- Move ctor?
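A quick sketch of why reserving up front (or using a TMArray) matters, using plain `std::vector` and a hypothetical `MovableEdge` type: once capacity is reserved, `emplace_back` never reallocates, so element addresses stay stable and no move construction happens mid-stream.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct MovableEdge {
    std::vector<int> intermediate;    // owns heap data, cheap to move
    explicit MovableEdge(std::vector<int> p) : intermediate(std::move(p)) {}
};

// Returns true if the first element's address survived n insertions.
bool addresses_stable(size_t n)
{
    std::vector<MovableEdge> edges;
    edges.reserve(n);                 // one allocation up front
    const MovableEdge* first = nullptr;
    for (size_t i = 0; i < n; ++i) {
        edges.emplace_back(std::vector<int>{int(i)});
        if (i == 0) first = &edges.front();
    }
    return first == &edges.front();   // no reallocation ever moved it
}
```

Without the `reserve`, any `emplace_back` may reallocate and move every element, invalidating stored pointers; with it, the addresses are as stable as a TMArray's.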
- `std::list<HGVertex*> intermediate_points`.
- `~HGVertex()`: Don't call `HGEdge::detach` anymore. Ownership & deletion managed by the ~vector~ `TMArray`.
- `HighwayGraph::clear`: Clear the vector.

If I really wanna get fancy? Use a `list` and muck about with iterators after all; `push_front` & save `begin()` to the `HGEdge` object. How much RAM will this use? 8 B each. Not bad.

First priority though should be commenting out parts of the code and seeing what sections cause the most slowdown.
- Convert `list`s (or anything I can) to `vector`s or `TMArray`s
- ~`back()` instead of `front()` during vertex deletion?~ N/A; deletion no longer detaches.
- ~this …~ Going the one `incident_edges` route instead.
- ~`std::vector<std::tuple<HGEdge*,HGEdge*,HGEdge*>> incident_edges` …~ Going the one `incident_edges` route instead.
- ~(… `if`s be done? I need coffee.)~ N/A; using one vector.
- `HGEdge::detach` as we know it could potentially cease to exist.
- Reintroduce custom dtor, or just have ownership outside the adjacency lists. Update: "as we know it" is the key phrase. `HGEdge::detach` still exists, greatly simplified. Dtor will come back into play with https://github.com/yakra/DataProcessing/issues/264
- `std::vector<HGEdge*> incident_edges`. No tuples. Simple, collapsed & traveled edges can be and often are the exact same object. Reduce the amount of redundant iteration & `e->written[threadnum]` checks, and zeroing out `e->written[threadnum]` (which can go from a bitmask to a simple bool, though char serves that purpose just fine) too if not going the TMBitset route.
- Nix the `switch` (and this one) completely.
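The single-vector idea can be sketched like so (types and `Format` mask values are hypothetical stand-ins for the real ones): each edge carries a bitmask of the graph formats it appears in, and one pass over `incident_edges` serves all three writers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum Format : uint8_t { simple = 1, collapsed = 2, traveled = 4 };

struct FmtEdge { uint8_t format; };   // stand-in for HGEdge

// One loop over one vector, with a mask test replacing three
// separate incident-edge containers.
size_t count_for(const std::vector<FmtEdge*>& incident_edges, uint8_t fmt)
{
    size_t n = 0;
    for (FmtEdge* e : incident_edges)
        if (e->format & fmt)
            ++n;
    return n;
}
```

An edge that's both collapsed & traveled is one object with two bits set, rather than the same pointer living in two containers.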
Can even do this now, though probably microscopically slower & not worth the bother. We lose the conditional jumps per `v.visibility` and gain a conditional for attempting to iterate an empty container, plus a few instructions to set up the loop beginning.

- `collapsed_tmg_line` and `traveled_tmg_line`: creating/writing segment name (as needed), and iterating intermediate points & formatting the string. Combine into a single function that sets up the data once, then checks the format mask and writes to one or more files, via `if`s.
- `incident_c_edges.size() == 2`. Need to keep a `simple_edge_count` during initial construction.
- … `edge1`.
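A sketch of the "set up once, then fan out by mask" shape for the combined writer (the function name, `Fmt` values, and use of generic streams are all assumptions; the real code writes TMG files):

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

enum Fmt : uint8_t { fmt_collapsed = 1, fmt_traveled = 2 };

// The expensive work (segment name, intermediate points, string
// formatting) happens once, upstream, producing `line`; then the
// mask decides which file(s) the result goes to.
void write_edge_line(uint8_t mask, const std::string& line,
                     std::ostream& collapsed_file, std::ostream& traveled_file)
{
    if (mask & fmt_collapsed) collapsed_file << line << '\n';
    if (mask & fmt_traveled)  traveled_file  << line << '\n';
}
```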
- ~`std::vector<std::pair<double,double>>`? 8 B more per edge, but will it be faster? Reduced indirection?~ A `forward_list` OTOH could bring `sizeof(HGEdge)` down to 40 B, but we'd lose the locality & speed benefits of `vector`. What outweighs what?
- ~`HGVertex*`es as done now, or store 2 `double`s to cut down on pointer chasing.~ Update: This bullet point has its own issue now.
- `regions` and especially `systems` lists

Call me cautiously optimistic.
- [ ] Comment things out & see how much things speed up. Do the same for lab3 and/or lab4 for comparison.
- [ ] .sql file
Not writing the DB file in the background saves a decent chunk of time, usually hovering around 10-11 s, even as much as 17.26 s @ 10 threads. Don't know yet how this compares to writing the DB as its own single-threaded task. Easy to find out by running siteupdateST, but bsdlab is running some other speed tests at the moment that I don't want to disrupt. The curve keeps the same overall shape though; this isn't really helping the task scale any better. This deserves further exploration, in conjunction with other solutions.
- [ ] traveled graphs
This snippet of code is pretty expensive: https://github.com/TravelMapping/DataProcessing/blob/80f27243bf26579c53c6c7a215a5aeb22347317a/siteupdate/cplusplus/classes/GraphGeneration/HighwayGraph.cpp#L270-L274
First, we iterate through an `unordered_set`, which is expensive on its own. Then for each item therein, we follow a couple pointers & do some arithmetic, and look up a bool, then maybe do that again and set the bool to 1. (This part is pretty inexpensive; we just do it a whole bunch of times.) Let's gloss over adding the `TravelerList*` to `traveler_lists`, as that'd still happen in the alternative. More on that below.
What's the impact of commenting this out? Results for bsdlab are still pending, but some quick results for BiggaTomato @ 4 threads are in.
(`{}//` here to prevent buffer overflows.)

We still need to get our traveler lists and clinchedby_code, of course. We can't save all of that time, but can save some of it. Hopefully a lot of it.
What's the alternative?
A `TMBitset<TravelerList>` will replace the `unordered_set<TravelerList*>` for `HighwaySegment::clinched_by`. Each subgraph will have a `TMBitset` for its travelers. This class has a `|=` operator, so we can replace the above code block with `traveler_set |= e->segment->clinched_by;` and bitwise-OR the sets together. No iterating, no following `t->in_subgraph[threadnum]` around, no fuss. As we iterate through graph edges, we'll instantly find & set subgraph membership for between 8 & 64 (TMBitset is still a work in progress) TravelerLists at a time.
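The core of the idea, sketched with an invented `BitsetSketch` class (the real TMBitset's interface and word size may differ): `operator|=` merges membership one machine word at a time, so each OR handles up to 64 travelers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BitsetSketch {
    std::vector<uint64_t> words;
    explicit BitsetSketch(size_t nbits) : words((nbits + 63) / 64, 0) {}

    void set(size_t i)        { words[i >> 6] |= uint64_t(1) << (i & 63); }
    bool test(size_t i) const { return words[i >> 6] >> (i & 63) & 1; }

    // traveler_set |= e->segment->clinched_by; boils down to this:
    BitsetSketch& operator|=(const BitsetSketch& o)
    {
        for (size_t w = 0; w < words.size(); ++w)
            words[w] |= o.words[w];   // up to 64 travelers per OR
        return *this;
    }
};
```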
Then, after the loop finishes, depending on how I wanna refactor things, either:
`std::vector<TravelerList*> traveler_lists(traveler_set.begin(), traveler_set.end());` or `traveler_lists.insert(traveler_lists.end(), traveler_set.begin(), traveler_set.end());` with a `reserve` beforehand. Inserting elements should take the same amount of time either way.

All this sets us up for even more efficiency improvements, some mentioned upthread. Combined with other ideas to improve cache locality, this could hopefully put a good dent in graph generation time. Looking at the preliminary results for bsdlab as they start to come in, it's uncertain how much this (alone at least) will help scaling on FreeBSD. No magic bullet yet, but probably better than nothing. In any case, I'll be moving forward with this, as it's sure to speed things up on Linux.
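The two post-loop options side by side, sketched with `int` standing in for `TravelerList*` (function names hypothetical); note the range `insert` overload needs a position iterator:

```cpp
#include <cassert>
#include <set>
#include <vector>

std::vector<int> via_ctor(const std::set<int>& traveler_set)
{   // range constructor: allocates once, copies N elements
    return std::vector<int>(traveler_set.begin(), traveler_set.end());
}

std::vector<int> via_insert(const std::set<int>& traveler_set)
{   // reserve() + range insert: same allocation count, same copies
    std::vector<int> traveler_lists;
    traveler_lists.reserve(traveler_set.size());
    traveler_lists.insert(traveler_lists.end(),
                          traveler_set.begin(), traveler_set.end());
    return traveler_lists;
}
```

Both do one allocation and N element copies, which is why the timing should be a wash.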
> Looking at the preliminary results for bsdlab as they start to come in, it's uncertain how much this (alone at least) will help scaling on FreeBSD. No magic bullet yet, but probably better than nothing.
I had to recompile and diff the binaries to be sure.
As seems to be the pattern when I make graph generation changes that benefit Linux, they're just little to no help when compiled on FreeBSD. :sob:
Commenting out the code blocks mentioned above yielded no noticeable time savings. The commented-out version measured slower than 80f2724, so we're within margin of error. (There's quite a bit of variability in FreeBSD's graph generation time between runs, but on average, the big picture is like the chart at the top of the last post.)
The ceiling for the amount of time saved by using `TMBitset` to compile traveler lists for traveled graphs is... pretty much zero. :angry:

But `TMBitset` can't run slower! It just can't! :confounded:
But like I said,

> In any case, I'll be moving forward with this, as it's sure to speed things up on Linux.
And it's not just for graph generation.
Regurgitating the last few bullet points from #527's OP here, TMBitset
should save RAM, and speed up:
The downside, iterating through this class is pretty slow (How slow? Don't know yet; will find out soon), and will slow down Computing stats and writing the segments
SQL table.
Although, I do have a couple ideas to speed up iteration (one quick-n-dirty & probably less effective, one more complex & probably more efficient, not fully thought out yet) and to skip unnecessary iteration entirely.
- Find out when & how many edges are "flipped" during compression.
- `memset` instead of a for loop to zero out `cbycode`.
Edit: Tracking this bullet point at yakra#270 instead.

Well. This is disappointing. For quite a while now I've considered TMBitsets for subgraph vertex/edge membership as our best shot at getting semi-reasonable graph generation performance out of FreeBSD. But this may not come to pass.
This does wonders on Linux, cutting graph generation time by almost 30% on the right machine @ the right # of threads. (Still running benchmarks on the last few Linux machines; will eventually post a chart at yakra#251 or the pull request if I open it relatively soon.)
But on FreeBSD? Nothing. :frowning_face: In the chart above, sure, the red line is a tiny bit below no-build & partial build most of the time, but not any significant amount. We may just be looking at noise in the data.
Sorry @rickmastfan67, I JUST HAD TO do this!
What's next? `TMBitset::shrink_to_fit`.
Vertices are stored in memory (more or less -- more on this below) in the order they come out of the quadtree.
This approximates the geographic clustering most regions & systems will naturally have, thus doing a good job of minimizing the distance between the minimum & maximum HGVertex
pointers in each region or system.
A `TMBitset`, you may recall, stores 1 bit for every `<item>` in the system, whether it's in the set or not. This leads to oceans of "dead space" of all `0`s in the bit array, before the first & after the last relevant `1` bits.
After a `TMBitset` is fully populated, calling `shrink_to_fit` will do what it says on the tin, leaving the bit array to refer only to the slice of the parent `<item>` array containing all the objects in our set.

A few benefits: no need to store, and no time spent iterating, the dead space from the beginning of the bit array to the first `1`, and from the last `1` to the end of the bitset.

The master vertex array isn't 100% straightforward though.
The vertices are separated into hi_priority_points
followed by low_priority_points
to get the best results from the Straightforward_intersection
label simplification routine. This means our bitsets will also have a 3rd huge block of 0
bits in the middle too.
No big deal -- some pretty simple tweaks to WaypointQuadtree::graph_points
and HighwayGraph::simplify
will let us have the best of both worlds, both simplifying the labels in the desired order and storing the HGVertex
objects in the order their Waypoints come out of the quadtree.
This will push the vast majority of this dead space out to the beginning & end of each bitset, improving the optimization potential of `TMBitset::shrink_to_fit`.
Between all the dead space at the beginning, middle & end, some quick-n-dirty coding & spreadsheeting suggests the regional sets are 95% dead space and the system sets are 89% dead space. So, good potential for improvement here. How much will this all actually help out in the end? Remains to be seen. The TMBitsets themselves are a small amount of the data accessed during graph generation.
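A sketch of what `shrink_to_fit` could do internally (invented representation; the real TMBitset may differ): drop the all-zero words before the first set bit and after the last one, keeping an offset so indices still resolve correctly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ShrinkableBits {
    std::vector<uint64_t> words;
    size_t offset = 0;                 // index of the first stored word
    explicit ShrinkableBits(size_t nbits) : words((nbits + 63) / 64, 0) {}

    void set(size_t i) { words[(i >> 6) - offset] |= uint64_t(1) << (i & 63); }
    bool test(size_t i) const
    {
        size_t w = i >> 6;
        if (w < offset || w >= offset + words.size())
            return false;              // trimmed dead space: known 0
        return words[w - offset] >> (i & 63) & 1;
    }
    void shrink_to_fit()
    {   // trim leading & trailing all-zero words, remember the shift
        size_t lo = 0, hi = words.size();
        while (lo < hi && words[lo] == 0) ++lo;
        while (hi > lo && words[hi - 1] == 0) --hi;
        words = std::vector<uint64_t>(words.begin() + lo, words.begin() + hi);
        offset += lo;
    }
};
```

At 95% dead space in the regional sets, a trim like this cuts both the storage and the distance any iteration has to walk.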
Edges behave a little differently.
First, we have the simple edges, created in system->route->segment order, not so closely related to the quadtree.
Followed by all the collapsed/traveled edges, which are created in roughly quadtree order, based on the hidden vertex the edges are collapsed around.
Gaps in the 2nd category will be taken care of by the vertex reordering tweaks mentioned above, but this still leaves a gap between the simple and collapsed/traveled edges.
Again, no big deal. Edges are initially created in a `std::list`, then moved to a `TMArray` for permanent storage to allow `TMBitset` to work its magic. We could either use `std::list::sort` on the former or `std::sort` on the latter, whichever performs best. Use whatever sort criterion yields the most compact results; could be something as simple as `lhs.vertex1 < rhs.vertex1`.
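The sort itself is one-liner territory; a sketch with an `int` field standing in for the `vertex1` pointer:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct EdgeRec { int vertex1; };   // stand-in for HGEdge

// Order edges by first vertex so each region/system's edges cluster
// together, keeping the post-shrink_to_fit bit-array slice short.
void sort_edges(std::vector<EdgeRec>& edges)
{
    std::sort(edges.begin(), edges.end(),
              [](const EdgeRec& lhs, const EdgeRec& rhs)
              { return lhs.vertex1 < rhs.vertex1; });
}
```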
Aside from this, I'll continue experimenting with other cache/memory locality/bandwidth ideas noted upthread & elsewhere on GitHub.
From what I've seen so far however, effects are likely to be fairly minimal, probably lower impact than `TMBitset::shrink_to_fit`.
Nonetheless, I'll keep exploring.
On the CentOS front, lab2 took 3.07 s writing graph files at 5 threads, tantalizingly close to breaking the 3-second barrier.
Unfortunately, I don't have much optimism left for FreeBSD.
**Near-term**

- The `numthr` branch may be worth looking into eventually, but it's nothing specific to graph generation. It can be its own issue if need be.

**Long-term**

- Converting `HGVertex::unique_name` from a `string*` to a `const char*` will save a little indirection. Going directly to the `const char*` can save us from unnecessarily loading the `string` into cache from main memory, only to grab its `data()`. Update: Tracking this bullet point at yakra#270 instead.
- `HGEdge::written` is an array of `numthreads` bytes. If there are 64 threads, they could all be on the same cache line. This doesn't appear to be a problem on Linux, but still. A bit field may help here.
- `HighwayGraph` construction. An `unsigned int` fits in currently unused space. Init to -1.
- `HGEdge`.
- `TravelerList::in_subgraph`
- `HGVertex::s_vertex_num`, `c_vertex_num` & `t_vertex_num` are scattered all over the heap. One big-ol' `unsigned int[vertices.size()][3]` for each thread may help with locality & cache misses. OTOH, this may not help much more than the `vnums` solution attempted in 5070be4 ("v7gs" branch) et al. The data there were laid out so that cache contention should not have been an issue. OT3H, maybe a slight reduction in the amount of arithmetic to get our array indices will be of benefit? Update: This bullet point has its own issue now.
- `HGEdge` size & alignment -> 64
- The `TMBitSet` template class I've been dreaming up may help. Compare a5d4b2d on the "t3" branch...
- `TMBitSet` for `HighwaySegment::clinched_by` means:
  - ~`traveler_lists` … `TravelerList::traveler_num`.~ Keeping `traveler_num` for now. Using it is actually faster than losing it.
  - `TravelerList::in_subgraph` could also go bye-bye. These two items together mean way less chasing pointers all over the heap.
  - Instead of iterating `e->segment->clinched_by` and check/set `t->in_subgraph[threadnum]` for each, `traveler_lists` could be found via one `TMBitSet` obtained by bitwise OR-ing together the `clinched_by` for each segment.
  - What if `traveler_lists` is nixed altogether? As we iterate thru `traveler_set`, just keep a count instead of creating a vector. The trade-off is doing one more iteration thru `traveler_set` at the end of the traveled graph for traveler names. Is this outweighed by not having to construct the vector and do a modest number of allocations/reallocations? Update: Tracking this bullet point at yakra#252 and yakra#270 instead.
- `mvvec` & converting the ad hoc membership structures and functions to a `TMBitset`. Way more vertices = higher impact than `traveler_lists`.
- `std::vector<HGEdge*> incident_edges` -- iterate, check each format bit, do stuff. ~… `e->written[threadnum]`, not the full set of criteria.~ Edit: Not sure what I meant by the crossed-out bits. As of d218c4d, there's no more `e->written[threadnum]`, just a TMBitset.
- … `|=` them together; fullcustom graphs additionally `&=` them together.
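That last bullet in bit terms (a sketch with raw `uint64_t` words standing in for whole TMBitsets, and the full/fullcustom split inferred from the wording): OR builds the union of member sets, and fullcustom graphs then AND with the custom area of interest.

```cpp
#include <cassert>
#include <cstdint>

// Union of two membership sets: anything in either region/system set.
uint64_t full_graph(uint64_t region_bits, uint64_t system_bits)
{
    return region_bits | system_bits;
}

// Fullcustom: additionally intersect with the custom area's set.
uint64_t fullcustom_graph(uint64_t full_bits, uint64_t custom_bits)
{
    return full_bits & custom_bits;
}
```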