Closed: yakra closed this issue 6 months ago
**Update:** TMBitsets for edges are implemented in d218c4d, quite different from this original conceptualization. Only 2 boxes can be checked off (even then, with some edits); the rest are just no longer relevant.
This may or may not amount to much, but it's an idea. Long-term / low priority?
- `HGVertex` membership has used a bit field since #521, greatly improving speed.
- `TravelerList` membership, as noted in the last post.
- `HGEdge`s. The technique relies on pointer arithmetic to index into the bit field, which requires the objects be stored in contiguous memory. Not so easily done when edges blink into & out of existence during the compression process. This leaves us with each `HGEdge` having a `written` array, `numthreads` bytes here & `numthreads` bytes there, scattered across the heap, with 5/8 of the bits wasted. With some upfront cost, we can implement a bit field after all, and get a lot more info into cache a lot quicker.
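For reference, the pointer-arithmetic technique works roughly like this (a minimal sketch with invented `Vertex`/`MembershipBits` names, not the real siteupdate types): membership for object `v` lives at bit `v - base` of a packed array, so lookup is two shifts and a mask, with no hashing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vertex { double lat, lng; };   // stand-in for HGVertex

// Per-thread membership bit field indexed by pointer arithmetic.
// Only works if every Vertex lives in one contiguous array.
struct MembershipBits {
    const Vertex* base;               // start of the contiguous array
    std::vector<uint8_t> bits;        // 1 bit per vertex

    MembershipBits(const Vertex* b, size_t count)
        : base(b), bits((count + 7) / 8, 0) {}

    void add(const Vertex* v)
    {   size_t i = v - base;          // index = pointer difference
        bits[i >> 3] |= uint8_t(1) << (i & 7);
    }
    bool contains(const Vertex* v) const
    {   size_t i = v - base;
        return bits[i >> 3] >> (i & 7) & 1;
    }
};
```

This is why edges need contiguous storage before they can get the same treatment: `v - base` is only meaningful within a single allocation.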
- A `static unordered_set`. A set so we can erase via the memory address itself and not have to muck about with iterators. `HGEdge::detach` can erase from this set when deleting temporary edges.
- ~`HGEdge::detach` could go bye-bye completely.~ With `HGEdge` objects stored in a `forward_list` or set or whatever, no need to attach anything to keep track of our memory addresses in the first place. `erase` the object when `format` reaches 0, essentially refactoring that out of `detach`.
- A `vector` to store `HGEdge` objects. Reserve necessary space to avoid resizing & the moves that entails. **Update:** Make that a `TMArray`. And `alloc`ing it, 'cuz that's how TMArrays work.
- `delete` the originals.
- Move ctor?
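A quick sketch of why reserving up front (or using a TMArray) matters, using plain `std::vector` and a hypothetical `MovableEdge` type: once capacity is reserved, `emplace_back` never reallocates, so element addresses stay stable and no move construction happens mid-stream.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct MovableEdge {
    std::vector<int> intermediate;    // owns heap data, cheap to move
    explicit MovableEdge(std::vector<int> p) : intermediate(std::move(p)) {}
};

// Returns true if the first element's address survived n insertions.
bool addresses_stable(size_t n)
{
    std::vector<MovableEdge> edges;
    edges.reserve(n);                 // one allocation up front
    const MovableEdge* first = nullptr;
    for (size_t i = 0; i < n; ++i) {
        edges.emplace_back(std::vector<int>{int(i)});
        if (i == 0) first = &edges.front();
    }
    return first == &edges.front();   // no reallocation ever moved it
}
```

Without the `reserve`, any `emplace_back` may reallocate and move every element, invalidating stored pointers; with it, the addresses are as stable as a TMArray's.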
- `std::list<HGVertex*> intermediate_points`.
- `~HGVertex()`: Don't call `HGEdge::detach` anymore. Ownership & deletion managed by the ~vector~ `TMArray`.
- `HighwayGraph::clear`: Clear the vector.

If I really wanna get fancy? Use a `list` and muck about with iterators after all; `push_front` & save `begin()` to the `HGEdge` object. How much RAM will this use? 8 B each. Not bad.

First priority though should be commenting out parts of the code and seeing what sections cause the most slowdown.
- Convert `list`s (or anything I can) to `vector`s or `TMArray`s
- ~`back()` instead of `front()` during vertex deletion?~ N/A; deletion no longer detaches.
- ~this …~ Going the one `incident_edges` route instead.
- ~`std::vector<std::tuple<HGEdge*,HGEdge*,HGEdge*>> incident_edges` …~ Going the one `incident_edges` route instead.
- ~(… `if`s be done? I need coffee.)~ N/A; using one vector.
- `HGEdge::detach` as we know it could potentially cease to exist.
- Reintroduce custom dtor, or just have ownership outside the adjacency lists. Update: "as we know it" is the key phrase. `HGEdge::detach` still exists, greatly simplified. Dtor will come back into play with https://github.com/yakra/DataProcessing/issues/264
- `std::vector<HGEdge*> incident_edges`. No tuples. Simple, collapsed & traveled edges can be and often are the exact same object. Reduce the amount of redundant iteration & `e->written[threadnum]` checks, and zeroing out `e->written[threadnum]` (which can go from a bitmask to a simple bool, though char serves that purpose just fine) too if not going the TMBitset route.
- Nix the `switch` (and this one) completely.
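The single-vector idea can be sketched like so (types and `Format` mask values are hypothetical stand-ins for the real ones): each edge carries a bitmask of the graph formats it appears in, and one pass over `incident_edges` serves all three writers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum Format : uint8_t { simple = 1, collapsed = 2, traveled = 4 };

struct FmtEdge { uint8_t format; };   // stand-in for HGEdge

// One loop over one vector, with a mask test replacing three
// separate incident-edge containers.
size_t count_for(const std::vector<FmtEdge*>& incident_edges, uint8_t fmt)
{
    size_t n = 0;
    for (FmtEdge* e : incident_edges)
        if (e->format & fmt)
            ++n;
    return n;
}
```

An edge that's both collapsed & traveled is one object with two bits set, rather than the same pointer living in two containers.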
Can even do this now, though probably microscopically slower & not worth the bother. We lose the conditional jumps per `v.visibility` and gain a conditional for attempting to iterate an empty container, plus a few instructions to set up the loop beginning.

- `collapsed_tmg_line` and `traveled_tmg_line`: creating/writing segment name (as needed), and iterating intermediate points & formatting the string. Combine into a single function that sets up the data once, then checks the format mask and writes to one or more files, via `if`s.
- `incident_c_edges.size() == 2`. Need to keep a `simple_edge_count` during initial construction.
- … `edge1`.
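A sketch of the "set up once, then fan out by mask" shape for the combined writer (the function name, `Fmt` values, and use of generic streams are all assumptions; the real code writes TMG files):

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

enum Fmt : uint8_t { fmt_collapsed = 1, fmt_traveled = 2 };

// The expensive work (segment name, intermediate points, string
// formatting) happens once, upstream, producing `line`; then the
// mask decides which file(s) the result goes to.
void write_edge_line(uint8_t mask, const std::string& line,
                     std::ostream& collapsed_file, std::ostream& traveled_file)
{
    if (mask & fmt_collapsed) collapsed_file << line << '\n';
    if (mask & fmt_traveled)  traveled_file  << line << '\n';
}
```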
- ~`std::vector<std::pair<double,double>>`? 8 B more per edge, but will it be faster? Reduced indirection?~ A `forward_list` OTOH could bring `sizeof(HGEdge)` down to 40 B, but we'd lose the locality & speed benefits of `vector`. What outweighs what?
- ~`HGVertex*`es as done now, or store 2 `double`s to cut down on pointer chasing.~ Update: This bullet point has its own issue now.
- `regions` and especially `systems` lists

Call me cautiously optimistic.
- [ ] Comment things out & see how much things speed up. Do the same for lab3 and/or lab4 for comparison.
- [ ] .sql file
Not writing the DB file in the background saves a decent chunk of time, usually hovering around 10-11 s, even as much as 17.26 s @ 10 threads. Don't know yet how this compares to writing the DB as its own single-threaded task. Easy to find out by running siteupdateST, but bsdlab is running some other speed tests at the moment that I don't want to disrupt. The curve keeps the same overall shape though; this isn't really helping the task scale any better. This deserves further exploration, in conjunction with other solutions.
- [ ] traveled graphs
This snippet of code is pretty expensive: https://github.com/TravelMapping/DataProcessing/blob/80f27243bf26579c53c6c7a215a5aeb22347317a/siteupdate/cplusplus/classes/GraphGeneration/HighwayGraph.cpp#L270-L274
First, we iterate through an `unordered_set`, which is expensive on its own. Then for each item therein, we follow a couple pointers & do some arithmetic, and look up a bool, then maybe do that again and set the bool to 1. (This part is pretty inexpensive; we just do it a whole bunch of times.) Let's gloss over adding the `TravelerList*` to `traveler_lists`, as that'd still happen in the alternative. More on that below.
What's the impact of commenting this out? Results for bsdlab are still pending, but some quick results for BiggaTomato @ 4 threads are in.
(`{}//` here to prevent buffer overflows.)

We still need to get our traveler lists and clinchedby_code, of course. We can't save all of that time, but can save some of it. Hopefully a lot of it.
What's the alternative?
A `TMBitset<TravelerList>` will replace the `unordered_set<TravelerList*>` for `HighwaySegment::clinched_by`. Each subgraph will have a `TMBitset` for its travelers. This class has a `|=` operator, so we can replace the above code block with `traveler_set |= e->segment->clinched_by;` and bitwise-OR the sets together. No iterating, no following `t->in_subgraph[threadnum]` around, no fuss. As we iterate through graph edges, we'll instantly find & set subgraph membership for between 8 & 64 (TMBitset is still a work in progress) TravelerLists at a time.
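The core of the idea, sketched with an invented `BitsetSketch` class (the real TMBitset's interface and word size may differ): `operator|=` merges membership one machine word at a time, so each OR handles up to 64 travelers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BitsetSketch {
    std::vector<uint64_t> words;
    explicit BitsetSketch(size_t nbits) : words((nbits + 63) / 64, 0) {}

    void set(size_t i)        { words[i >> 6] |= uint64_t(1) << (i & 63); }
    bool test(size_t i) const { return words[i >> 6] >> (i & 63) & 1; }

    // traveler_set |= e->segment->clinched_by; boils down to this:
    BitsetSketch& operator|=(const BitsetSketch& o)
    {
        for (size_t w = 0; w < words.size(); ++w)
            words[w] |= o.words[w];   // up to 64 travelers per OR
        return *this;
    }
};
```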
Then, after the loop finishes, depending on how I wanna refactor things, either:
`std::vector<TravelerList*> traveler_lists(traveler_set.begin(), traveler_set.end());` or `traveler_lists.insert(traveler_lists.end(), traveler_set.begin(), traveler_set.end());` with a `reserve` beforehand. Inserting elements should take the same amount of time either way.

All this sets us up for even more efficiency improvements, some mentioned upthread. Combined with other ideas to improve cache locality, this could hopefully put a good dent in graph generation time. Looking at the preliminary results for bsdlab as they start to come in, it's uncertain how much this (alone at least) will help scaling on FreeBSD. No magic bullet yet, but probably better than nothing. In any case, I'll be moving forward with this, as it's sure to speed things up on Linux.
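The two post-loop options side by side, sketched with `int` standing in for `TravelerList*` (function names hypothetical); note the range `insert` overload needs a position iterator:

```cpp
#include <cassert>
#include <set>
#include <vector>

std::vector<int> via_ctor(const std::set<int>& traveler_set)
{   // range constructor: allocates once, copies N elements
    return std::vector<int>(traveler_set.begin(), traveler_set.end());
}

std::vector<int> via_insert(const std::set<int>& traveler_set)
{   // reserve() + range insert: same allocation count, same copies
    std::vector<int> traveler_lists;
    traveler_lists.reserve(traveler_set.size());
    traveler_lists.insert(traveler_lists.end(),
                          traveler_set.begin(), traveler_set.end());
    return traveler_lists;
}
```

Both do one allocation and N element copies, which is why the timing should be a wash.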
> Looking at the preliminary results for bsdlab as they start to come in, it's uncertain how much this (alone at least) will help scaling on FreeBSD. No magic bullet yet, but probably better than nothing.
I had to recompile and diff the binaries to be sure.
As seems to be the pattern when I make graph generation changes that benefit Linux, they're just little to no help when compiled on FreeBSD. :sob:
Commenting out the code blocks mentioned above yielded no noticeable time savings. The commented-out version measured slower than 80f2724, so we're within margin of error. (There's quite a bit of variability in FreeBSD's graph generation time between runs, but on average, the big picture is like the chart at the top of the last post.)
The ceiling for the amount of time saved by using `TMBitset` to compile traveler lists for traveled graphs is... pretty much zero. :angry:

But `TMBitset` can't run slower! It just can't! :confounded:
But like I said,

> In any case, I'll be moving forward with this, as it's sure to speed things up on Linux.
And it's not just for graph generation.
Regurgitating the last few bullet points from #527's OP here, TMBitset
should save RAM, and speed up:
The downside, iterating through this class is pretty slow (How slow? Don't know yet; will find out soon), and will slow down Computing stats and writing the segments
SQL table.
Although, I do have a couple ideas to speed up iteration (one quick-n-dirty & probably less effective, one more complex & probably more efficient, not fully thought out yet) and to skip unnecessary iteration entirely.
- Find out when & how many edges are "flipped" during compression.
- `memset` instead of a for loop to zero out `cbycode`.
Edit: Tracking this bullet point at yakra#270 instead.

Well. This is disappointing. For quite a while now I've considered TMBitsets for subgraph vertex/edge membership as our best shot at getting semi-reasonable graph generation performance out of FreeBSD. But this may not come to pass.
This does wonders on Linux, cutting graph generation time by almost 30% on the right machine @ the right # of threads. (Still running benchmarks on the last few Linux machines; will eventually post a chart at yakra#251 or the pull request if I open it relatively soon.)
But on FreeBSD? Nothing. :frowning_face: In the chart above, sure, the red line is a tiny bit below no-build & partial build most of the time, but not any significant amount. We may just be looking at noise in the data.
Sorry @rickmastfan67, I JUST HAD TO do this!
What's next? `TMBitset::shrink_to_fit`.
Vertices are stored in memory (more or less -- more on this below) in the order they come out of the quadtree.
This approximates the geographic clustering most regions & systems will naturally have, thus doing a good job of minimizing the distance between the minimum & maximum HGVertex
pointers in each region or system.
A `TMBitset`, you may recall, stores 1 bit for every `<item>` in the system, whether it's in the set or not. This leads to oceans of "dead space" of all `0`s in the bit array, before the first & after the last relevant `1` bits.
After a `TMBitset` is fully populated, calling `shrink_to_fit` will do what it says on the tin, leaving the bit array to refer only to the slice of the parent `<item>` array containing all the objects in our set.

A few benefits: no need to store, and no time spent iterating, the dead space from the beginning of the bit array to the first `1`, and from the last `1` to the end of the bitset.

The master vertex array isn't 100% straightforward though.
The vertices are separated into hi_priority_points
followed by low_priority_points
to get the best results from the Straightforward_intersection
label simplification routine. This means our bitsets will also have a 3rd huge block of 0
bits in the middle too.
No big deal -- some pretty simple tweaks to WaypointQuadtree::graph_points
and HighwayGraph::simplify
will let us have the best of both worlds, both simplifying the labels in the desired order and storing the HGVertex
objects in the order their Waypoints come out of the quadtree.
This will push the vast majority of this dead space out to the beginning & end of each bitset, improving the optimization potential of `TMBitset::shrink_to_fit`.
Between all the dead space at the beginning, middle & end, some quick-n-dirty coding & spreadsheeting suggests the regional sets are 95% dead space and the system sets are 89% dead space. So, good potential for improvement here. How much will this all actually help out in the end? Remains to be seen. The TMBitsets themselves are a small amount of the data accessed during graph generation.
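A sketch of what `shrink_to_fit` could do internally (invented representation; the real TMBitset may differ): drop the all-zero words before the first set bit and after the last one, keeping an offset so indices still resolve correctly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ShrinkableBits {
    std::vector<uint64_t> words;
    size_t offset = 0;                 // index of the first stored word
    explicit ShrinkableBits(size_t nbits) : words((nbits + 63) / 64, 0) {}

    void set(size_t i) { words[(i >> 6) - offset] |= uint64_t(1) << (i & 63); }
    bool test(size_t i) const
    {
        size_t w = i >> 6;
        if (w < offset || w >= offset + words.size())
            return false;              // trimmed dead space: known 0
        return words[w - offset] >> (i & 63) & 1;
    }
    void shrink_to_fit()
    {   // trim leading & trailing all-zero words, remember the shift
        size_t lo = 0, hi = words.size();
        while (lo < hi && words[lo] == 0) ++lo;
        while (hi > lo && words[hi - 1] == 0) --hi;
        words = std::vector<uint64_t>(words.begin() + lo, words.begin() + hi);
        offset += lo;
    }
};
```

At 95% dead space in the regional sets, a trim like this cuts both the storage and the distance any iteration has to walk.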
Edges behave a little differently.
First, we have the simple edges, created in system->route->segment order, not so closely related to the quadtree.
Followed by all the collapsed/traveled edges, which are created in roughly quadtree order, based on the hidden vertex the edges are collapsed around.
Gaps in the 2nd category will be taken care of by the vertex reordering tweaks mentioned above, but this still leaves a gap between the simple and collapsed/traveled edges.
Again, no big deal. Edges are initially created in a `std::list`, then moved to a `TMArray` for permanent storage to allow `TMBitset` to work its magic. We could either use `std::list::sort` on the former or `std::sort` on the latter, whichever performs best. Use whatever sort criterion yields the most compact results; could be something as simple as `lhs.vertex1 < rhs.vertex1`.
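The sort itself is one-liner territory; a sketch with an `int` field standing in for the `vertex1` pointer:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct EdgeRec { int vertex1; };   // stand-in for HGEdge

// Order edges by first vertex so each region/system's edges cluster
// together, keeping the post-shrink_to_fit bit-array slice short.
void sort_edges(std::vector<EdgeRec>& edges)
{
    std::sort(edges.begin(), edges.end(),
              [](const EdgeRec& lhs, const EdgeRec& rhs)
              { return lhs.vertex1 < rhs.vertex1; });
}
```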
Aside from this, I'll continue experimenting with other cache/memory locality/bandwidth ideas noted upthread & elsewhere on GitHub.
From what I've seen so far however, effects are likely to be fairly minimal, probably lower impact than `TMBitset::shrink_to_fit`.
Nonetheless, I'll keep exploring.
On the CentOS front, lab2 took 3.07 s writing graph files at 5 threads, tantalizingly close to breaking the 3-second barrier.
Unfortunately, I don't have much optimism left for FreeBSD.
**Near-term**

- The `numthr` branch may be worth looking into eventually, but it's nothing specific to graph generation. It can be its own issue if need be.

**Long-term**

- Converting `HGVertex::unique_name` from a `string*` to a `const char*` will save a little indirection. Going directly to the `const char*` can save us from unnecessarily loading the `string` into cache from main memory, only to grab its `data()`. Update: Tracking this bullet point at yakra#270 instead.
- `HGEdge::written` is an array of `numthreads` bytes. If there are 64 threads, they could all be on the same cache line. This doesn't appear to be a problem on Linux, but still. A bit field may help here.
- `HighwayGraph` construction. An `unsigned int` fits in currently unused space. Init to -1.
- `HGEdge`.
- `TravelerList::in_subgraph`
- `HGVertex::s_vertex_num`, `c_vertex_num` & `t_vertex_num` are scattered all over the heap. One big-ol' `unsigned int[vertices.size()][3]` for each thread may help with locality & cache misses. OTOH, this may not help much more than the `vnums` solution attempted in 5070be4 ("v7gs" branch) et al. The data there were laid out so that cache contention should not have been an issue. OT3H, maybe a slight reduction in the amount of arithmetic to get our array indices will be of benefit? Update: This bullet point has its own issue now.
- `HGEdge` size & alignment -> 64
- The `TMBitSet` template class I've been dreaming up may help. Compare a5d4b2d on the "t3" branch...
- `TMBitSet` for `HighwaySegment::clinched_by` means:
  - ~`traveler_lists` … `TravelerList::traveler_num`.~ Keeping `traveler_num` for now. Using it is actually faster than losing it.
  - `TravelerList::in_subgraph` could also go bye-bye. These two items together mean way less chasing pointers all over the heap.
  - Instead of iterating `e->segment->clinched_by` and check/set `t->in_subgraph[threadnum]` for each, `traveler_lists` could be found via one `TMBitSet` obtained by bitwise OR-ing together the `clinched_by` for each segment.
  - What if `traveler_lists` is nixed altogether? As we iterate thru `traveler_set`, just keep a count instead of creating a vector. The trade-off is doing one more iteration thru `traveler_set` at the end of the traveled graph for traveler names. Is this outweighed by not having to construct the vector and do a modest number of allocations/reallocations? Update: Tracking this bullet point at yakra#252 and yakra#270 instead.
- `mvvec` & converting the ad hoc membership structures and functions to a `TMBitset`. Way more vertices = higher impact than `traveler_lists`.
- `std::vector<HGEdge*> incident_edges` -- iterate, check each format bit, do stuff. ~… `e->written[threadnum]`, not the full set of criteria.~ Edit: Not sure what I meant by the crossed-out bits. As of d218c4d, there's no more `e->written[threadnum]`, just a TMBitset.
- … `|=` them together; fullcustom graphs additionally `&=` them together.
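That last bullet in bit terms (a sketch with raw `uint64_t` words standing in for whole TMBitsets, and the full/fullcustom split inferred from the wording): OR builds the union of member sets, and fullcustom graphs then AND with the custom area of interest.

```cpp
#include <cassert>
#include <cstdint>

// Union of two membership sets: anything in either region/system set.
uint64_t full_graph(uint64_t region_bits, uint64_t system_bits)
{
    return region_bits | system_bits;
}

// Fullcustom: additionally intersect with the custom area's set.
uint64_t fullcustom_graph(uint64_t full_bits, uint64_t custom_bits)
{
    return full_bits & custom_bits;
}
```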