TravelMapping / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project

poor graph generation performance on FreeBSD #518

Closed yakra closed 6 months ago

yakra commented 2 years ago

Near-term

Long-term

yakra commented 1 year ago

**Update:** TMBitsets for edges are implemented in d218c4d, quite different from this original conceptualization. Only 2 boxes can be checked off (even then, with some edits); the rest are just no longer relevant.


Edge matching

This may or may not amount to much, but it's an idea. Long-term / low priority?

With some upfront cost, we can implement a bit field after all, and get a lot more info into cache a lot quicker.

If I really wanna get fancy?


First priority though should be commenting out parts of the code and seeing what sections cause the most slowdown.

yakra commented 1 year ago

Convert lists (or anything I can) to vectors or TMArrays

yakra commented 1 year ago

https://github.com/TravelMapping/DataProcessing/blob/7ec7e9a5aa13e8eed9cbe8ca6a2798c268d47652/siteupdate/cplusplus/classes/GraphGeneration/HGEdge.cpp#L46-L55

Change to if/else.

yakra commented 1 year ago

Call me cautiously optimistic.

  • [ ] Comment things out & see how much things speed up. Do the same for lab3 and/or lab4 for comparison.
    • [ ] .sql file

[Chart: db]

Not writing the DB file in the background saves a decent chunk of time, usually hovering around 10-11 s, even as much as 17.26 s @ 10 threads. Don't know yet how this compares to writing the DB as its own single-threaded task. Easy to find out by running siteupdateST, but bsdlab is running some other speed tests at the moment that I don't want to disrupt. The curve keeps the same overall shape though; this isn't really helping the task scale any better. This deserves further exploration, in conjunction with other solutions.


  • [ ] traveled graphs

This snippet of code is pretty expensive: https://github.com/TravelMapping/DataProcessing/blob/80f27243bf26579c53c6c7a215a5aeb22347317a/siteupdate/cplusplus/classes/GraphGeneration/HighwayGraph.cpp#L270-L274

First, we iterate through an unordered_set, which is expensive on its own. Then for each item therein, we follow a couple pointers & do some arithmetic, and look up a bool, then maybe do that again and set the bool to 1. (This part is pretty inexpensive; we just do it a whole bunch of times.) Let's gloss over adding the TravelerList* to traveler_lists, as that'd still happen in the alternative. More on that below.

What's the impact of commenting this out? Results for bsdlab are still pending, but some quick results for BiggaTomato @ 4 threads are in.

We still need to get our traveler lists and clinchedby_code, of course. We can't save all of that time, but can save some of it. Hopefully a lot of it.

What's the alternative? A TMBitset<TravelerList> will replace the unordered_set<TravelerList*> for HighwaySegment::clinched_by. Each subgraph will have a TMBitset for its travelers. This class has a |= operator, so we can replace the above code block with traveler_set |= e->segment->clinched_by; and bitwise-or the sets together. No iterating, no following t->in_subgraph[threadnum] around, no fuss. As we iterate through graph edges, we'll instantly find & set subgraph membership for between 8 & 64 (TMBitset is still a work in progress) TravelerLists at a time. Then, after the loop finishes, depending on how I wanna refactor things, either:

All this sets us up for even more efficiency improvements, some mentioned upthread. Combined with other ideas to improve cache locality, this could hopefully put a good dent in graph generation time. Looking at the preliminary results for bsdlab as they start to come in, it's uncertain how much this (alone at least) will help scaling on FreeBSD. No magic bullet yet, but probably better than nothing. In any case, I'll be moving forward with this, as it's sure to speed things up on Linux.
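A minimal sketch of the bitwise-or idea, not the real TMBitset: `PtrBitset` and `Traveler` are hypothetical stand-ins, and the sketch assumes every item lives in one contiguous parent array so that a pointer offset can serve as a bit index. The actual class in the repository may be laid out quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for TMBitset<T>: one bit per element of a
// contiguous parent array, indexed by pointer offset from its start.
template <class T>
class PtrBitset {
    const T* base_;                     // first element of the parent array
    std::size_t count_;                 // number of elements in the parent array
    std::vector<std::uint64_t> words_;  // 64 membership bits per word
public:
    PtrBitset(const T* base, std::size_t count)
        : base_(base), count_(count), words_((count + 63) / 64, 0) {}

    void insert(const T* item) {
        std::size_t i = static_cast<std::size_t>(item - base_);
        words_[i / 64] |= std::uint64_t{1} << (i % 64);
    }

    bool contains(const T* item) const {
        std::size_t i = static_cast<std::size_t>(item - base_);
        return (words_[i / 64] >> (i % 64)) & 1;
    }

    // Set union: merges 64 membership bits per loop iteration,
    // with no per-member pointer chasing.
    PtrBitset& operator|=(const PtrBitset& other) {
        assert(base_ == other.base_ && count_ == other.count_);
        for (std::size_t w = 0; w < words_.size(); ++w)
            words_[w] |= other.words_[w];
        return *this;
    }

    // Number of members, counted by clearing the lowest set bit repeatedly.
    std::size_t size() const {
        std::size_t n = 0;
        for (std::uint64_t w : words_)
            for (; w; w &= w - 1) ++n;
        return n;
    }
};

struct Traveler { int id; };  // hypothetical stand-in for TravelerList
```

With something of this shape, the per-edge work reduces to a single `traveler_set |= clinched_by;` over a few machine words, rather than a walk through a hash set.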

yakra commented 1 year ago

Disappointments, and victories...

> Looking at the preliminary results for bsdlab as they start to come in, it's uncertain how much this (alone at least) will help scaling on FreeBSD. No magic bullet yet, but probably better than nothing.

I had to recompile and diff the binaries to be sure. As seems to be the pattern when I make graph generation changes that benefit Linux, they're of little to no help when compiled on FreeBSD. :sob: Commenting out the code blocks mentioned above yielded no noticeable time savings. The commented-out version actually measured slower than 80f2724, so we're within the margin of error. (There's quite a bit of variability in FreeBSD's graph generation time between runs, but on average, the big picture is like the chart at the top of the last post.) The ceiling for the amount of time saved by using TMBitset to compile traveler lists for traveled graphs is... pretty much zero. :angry: But TMBitset can't run slower! It just can't! :confounded:

But like I said,

> In any case, I'll be moving forward with this, as it's sure to speed things up on Linux.

And it's not just for graph generation. Regurgitating the last few bullet points from #527's OP here, TMBitset should save RAM, and speed up:

The downside: iterating through this class is pretty slow (How slow? Don't know yet; will find out soon), and will slow down computing stats and writing the segments SQL table. I do have a couple ideas to speed up iteration, though (one quick-n-dirty & probably less effective; one more complex & probably more efficient, not fully thought out yet), and to skip unnecessary iteration entirely.

yakra commented 1 year ago

Find out when & how many edges are "flipped" during compression.

yakra commented 1 year ago
yakra commented 7 months ago

Well. This is disappointing.

[Chart: tmb_graph_freebsd]

For quite a while now I've considered TMBitsets for subgraph vertex/edge membership as our best shot at getting semi-reasonable graph generation performance out of FreeBSD. But this may not come to pass.

This does wonders on Linux, cutting graph generation time by almost 30% on the right machine @ the right # of threads. (Still running benchmarks on the last few Linux machines; will eventually post a chart at yakra#251 or the pull request if I open it relatively soon.)

But on FreeBSD? Nothing. :frowning_face: In the chart above, sure, the red line is a tiny bit below no-build & partial build most of the time, but not any significant amount. We may just be looking at noise in the data.

Sorry @rickmastfan67, I JUST HAD TO do this!


What's next? TMBitset::shrink_to_fit. Vertices are stored in memory (more or less -- more on this below) in the order they come out of the quadtree. This approximates the geographic clustering most regions & systems will naturally have, thus doing a good job of minimizing the distance between the minimum & maximum HGVertex pointers in each region or system.

A TMBitset, you may recall, stores 1 bit for every <item> in the system, whether it's in the set or not. This leads to oceans of "dead space" of all 0s in the bit array, before the first & after the last relevant 1 bits. After a TMBitset is fully populated, calling shrink_to_fit will do what it says on the tin, leaving the bit array to refer only to the slice of the parent <item> array containing all the objects in our set.
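A rough sketch of what trimming could look like, under the assumption that the bit array is stored as 64-bit words. `TrimmedBits` and `shrink_to_fit_sketch` are hypothetical names for illustration, not the planned TMBitset API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical trimmed bitset: `words` covers only the slice of the
// parent array that starts at element index first_word * 64.
struct TrimmedBits {
    std::size_t first_word = 0;
    std::vector<std::uint64_t> words;

    // Membership test by index into the full parent array.
    bool test(std::size_t i) const {
        std::size_t w = i / 64;
        if (w < first_word || w - first_word >= words.size()) return false;
        return (words[w - first_word] >> (i % 64)) & 1;
    }
};

// Drop the all-zero words before the first and after the last set bit.
TrimmedBits shrink_to_fit_sketch(const std::vector<std::uint64_t>& full) {
    std::size_t lo = 0, hi = full.size();
    while (lo < hi && full[lo] == 0) ++lo;      // skip leading dead space
    while (hi > lo && full[hi - 1] == 0) --hi;  // skip trailing dead space
    TrimmedBits t;
    t.first_word = lo;
    t.words.assign(full.begin() + static_cast<std::ptrdiff_t>(lo),
                   full.begin() + static_cast<std::ptrdiff_t>(hi));
    return t;
}
```

Trimming at word granularity keeps the index arithmetic simple; a real implementation might trim at a different granularity.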

A few benefits:

The master vertex array isn't 100% straightforward though. The vertices are separated into hi_priority_points followed by low_priority_points to get the best results from the Straightforward_intersection label simplification routine. This means our bitsets will also have a 3rd huge block of 0 bits in the middle. No big deal -- some pretty simple tweaks to WaypointQuadtree::graph_points and HighwayGraph::simplify will let us have the best of both worlds, both simplifying the labels in the desired order and storing the HGVertex objects in the order their Waypoints come out of the quadtree. This will push the vast majority of this dead space out to the beginning & end of each bitset, improving the optimization potential of TMBitset::shrink_to_fit.
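One way to decouple processing order from storage order, sketched with hypothetical names (`Point`, `processing_order`): keep the objects stored in quadtree order and build a separate index list that visits the high-priority ones first. This is only an illustration of the general idea; the actual tweaks to WaypointQuadtree::graph_points and HighwayGraph::simplify may look nothing like it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a graph point; `stored` below is assumed to
// already be in the order the points come out of the quadtree.
struct Point {
    bool hi_priority;
};

// Returns indices into `stored`: all high-priority points first, then the
// rest, each group keeping its quadtree order. Storage is left untouched.
std::vector<std::size_t> processing_order(const std::vector<Point>& stored) {
    std::vector<std::size_t> idx(stored.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::stable_partition(idx.begin(), idx.end(),
                          [&](std::size_t i) { return stored[i].hi_priority; });
    return idx;
}
```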

Between all the dead space at the beginning, middle & end, some quick-n-dirty coding & spreadsheeting suggests the regional sets are 95% dead space and the system sets are 89% dead space. So, good potential for improvement here. How much will this all actually help out in the end? Remains to be seen. The TMBitsets themselves are a small amount of the data accessed during graph generation.
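For reference, a small sketch of one way to measure reclaimable dead space in a single set. Note the assumption: it counts only the zeros before the first and after the last set bit (what trimming could drop), which is not necessarily how the percentages above were computed. `dead_space_fraction` is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Fraction of a bit array lying outside the [first set bit, last set bit]
// span. Returns 1.0 for a set with no members and 0.0 for an empty array.
double dead_space_fraction(const std::vector<bool>& bits) {
    if (bits.empty()) return 0.0;
    const std::size_t none = bits.size();
    std::size_t first = none, last = 0;
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i]) {
            if (first == none) first = i;  // remember the first member
            last = i;                      // keep updating the last member
        }
    }
    if (first == none) return 1.0;
    const std::size_t live = last - first + 1;
    return static_cast<double>(bits.size() - live) /
           static_cast<double>(bits.size());
}
```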

Edges behave a little differently. First, we have the simple edges, created in system->route->segment order, not so closely related to the quadtree. Followed by all the collapsed/traveled edges, which are created in roughly quadtree order, based on the hidden vertex the edges are collapsed around. Gaps in the 2nd category will be taken care of by the vertex reordering tweaks mentioned above, but this still leaves a gap between the simple and collapsed/traveled edges. Again, no big deal. Edges are initially created in a std::list, then moved to a TMArray for permanent storage to allow TMBitset to work its magic. We could either use std::list::sort on the former or std::sort on the latter, whichever performs best. Use whatever sort criterion yields the most compact results; could be something as simple as lhs.vertex1 < rhs.vertex1.
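The two sorting options can be sketched as follows. `Vertex`, `Edge` and the helper names are hypothetical stand-ins, a `std::vector` stands in for TMArray, and the comparator is just the simple `vertex1` criterion floated above. `std::less` is used for the pointer comparison so the ordering is well-defined even if the pointers don't all come from one array.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <list>
#include <vector>

struct Vertex { int id; };  // hypothetical stand-in for HGVertex
struct Edge {               // hypothetical stand-in for HGEdge
    const Vertex* vertex1;
    const Vertex* vertex2;
};

// Order edges by the address of their first vertex.
inline bool by_vertex1(const Edge& lhs, const Edge& rhs) {
    return std::less<const Vertex*>{}(lhs.vertex1, rhs.vertex1);
}

// Option 1: sort while the edges are still in the list (relinks nodes,
// no element copies).
void sort_in_list(std::list<Edge>& edges) {
    edges.sort(by_vertex1);
}

// Option 2: move to contiguous storage first, then sort there.
std::vector<Edge> move_then_sort(std::list<Edge>& edges) {
    std::vector<Edge> out(edges.begin(), edges.end());
    edges.clear();
    std::sort(out.begin(), out.end(), by_vertex1);
    return out;
}
```

Which option wins is an empirical question; timing both on real data would settle it.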


Aside from this, I'll continue experimenting with other cache/memory locality/bandwidth ideas noted upthread & elsewhere on GitHub. From what I've seen so far however, effects are likely to be fairly minimal, probably lower impact than TMBitset::shrink_to_fit. Nonetheless, I'll keep exploring. On the CentOS front, lab2 took 3.07 s writing graph files at 5 threads, tantalizingly close to breaking the 3-second barrier. Unfortunately, I don't have much optimism left for FreeBSD.