Closed yakra closed 3 years ago
Multi-threaded vertex construction is implemented, but performance is a bit underwhelming; it doesn't scale super well. Efficiency peaks at 4 threads, then slows down. Pretty heavy use of mutexes here...
vertex_names
membership & insertions is the big one -- arrays of 256 sets & 256 mutexes indexed by the vertex name's final char allow threads to operate more independently.waypoint_naming_log
: Not much can be done here, but it's a quick move of a string&&
onto a list.Waypoint*
:HGVertex*
hash: ~Could work around it moderately easy but there's probably little to gain. Integers should hash pretty quick. :) Insert the vertex, and we're done. Enough time is probably spent elsewhere to not make this a very big deal.~ How about making a list<pair<Waypoint,HGVertex>>
per thread, and then inserting them all into the unordered_map once threads finish? AW YISS. Scales all the way to 16 threads on lab2 (lab3 results pending).branch | last commit | msg | comments |
---|---|---|---|
insertion | 9c297e8 | cwn2 | cherrypick into main branch manually recreate cherrypick into stash |
HGVertex (older) |
62765ec | set/mutex arrays | with original graph point lists instead of vectors cherrypick into main branch |
cwn2Asc | e680bdd | cwn2Asc | straightforward_concurrency component |
cwn2Aen | 4cbf3c9 | cwn2Aen | exit_number component |
HGVertex_scrap | 0e43c6d | minor CWN speedups | cherrypick -> 12a8ea8 |
HGVertex_old | 9f4794f | Python 3+ intersection bugfix |
final single-threaded revisions | BT | lab1 | lab1.5 | lab2 | lab3 |
---|---|---|---|---|---|
_master b155944c33c1b5b4470bf1b1b16a42a54fff79ea |
5.477 | 3.277 | 3.16 | 4.001 | 5.68 |
_insert | 5.255 | 3.166 | 3.115 | 3.893 | 5.664 |
_gp | 5.349 | 3.125 | 3.071 | 3.829 | 5.659 |
_func | 5.287 | 3.174 | 3.065 | 3.833 | 5.588 |
_cwn | 5.467 | 3.123 | 3.054 | 3.852 | 5.465 |
Wrong regex. Used 'unique|Creating edges'
; should have been 'Setting|Creating edges'
final single-threaded revisions
execs='_master _insert _gp _func _cwn';
echo -en '\n|| ';
echo $execs | sed 's~ ~ | ~g';
for e in foo $execs; do
echo -n '| -- ';
done;
echo '|';
for box in BT lab1 lab5 lab2 lab3; do
echo -en "| $box | ";
./table.sh $box sulogs/2021-02-17 'Setting|Creating edges' $execs
| tail -n 1 | sed "s~$t~ | ~g";
done 2>/dev/null
_master | _insert | _gp | _func | _cwn | |
---|---|---|---|---|---|
BT | 7.7440 | 7.5280 | 6.1650 | 6.0950 | 6.2740 |
lab1 | 4.4360 | 4.3330 | 3.6100 | 3.6490 | 3.6080 |
lab5 | 4.7700 | 4.7250 | 3.6560 | 3.6430 | 3.6350 |
lab2 | 5.3490 | 5.2440 | 4.4150 | 4.4080 | 4.4310 |
lab3 | 8.0290 | 8.1860 | 6.9630 | 6.7590 | 6.6380 |
squashed on BiggaTomato
exec suffix |
commit | msg | comments |
---|---|---|---|
_insert | 5b2d691 | vertex_names membership & insertions | • Avoid redundant membership checks + insertion. • Always attempt to insert. Save result; branch on whether successful. • Nix the good_to_go boolean. |
_gp | e5bf367 | simplification priority | Nix WaypointQuadtree::graph_points , waypoint_simplification_sort , waypoint_simplification_priority , and make separate unsorted lists of hi-priority & lo-priority points per TravelMapping#285. Build more efficiently: Skip devel systems entirely; remove redundant colcation checks. Python too. |
_func | 5e54c6c | separate simplify function | Refactor the actual loop that generates unique names and vertices, the one that got touched up 2 commits ago, into a separate function. |
_cwn | a765338 | CWN.new | • canonical_waypoint_name receives HighwayGraph* (C++) • remove extraneous colocation check • exit_number: remove extraneous list_entry_name() check, as name_no_abbrev() must match • reversed_border_labels: remove redundant variable • avoid unnecessary lookup of new straightforward_intersection name • Python too. |
_mtx | b11f25a | mutexes | Mutexes for vertex_names and waypoint_naming_log . |
_0l | de3f7e3 | threaded vertex construction | First attempt, with graph points interleaved across arrays of lists, similar to threaded WaypointQuadtree::sort. |
_0v | a153e68 | graph point vectors | Reduce mutex collisions for per-region/system vertex sets by storing graph points in 2 (hi/lo priority) vectors, and giving each thread a solid nth of the pie to munch on. |
_av | 6db7e64 | set/mutex arrays | Reduce mutex collisions for the vertex_names set with arrays of 256 sets & 256 mutexes indexed by the vertex name's final char. |
_wvp | 439604f | create w:v hash after threads finish | Reduce mutex collisions for the vertices hash table by storing pair<Waypoint*,HGVertex*> lists for each thread, and building hash table afterwards. |
_cwn2Aw | 6399f49 | cwn2Aw | strncmp instead of substr in label_references_route |
_cwn2ABC | 96920d9 | cwn2ABC | Replace substr calls in exit_number & straightforward_concurrency. |
_cb5be2f | cb5be2f | progress indicator fix | Before: *counter read/written by all threads with no mutex, resulting in too many dots. After: read/write only in thr[0] -- n times as frequently for 1/n of the data. |
_vmtP | 7937d3f | penultimate | delete[] wvp; terse Py array indexing |
_vmtU | accdcf6 | points->at(w) -> (*points)[w] | no need to check for exceptions |
ceaf43a | restore list_entry_name() check to CWN/exit_number | https://github.com/TravelMapping/DataProcessing/issues/285#issuecomment-787122430 | |
_en | 1ff7240 | exit_number CWN check for name_no_abbrev(##) | https://github.com/TravelMapping/DataProcessing/issues/285#issuecomment-787122430 |
_hgvR | 12a8ea8 | minor CWN speedups | cherrypick 0e43c6d3836b3df2a057271c5eae40ca0f9ce511 |
9f4794f | Python 3+ intersection bugfix |
#ifdef
s to remove the Building waypoint->vertex hash table.
step from siteupdateST. Considered; rejected. Actually slower!list_entry_name()
check removed from Exit_number
in a765338
was not actually extraneous, as we're matching an entire string & not just the beginning. Double-check that no results changed... Nothing changed. Still, should probably be put right back in for safety's sake, however rare its occurrence may actually be.
Route::match_with_or_without_abbrev
function or whatever
Where else can it be used? Where else do we match both list_entry_name()
and name_no_abbrev()
? Just here.ap_coloc[try_as_match]->label == ap_coloc[try_as_exit]->route->name_no_abbrev() + '(' + ap_coloc[try_as_exit]->label + ')'
NICE! Changes 256 more Keep_failsafe
cases to Exit_number
.Pending changes will alter the order waypoints are fed into canonical_waypoint_name(), resulting in diffs such as:
diff <(sort ST/logs/waypointsimplification.log) <(sort logs/waypointsimplification.log) | head -n 8
12639c12639
< Appended region: A10@2|PRT
---
> Appended region: A10@2|ESP-NC
12641c12641
< Appended region: A10@5|PRT
---
> Appended region: A10@5|ESP-NC
diff <(sed -r 's~(Appended region: .+)\|.+~\1~' ST/logs/waypointsimplification.log | sort) <(sed -r 's~(Appended region: .+)\|.+~\1~' logs/waypointsimplification.log | sort)
176144c176144
< Straightforward_intersection: A1Lon@B100&B100@A1 -> A1Lon/B100 (A1/B100 already taken)
---
> Straightforward_intersection: A1Lon@B100&B100@A1 -> A1/B100
176189c176189
< Straightforward_intersection: A1War@B100&B100@A1 -> A1/B100
---
> Straightforward_intersection: A1War@B100&B100@A1 -> A1War/B100 (A1/B100 already taken)
Certainly not a big deal in & of itself. Going forward, vertex names will be the same each time with siteupdateST & siteupdate.py. So far so good.
Multithreaded, points aren't processed in a deterministic order. There will be diffs such as these between successive runs.
A new option of siteupdate -v
or siteupdate --st-vertices
forces single-threaded waypoint simplification, resulting in the same vertex names each time. Useful for development.
@jteresco, what's more useful to you & your students? If preferred, I can easily make single-threaded waypoint simplification the default behavior, and change the --st-vertices
option to --mt-vertices
. The time difference is only about 1.7 s on BiggaTomato, lol. Sure to be even smaller on faster machines.
Catching up on some things. I think it's not that urgent to maintain consistency when one waypoint label gets simplified and the other cannot, but it seems like it's easy enough to do and probably should be done to avoid potential confusion.
it seems like it's easy enough to do and probably should be done to avoid potential confusion.
OK. I'll make single-threaded simplification the default, with -v
or --mt-vertices
optional.
Waypoint::devel()
vice w->route->system->devel()
hurt performance?~ Deferred. Keeping the (modified) original graph_points
function. Any changes to the internals of is_or_colocated_with_active_or_preview()
can happen later.I became too obsessed with achieving a squeaky-clean commit history, wound up with 4 different branches taking slightly different development paths toward the same final product, then got a bit burned out & distracted by other things for almost 2 months.
Time to remember where I was in the process, clean up, and pick up where I left off. I hope to strike a balance between... • a commit history that's easy to follow • a commit history that's clean (a little overlap though not the same) • maintaining my sanity
@jteresco wrote:
I think it's not that urgent to maintain consistency when one waypoint label gets simplified and the other cannot, but it seems like it's easy enough to do and probably should be done to avoid potential confusion.
@yakra wrote:
OK. I'll make single-threaded simplification the default, with
-v
or--mt-vertices
optional.
This is also the most convenient for me from a development standpoint. Multi-threaded waypointsimplification.log files aren't DIFFable without some heavy preprocessing. Single-threaded files can be compared more easily.
To filter out the diffs caused by multithreading:
diff \
<(sed -r \
-e 's~\|.*~~' \
-e 's~.*: ~~' \
-e 's~^(.+)([A-Za-z]{0,3})@(.+)([A-Za-z]{0,3})(_?[A-Za-z]{0,4})&\3([A-Za-z]{0,3})@\1\2(_?[A-Za-z]{0,4}) -> \1\2\7/\3\6\5 \(\1\7/\3\5 already taken\)$~\1\2@\3\4\5\&\3\6@\1\2\7 -> \1\7/\3\5~' \
-e 's~^(.+)([A-Za-z]{0,3})@(.+)([A-Za-z]{0,3})(_?[A-Za-z]{0,4})&\3([A-Za-z]{0,3})@\1(_?[A-Za-z]{0,4}) -> \1\2\7/\3\6\5 \(\1\7/\3\5 already taken\)$~\1\2@\3\4\5\&\3\6@\1\7 -> \1\7/\3\5~' \
ST/logs/waypointsimplification.log \
| sort) \
<(sed -r \
-e 's~\|.*~~' \
-e 's~.*: ~~' \
-e 's~^(.+)([A-Za-z]{0,3})@(.+)([A-Za-z]{0,3})(_?[A-Za-z]{0,4})&\3([A-Za-z]{0,3})@\1\2(_?[A-Za-z]{0,4}) -> \1\2\7/\3\6\5 \(\1\7/\3\5 already taken\)$~\1\2@\3\4\5\&\3\6@\1\2\7 -> \1\7/\3\5~' \
-e 's~^(.+)([A-Za-z]{0,3})@(.+)([A-Za-z]{0,3})(_?[A-Za-z]{0,4})&\3([A-Za-z]{0,3})@\1(_?[A-Za-z]{0,4}) -> \1\2\7/\3\6\5 \(\1\7/\3\5 already taken\)$~\1\2@\3\4\5\&\3\6@\1\7 -> \1\7/\3\5~' \
logs/waypointsimplification.log \
| sort)
This is another one of those cases where too big of a mutex footprint kills performance.
71c5200
was the first attempt at multithreading right out the gate.7bae030
reduces competition for the set of unique vertex names by breaking it up into an array of 256 sets of unique names with 256 mutexes , indexed by the name's final character.311e2d0
removes the need for a mutex in building the waypoint->vertex hash table. Instead of adding to the table during threaded operation, we keep numthreads lists of <waypoint,vertex> pairs, and insert them when finished with the threaded bits.
Single-threaded, it takes less wall time when we don't have all those threads fighting over the same mutex. The time saved is greater than the cost of building the hash table as its own task + the overhead of building the temporary structures we need.71c5200
. I guess that without all the threads spending a lot of time hung up in the hash table mutex, they have a lot more opportunity to fight over the one mutex left over, leading to performance degradation.Two more possibilities for reducing the impact of mutexes:
vertex_names[256]
set array, e.g. middle character.
@
, &
and /
. Pro: more array indices. Con: There's potential to be a lot. Outweighing other characters' frequency?How much impact from...
^ Re the 2nd bullet point, ignoring the 1st:
Another CPU vs RAM trade-off, like what we saw in #404 only much less pronounced.
Alternatives that perform better with fewer threads perform worse with more, and vice versa.
5d22a4
(the darker solid lines) has some added cost in creating temporary containers and then iterating through them.
This is more than made up for by not having to lock and unlock mutexes all the time, which comes with its own computational costs even when there's nothing to lock, no risk of collisions. Hence the decent speedup even at 1 thread.
However, building these lists is more memory intensive. This is fine with relatively few threads needing bandwidth, but once we add in a load of threads and start saturating the memory bus, performance starts to lag a wee bit.
To implement, or not?
Thinking about multi-threading the process of Creating unique names and vertices...