yakra commented 4 years ago

Thinking about multi-threading the process of Creating unique names and vertices...

One single list of graph points sorted by name priority, as done since #268, is not sufficient.
With multiple threads iterating through this list across the boundary from high- to low-priority points, it's possible for a low priority point to get processed first. There's potential, however small, for wacky antics.
We're better off with 2 separate lists of points, which can be processed in 2 separate multi-threaded passes.
...Which is faster anyway. All we need is for the high priority points to be done before the low priority points; we don't care about the order within these 2 categories. Sorting everything, as done now, is overkill. We do need to get the priority value (if I really wanna get jiggy wit it, I can order the criteria from most to least frequently encountered, and save... probably an imperceptible amount of time), but don't need to spend time comparing tons of point pairs and performing the actual sort.
- To that end, if future changes to waypoint simplification ever require >3 priority levels, expanding on the binary predicate required by a C++ sort would be a pointless PitA compared to an integer function as implemented in Python.
This improvement can also speed up the Python version.
It also opens the door (but not very wide?) to a workable solution to #211

yakra commented 3 years ago

Multi-threaded vertex construction is implemented, but performance is a bit underwhelming; it doesn't scale super well. Efficiency peaks at 4 threads, then slows down. Pretty heavy use of mutexes here...

[x] vertex_names membership & insertions is the big one -- arrays of 256 sets & 256 mutexes indexed by the vertex name's final char allow threads to operate more independently.
waypoint_naming_log: Not much can be done here, but it's a quick move of a string&& onto a list.
[x] Waypoint*:HGVertex* hash: ~Could work around it moderately easy but there's probably little to gain. Integers should hash pretty quick. :) Insert the vertex, and we're done. Enough time is probably spent elsewhere to not make this a very big deal.~ How about making a list<pair<Waypoint,HGVertex>> per thread, and then inserting them all into the unordered_map once threads finish? AW YISS. Scales all the way to 16 threads on lab2 (lab3 results pending).
[x] Inserting vertices into per-region & per-system sets has the best potential for improvement. The way waypoints are interleaved across the threads, much of the time all threads will be working on the same region and same system, creating lots of opportunity for collisions. Store points in vectors instead of lists, and give each thread a solid nth of the pie to munch on. Edit: A very slight improvement. Also underwhelming.

yakra commented 3 years ago

dead branches on BiggaTomato

branch	last commit	msg	comments
insertion	9c297e8	cwn2	cherrypick into main branch manually recreate cherrypick into stash
HGVertex (older)	62765ec	set/mutex arrays	with original graph point lists instead of vectors cherrypick into main branch
cwn2Asc	e680bdd	cwn2Asc	straightforward_concurrency component
cwn2Aen	4cbf3c9	cwn2Aen	exit_number component
HGVertex_scrap	0e43c6d	minor CWN speedups	cherrypick -> 12a8ea8
HGVertex_old	9f4794f	Python 3+ intersection bugfix

yakra commented 3 years ago

[x] Don't forget to fix the progress indicator

yakra commented 3 years ago

final single-threaded revisions	BT	lab1	lab1.5	lab2	lab3
_master b155944c33c1b5b4470bf1b1b16a42a54fff79ea	5.477	3.277	3.16	4.001	5.68
_insert	5.255	3.166	3.115	3.893	5.664
_gp	5.349	3.125	3.071	3.829	5.659
_func	5.287	3.174	3.065	3.833	5.588
_cwn	5.467	3.123	3.054	3.852	5.465

Nope.

Wrong regex. Used 'unique|Creating edges'; should have been 'Setting|Creating edges'

yakra commented 3 years ago

final single-threaded revisions

execs='_master _insert _gp _func _cwn';
echo -en '\n|| ';
echo $execs | sed 's~ ~ | ~g';
for e in foo $execs; do
 echo -n '| -- ';
done;
echo '|';
for box in BT lab1 lab5 lab2 lab3; do
  echo -en "| $box | ";
  ./table.sh $box sulogs/2021-02-17 'Setting|Creating edges' $execs
  | tail -n 1 | sed "s~$t~ | ~g";
done 2>/dev/null

_master	_insert	_gp	_func	_cwn
BT	7.7440	7.5280	6.1650	6.0950	6.2740
lab1	4.4360	4.3330	3.6100	3.6490	3.6080
lab5	4.7700	4.7250	3.6560	3.6430	3.6350
lab2	5.3490	5.2440	4.4150	4.4080	4.4310
lab3	8.0290	8.1860	6.9630	6.7590	6.6380

yakra commented 3 years ago

squashed on BiggaTomato exec suffix	commit	msg	comments
_insert	5b2d691	vertex_names membership & insertions	• Avoid redundant membership checks + insertion. • Always attempt to insert. Save result; branch on whether successful. • Nix the `good_to_go` boolean.
_gp	e5bf367	simplification priority	Nix `WaypointQuadtree::graph_points`, `waypoint_simplification_sort`, `waypoint_simplification_priority`, and make separate unsorted lists of hi-priority & lo-priority points per TravelMapping#285. Build more efficiently: Skip devel systems entirely; remove redundant colcation checks. Python too.
_func	5e54c6c	separate simplify function	Refactor the actual loop that generates unique names and vertices, the one that got touched up 2 commits ago, into a separate function.
_cwn	a765338	CWN.new	• canonical_waypoint_name receives HighwayGraph* (C++) • remove extraneous colocation check • exit_number: remove extraneous list_entry_name() check, as name_no_abbrev() must match • reversed_border_labels: remove redundant variable • avoid unnecessary lookup of new straightforward_intersection name • Python too.
_mtx	b11f25a	mutexes	Mutexes for `vertex_names` and `waypoint_naming_log`.
_0l	de3f7e3	threaded vertex construction	First attempt, with graph points interleaved across arrays of lists, similar to threaded WaypointQuadtree::sort.
_0v	a153e68	graph point vectors	Reduce mutex collisions for per-region/system vertex sets by storing graph points in 2 (hi/lo priority) vectors, and giving each thread a solid nth of the pie to munch on.
_av	6db7e64	set/mutex arrays	Reduce mutex collisions for the `vertex_names` set with arrays of 256 sets & 256 mutexes indexed by the vertex name's final char.
_wvp	439604f	create w:v hash after threads finish	Reduce mutex collisions for the `vertices` hash table by storing `pair<Waypoint,HGVertex>` lists for each thread, and building hash table afterwards.
_cwn2Aw	6399f49	cwn2Aw	strncmp instead of substr in label_references_route
_cwn2ABC	96920d9	cwn2ABC	Replace substr calls in exit_number & straightforward_concurrency.
_cb5be2f	cb5be2f	progress indicator fix	Before: *counter read/written by all threads with no mutex, resulting in too many dots. After: read/write only in thr[0] -- n times as frequently for 1/n of the data.
_vmtP	7937d3f	penultimate	delete[] wvp; terse Py array indexing
_vmtU	accdcf6	points->at(w) -> (*points)[w]	no need to check for exceptions
	ceaf43a	restore list_entry_name() check to CWN/exit_number	https://github.com/TravelMapping/DataProcessing/issues/285#issuecomment-787122430
_en	1ff7240	exit_number CWN check for name_no_abbrev(##)	https://github.com/TravelMapping/DataProcessing/issues/285#issuecomment-787122430
_hgvR	12a8ea8	minor CWN speedups	cherrypick 0e43c6d3836b3df2a057271c5eae40ca0f9ce511
	9f4794f	Python 3+ intersection bugfix

yakra commented 3 years ago

[x] Consider using #ifdefs to remove the Building waypoint->vertex hash table. step from siteupdateST. Considered; rejected. Actually slower!
The list_entry_name() check removed from Exit_number in a765338 was not actually extraneous, as we're matching an entire string & not just the beginning. Double-check that no results changed... Nothing changed. Still, should probably be put right back in for safety's sake, however rare its occurrence may actually be.
- [x] Python: Just put the check back in, like before. The costs of string interning are already paid upfront, so no need to get fancy.
- [x] C++: Maybe a new Route::match_with_or_without_abbrev function or whatever Where else can it be used? Where else do we match both list_entry_name() and name_no_abbrev()? Just here.
[x] Exit_number: new check for ap_coloc[try_as_match]->label == ap_coloc[try_as_exit]->route->name_no_abbrev() + '(' + ap_coloc[try_as_exit]->label + ')' NICE! Changes 256 more Keep_failsafe cases to Exit_number.

yakra commented 3 years ago

Pending changes will alter the order waypoints are fed into canonical_waypoint_name(), resulting in diffs such as:

diff <(sort ST/logs/waypointsimplification.log) <(sort logs/waypointsimplification.log) | head -n 8

12639c12639
< Appended region: A10@2|PRT
---
> Appended region: A10@2|ESP-NC
12641c12641
< Appended region: A10@5|PRT
---
> Appended region: A10@5|ESP-NC

diff <(sed -r 's~(Appended region: .+)\|.+~\1~' ST/logs/waypointsimplification.log | sort) <(sed -r 's~(Appended region: .+)\|.+~\1~' logs/waypointsimplification.log | sort)

176144c176144
< Straightforward_intersection: A1Lon@B100&B100@A1 -> A1Lon/B100 (A1/B100 already taken)
---
> Straightforward_intersection: A1Lon@B100&B100@A1 -> A1/B100
176189c176189
< Straightforward_intersection: A1War@B100&B100@A1 -> A1/B100
---
> Straightforward_intersection: A1War@B100&B100@A1 -> A1War/B100 (A1/B100 already taken)

Certainly not a big deal in & of itself. Going forward, vertex names will be the same each time with siteupdateST & siteupdate.py. So far so good.

Multithreaded, points aren't processed in a deterministic order. There will be diffs such as these between successive runs. A new option of siteupdate -v or siteupdate --st-vertices forces single-threaded waypoint simplification, resulting in the same vertex names each time. Useful for development.

@jteresco, what's more useful to you & your students? If preferred, I can easily make single-threaded waypoint simplification the default behavior, and change the --st-vertices option to --mt-vertices. The time difference is only about 1.7 s on BiggaTomato, lol. Sure to be even smaller on faster machines.

jteresco commented 3 years ago

Catching up on some things. I think it's not that urgent to maintain consistency when one waypoint label gets simplified and the other cannot, but it seems like it's easy enough to do and probably should be done to avoid potential confusion.

yakra commented 3 years ago

it seems like it's easy enough to do and probably should be done to avoid potential confusion.

OK. I'll make single-threaded simplification the default, with -v or --mt-vertices optional.

[x] Meanwhile, I've stumbled across a faster way to iterate thru the initial single-threaded bit, which in turn orders the data so that the multi-threaded bit runs faster (less competition for system & region mutexes when creating their vertex sets).
[x] ~...Mutexes we don't need to use. These sets can be created in the same single-threaded loop that creates the waypoint->vertex hash table. The rewrite continues...~ Herp derp! These mutexes are useful -- there's >1 of them; 'sfaster to multithread.
~Would Waypoint::devel() vice w->route->system->devel() hurt performance?~ Deferred. Keeping the (modified) original graph_points function. Any changes to the internals of is_or_colocated_with_active_or_preview() can happen later.

yakra commented 3 years ago

I became too obsessed with achieving a squeaky-clean commit history, wound up with 4 different branches taking slightly different development paths toward the same final product, then got a bit burned out & distracted by other things for almost 2 months.

Time to remember where I was in the process, clean up, and pick up where I left off. I hope to strike a balance between... • a commit history that's easy to follow • a commit history that's clean (a little overlap though not the same) • maintaining my sanity

yakra commented 3 years ago

@jteresco wrote:

I think it's not that urgent to maintain consistency when one waypoint label gets simplified and the other cannot, but it seems like it's easy enough to do and probably should be done to avoid potential confusion.

@yakra wrote:

OK. I'll make single-threaded simplification the default, with -v or --mt-vertices optional.

This is also the most convenient for me from a development standpoint. Multi-threaded waypointsimplification.log files aren't DIFFable without some heavy preprocessing. Single-threaded files can be compared more easily.

To filter out the diffs caused by multithreading:

diff \
<(sed -r \
  -e 's~\|.*~~' \
  -e 's~.*: ~~' \
  -e 's~^(.+)([A-Za-z]{0,3})@(.+)([A-Za-z]{0,3})(_?[A-Za-z]{0,4})&\3([A-Za-z]{0,3})@\1\2(_?[A-Za-z]{0,4}) -> \1\2\7/\3\6\5 \(\1\7/\3\5 already taken\)$~\1\2@\3\4\5\&\3\6@\1\2\7 -> \1\7/\3\5~' \
  -e 's~^(.+)([A-Za-z]{0,3})@(.+)([A-Za-z]{0,3})(_?[A-Za-z]{0,4})&\3([A-Za-z]{0,3})@\1(_?[A-Za-z]{0,4}) -> \1\2\7/\3\6\5 \(\1\7/\3\5 already taken\)$~\1\2@\3\4\5\&\3\6@\1\7 -> \1\7/\3\5~' \
    ST/logs/waypointsimplification.log \
  | sort) \
<(sed -r \
  -e 's~\|.*~~' \
  -e 's~.*: ~~' \
  -e 's~^(.+)([A-Za-z]{0,3})@(.+)([A-Za-z]{0,3})(_?[A-Za-z]{0,4})&\3([A-Za-z]{0,3})@\1\2(_?[A-Za-z]{0,4}) -> \1\2\7/\3\6\5 \(\1\7/\3\5 already taken\)$~\1\2@\3\4\5\&\3\6@\1\2\7 -> \1\7/\3\5~' \
  -e 's~^(.+)([A-Za-z]{0,3})@(.+)([A-Za-z]{0,3})(_?[A-Za-z]{0,4})&\3([A-Za-z]{0,3})@\1(_?[A-Za-z]{0,4}) -> \1\2\7/\3\6\5 \(\1\7/\3\5 already taken\)$~\1\2@\3\4\5\&\3\6@\1\7 -> \1\7/\3\5~' \
    logs/waypointsimplification.log \
  | sort)

yakra commented 3 years ago

Mutexes and Scaling

Vert This is another one of those cases where too big of a mutex footprint kills performance.

71c5200 was the first attempt at multithreading right out the gate.
7bae030 reduces competition for the set of unique vertex names by breaking it up into an array of 256 sets of unique names with 256 mutexes , indexed by the name's final character.
311e2d0 removes the need for a mutex in building the waypoint->vertex hash table. Instead of adding to the table during threaded operation, we keep numthreads lists of <waypoint,vertex> pairs, and insert them when finished with the threaded bits. Single-threaded, it takes less wall time when we don't have all those threads fighting over the same mutex. The time saved is greater than the cost of building the hash table as its own task + the overhead of building the temporary structures we need.
The dashed lines at the very bottom represent the time building the waypoint->vertex hash table, the real "zero" for the lines above to asymptotically approach.
Not shown: interestingly, implementing the 2nd patch (post facto hash table construction) alone without the first patch (256 vertex name sets & mutexes) actually performs worse than 71c5200. I guess that without all the threads spending a lot of time hung up in the hash table mutex, they have a lot more opportunity to fight over the one mutex left over, leading to performance degradation.
An even better idea may be to get rid of the hash table altogether.

yakra commented 3 years ago

Two more possibilities for reducing the impact of mutexes:

Other alternatives for indexing the vertex_names[256] set array, e.g. middle character.
- Final char was assumed to be more efficient than first char, due to more diversity of final chars. But distribution will also be a factor.
- Middle char introduces the possibility of @, & and /. Pro: more array indices. Con: There's potential to be a lot. Outweighing other characters' frequency?
Eliminate per-region/system mutexes for vertex set insertions by writing to an array of ~forward_list~vectors for each region/system. Process these lists & insert into the sets afterwards.
- This post-processing ~can even itself~ should also be multi-threaded. Here, running multi-threaded with the occasional mutex conflict is faster than just running single-threaded.

How much impact from...

either option on its own?
both options together?

yakra commented 3 years ago

^ Re the 2nd bullet point, ignoring the 1st: Another CPU vs RAM trade-off, like what we saw in #404 only much less pronounced. Alternatives that perform better with fewer threads perform worse with more, and vice versa. Vert_set 5d22a4 (the darker solid lines) has some added cost in creating temporary containers and then iterating through them. This is more than made up for by not having to lock and unlock mutexes all the time, which comes with its own computational costs even when there's nothing to lock, no risk of collisions. Hence the decent speedup even at 1 thread. However, building these lists is more memory intensive. This is fine with relatively few threads needing bandwidth, but once we add in a load of threads and start saturating the memory bus, performance starts to lag a wee bit.

To implement, or not?

If I want fastest possible speed for siteupdate overall, I want to target the max # threads (right where no-build catches up) on lab2.
...which is also my production server. Production environment, memory will be at even more of a premium; with plenty of other processes needing bandwidth. I think this is what slows things down on noreaster.
69 more lines of bloaty code for minimal improvement.
I might just set this commit aside for a rainy day.
LOL what would happen to lab3's data if I replaced its ram with 2 sticks of 1066? Almost nothing, that's what. My memory bandwidth theory is bunk!

TravelMapping / DataProcessing

waypoint simplification sorting #285

dead branches on BiggaTomato

Nope.

Mutexes and Scaling