TravelMapping / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project

{fmt}: #585 phase 1 take 2 #628

Closed by yakra 6 months ago

yakra commented 6 months ago

Closes #626. This reverts & re-does #623. The slowdown is gone, replaced with the speedup it really should have been.

Like before, sprintf has been removed from user log, DB file, and graph generation routines. (And now, routedatastats.log & stats CSVs too.) This time, it's been replaced with calls to the {fmt} library instead of using iostreams for double formatting.

As with #622, I recommend downloading and doing a compile & test drive before merging.


As noted before, we're nearing the end of the road for big graph generation improvements. Percentagewise, this is in line with the first big five pull requests from 2019. It's about middle-of-the-pack, doesn't beat all of them hands-down. (#623 did, but only in the unrealistic scenario of writing to invalid files, e.g. /dev/null.) While I still have a few ideas left, there probably won't be anything else of this magnitude. Just smaller incremental changes, much lower impact. Although, now that we know formatting numbers is a hotspot, we can add a couple more items to that list:

yakra commented 6 months ago

userlogs

Here are links to charts of how the last three performance improvements played out. For FreeBSD in particular:

#526 helped out quite a bit at the high end, though a considerable bottleneck still remained.

#592 brought the low end way down, only to run right back into the same bottleneck.

#607 didn't have much effect.

disk-User Here's an alternate chart of performance writing to /dev/null, if you want to compare to the old chart at https://github.com/TravelMapping/DataProcessing/pull/623#issuecomment-2119357092. Yes, this takes longer now, but /dev/null is not a real-world scenario. Lies, damn lies, and benchmarks! ;)

This finally brings FreeBSD's time under control across the board, even outperforming both Linux flavors on the same hardware.

yakra commented 6 months ago

database

Table speed was measured using siteupdateST. Multi-threaded siteupdate doesn't print timestamps, as the DB is written in the background during graph generation. Sure, this will make the DB take a bit longer as it competes for resources like RAM & disk access, but it's a big win when we get graphs done at the same time.


routes disk-routes Alternate charts of writing to /dev/null:

These patterns will more or less hold with the rest of the tables too.


connectedRoutes disk-connectedRoutes /dev/null flavored: 0.15s & 0.24s


waypoints disk-waypoints /dev/null: 3.2 & 4.0 Still a decent speedup this time; we're down >50%. Same with the remaining individual tables.


overallMileageByRegion disk-overallMileageByRegion /dev/null: 1.1 & 1.2 Note the Y axis in milliseconds, and that lab1 & lab5 showed up as writing to disk faster than /dev/null. With a task this lightning-fast, any noise in the measurements is going to be a big piece of the pie. Still a very good speed-up percentage-wise, not too far behind the lies & damn lies in https://github.com/TravelMapping/DataProcessing/pull/623#issuecomment-2119357810.


systemMileageByRegion disk-systemMileageByRegion Milliseconds again. /dev/null flavored: noise in the data again for lab4's "after" measurement, which appears faster than #623's version even though that version did no float formatting at all.


clinchedOverallMileageByRegion disk-clinchedOverallMileageByRegion /dev/null: 0.045 & 0.050


clinchedSystemMileageByRegion disk-clinchedSystemMileageByRegion /dev/null: 0.09 & 0.11


clinchedConnectedRoutes disk-clinchedConnectedRoutes /dev/null: 0.55 & 0.90


clinchedRoutes disk-clinchedRoutes /dev/null: 0.65 & 1.00 BiggaTomato is the one exception to "after" times <50% of "before" times.


entire .sql file disk-db /dev/null: 6.4 & 13.5 Total time for all tables, including those affected by these changes.

yakra commented 6 months ago

graph generation

Again, note that the .sql file is written in the background during graph generation. This means that while n threads are dedicated to graph generation, the computer is running n+1 threads overall, at least until the DB is finished or the subgraph threads start returning, whichever happens first. Speeding up this background task gives us more resources for graph generation, which of course itself is also sped up by sprintf's removal.

disk-Graphs_3282-2_112 /dev/null flavored: 110 & 112. The former is most comparable to the corresponding images from #622 and #623. FreeBSD's times are finally starting to get under control; only the "before" line towers above all the rest like Mount Katahdin, although the "after" line is still considerably slower than the others. At least it scales up to a large number of threads, if slowly.

disk-Graphs_3282-2_30

Zoomed in. Writing to /dev/null (25 & 30), we see the big performance gap between Linux & FreeBSD disappear, with bsdlab even outperforming lab3 (CentOS7) and lab4 (Ubuntu 18.04) on the same hardware. This shows us that whatever makes FreeBSD so slow happens when working with ofstreams. Maybe the potential improvements listed in the OP could help bring the times down some more.

yakra commented 6 months ago

alternate universe

With the merge commit in the history, 69545d1 can show us...

What if sprintf were removed first, instead of doing #622?

disk-Graphs_6954-1_112 Again, only bsdlab's "before" line is way above the rest. The removal of sprintf was the magic bullet we needed to get the task to scale well.

/dev/null flavored: 110 & 112. The former is most comparable to the corresponding image from #622.

disk-Graphs_6954-1_35 Zoomed in. /dev/null flavored: 25 & 35. The former is most comparable to the corresponding images from #622 and #623. A much bigger impact than #622 on its own. FreeBSD is already performing better than Linux. It'd appear that applying this change to #622 has a bigger impact than applying it to fd96ac6. ...Most of the time.

And if #622 had been applied after this?

disk-Graphs_6954-2_28 This gives a clearer picture of the benefits of using TMBitsets under more optimal circumstances. Especially with FreeBSD, which actually has a meaningful delta now.

/dev/null flavored:


Meh, I guess it would be more useful to provide side-by-side comparisons for performance writing to disk, instead of that /dev/null rubbish.

yakra commented 6 months ago

That does it for redoing #623. A couple more tasks that weren't part of #623 have been converted from sprintf to {fmt}.

routedatastats.log

disk-rdST The /dev/null-flavored chart makes it look like FreeBSD loses a lot of efficiency when properly writing files, but 0b29 had a couple of outliers of 0.1990 s & 0.2112 s, while the other 8 runs measured 0.1314 s or lower.

This task takes longer with threaded siteupdate than siteupdateST, even though it only uses a single thread in either case. The code is the same in both versions. disk-rdMT /dev/null flavored. Could it be that the overhead of managing threads in one task (computing stats) causes more cache misses in the next?

yakra commented 6 months ago

stats CSVs

Stats CSVs have a few extra factors at play, and deserve a slightly different look.

Before, these try/catch blocks slowed down CSV generation far more than sprintf did. In the very frequent cases where a traveler has no mileage in a given region, std::unordered_map::at would throw an exception, create a std::out_of_range object, unwind the stack, and whatever else. Very expensive. Replacing the try/catch blocks with std::unordered_map::find followed by an if/else accounted for, IIRC, about 93% of the time savings on BiggaTomato. Moral of the story is, throwing exceptions should be the exception, not the rule.

That exception overhead led to allbyregionactivepreview.csv alone taking longer than all per-system CSVs combined. disk-CsvST3282 (Except on BiggaTomato, where it took 99.65% as long. Close enough!) Since we have to wait for allbyregionactivepreview.csv no matter what, and using >1 thread for per-system CSVs won't help us finish any sooner, I took a different approach here and capped the number of threads at 3:

  1. allbyregionactiveonly.csv
  2. allbyregionactivepreview.csv
  3. all per-system CSVs

At 2+ threads, the allbyregion* ones called StatsCsvThread when finished, to divide & conquer (what's left of) the per-system files.

Due to a bug in a benchmarking script, I thought all but the fastest machines performed worse multithreaded (only lab3 & 4 did), so I made multi-threaded CSVs a commandline option, with single-threaded the default behavior. Ha. Garbage in, gospel out.

Now, allbyregionactiveonly & allbyregionactivepreview take a fraction of the time that all the per-system files do. disk-CsvST0b29_1 40 (same thing, zoomed in)

With this in mind, it makes sense to use >3 threads now, so that restriction has been lifted. The task scales up more gracefully, so I've made multi-threaded the default & single-threaded optional.

Multi-threaded

disk-CsvMT_12 00 The "before" lines level off at 3 threads due to the 3-thread cap, regardless of what Args::numthreads is. Bsdlab really suffers from sprintf and throwing exceptions (I haven't looked into which had the bigger effect). Its line is less "level" and more jagged; when bsdlab really slows down, there's always a lot of noise & unpredictability in the data. Same, to a lesser extent, for lab4 on the same hardware. It took bsdlab significantly longer to write to /dev/null? I have no explanation for that.

disk-CsvMT_2 20 Zoomed in. I like how the "after" lines are all squished down at the bottom so we can't see any detail. That's a Good Thing. :grin: (What's lab2 doing? I'll get to that in a bit.) Toma does better at 2 threads, then worse at 3. Could be the transition from physical to logical cores. Lab3 & lab4, with their slower clock & RAM speed, and BiggaTomato, with a higher clock but slower RAM & older architecture, lag behind. What's up with the big spike from 1 to 2 threads? Don't know. I'm skeptical that creating 2 std::thread objects, interfacing with libstdc++.so & whatever goes on under the hood could take up that much time. Besides, 0b2982b doesn't have that problem. I'll chalk it up to noise in the data. Writing to /dev/null doesn't make a big difference here; we're spending most of our time throwing exceptions.

disk-CsvMT_0 23 Zoomed in to just the new version, to see how it scales. In theory, processing time should bottom out at 7 or 8 threads depending on machine, based on allbyregionactivepreview.csv's share of the total time. Toss enough threads at the system files, and more don't help; we still have to wait for allbyregionactivepreview. What we see beyond those 7 or 8 threads should just be noise, or a task gradually becoming less efficient as it takes on more threads than its "sweet spot", like we see sometimes. This is visible with lab3 & lab4, and bsdlab if we look at /dev/null.

What's lab2 doing? Looks like disk access is the bottleneck, and it's experiencing the same symptom that made me start using /dev/null for benchmarks a few years ago. A few different runs of the same (or mostly the same) commit & we get widely different results: disk-lab2

Cherrypick the nicest-looking line, plop it into the previous chart, and it still lags behind the other machines. In this case, the /dev/null results could in some sense be considered more meaningful. Speaking of which, what's with lab4's weird bump up from 7-10 threads? No idea. But it was consistent through 3 separate runs. Lab5 does the same thing (both it and lab4 run Ubuntu) if I give it that many threads, even though its hardware only supports 8. :shrug:

Edit: Meant to add one more image -- using a logarithmic scale lets us fit everything on the chart, visibly & well-separated: disk-CsvMT_logXY

yakra commented 6 months ago

Ready to rock, with the caveat that I'd encourage downloading, compiling & running on the Mac to make sure {fmt} & everything else check out.

jteresco commented 6 months ago

I can compile and run on my Mac, but I had to change the CXXFLAGS line to include the directory where MacPorts installed libfmt:

CXXFLAGS = -O3 -std=c++11 -isystem /usr/local/include -isystem /opt/local/include/libfmt10 -Wno-comment -Wno-dangling-else -Wno-logical-op-parentheses
jteresco commented 6 months ago

Full production site update in 24.1 seconds. Was 112.0 yesterday. Nice!

yakra commented 6 months ago

...with graph generation at 11.9 s. Not bad! FreeBSD's graph gen times have finally come down out of the stratosphere. I'd still love to get it more on par with Linux in the future, though :) but I'm satisfied for now, and can step away from it for a bit.

Edited https://github.com/TravelMapping/DataProcessing/pull/628#issuecomment-2148308135 to add a logarithmic chart of all machines, before & after. Meant to do that earlier. I hope to never make a pull request with this many images again. :)


Deleted branches on BiggaTomato:

rickmastfan67 commented 6 months ago

Does this mean I can b-slap you now, @yakra, till you do the new AR US-78 file? :P Just had to say it for fun. :)

yakra commented 6 months ago

That's certainly been on my mind as work on this dragged out. Honestly, I was curious if I'd see a pull request from someone else getting tired of waiting. :rofl:

That'll be my no.1 priority on getting back to the forum; I know that's a big one. (My no.1 priority overall right now is getting to the 24-hr laundromat. After that, sweating my head off in a restaurant kitchen. And then...)

Oy vey. I have so many browser tabs open. Firefox & Chromium here on the desktop, more Firefox on the laptop... And I'm sure there are many new topics too.

yakra commented 6 months ago

@rickmastfan67 I know you posted somewhere here on GitHub about US78. And I was like, "Noted." Sure, finding the TM & AARoads forum topics was easy enough, but I can't for the life of me find the GitHub thread, and it's driving me crazy. Any ideas?

Edit: Found it. https://github.com/TravelMapping/HighwayData/pull/7326#issuecomment-2048735379

rickmastfan67 commented 6 months ago

Must feel like 3000 years ago, eh @yakra? LOL!