georust / geographiclib-rs

A port of geographiclib in Rust.

add integrated CI testing vs geographiclib C++ implementation #13

Closed · stonylohr closed this issue 3 years ago

stonylohr commented 4 years ago

Suggested by @michaelkirk, along with a few starting thoughts... Try to integrate @michaelkirk's validation tool (https://github.com/michaelkirk/validate_geographiclib) into https://github.com/georust/geographiclib-rs CI. There are lots of different approaches to performance work, and I'm not an expert, but the tools I found most helpful were:

Further thoughts from @stonylohr... While @michaelkirk didn't specify, I'm currently interpreting his suggestion to include both validation testing and performance testing. I don't currently interpret the suggestion to specifically endorse unit tests, integration tests, or a mix of the two (or something else) for validation testing of this sort. I read his use of the term "integrate" as being a casual use rather than referring specifically to integration testing. My default plan in this regard will be to play around with things and use whatever mix seems to come together most easily.

stonylohr commented 4 years ago

I plan to start on this task next, because I think work on any of the issues I submitted earlier would benefit from expanded validation and benchmark tests, to better assess their effect.

Additional thoughts or suggestions on this topic are welcome, but I also feel like I've got enough to run with as-is.

michaelkirk commented 4 years ago

Hi @stonylohr - I agree that it would be nice to get familiar with the validation process before making the other changes, to be more confident we're not breaking anything.

And thanks for opening the other issues - I'll take a look through them.

michaelkirk commented 4 years ago

All the issues you've proposed make sense to me. Please let me know if you have questions about validating against the reference data.

stonylohr commented 4 years ago

I've done an initial pass through the validation tool's code, data, etc. I see a few different approaches I could take, and I'd like to confirm your preferences.

My current best guess is that your suggestion calls for modifying geographiclib-rs's CI to import validate_geographiclib, and call it directly. By default, I'll try this first.

Another option would be to use validate_geographiclib as a template for testing, but make GeodSolve calls directly, rather than actually calling or referencing validate_geographiclib.

A middle path might involve importing some of the types and utilities from validate_geographiclib, but making calls to GeodSolve directly.
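
For concreteness, here's a minimal sketch of what making GeodSolve calls over stdio might look like. The binary name/path and the sample input are placeholders; GeodSolve's default (direct) mode reads "lat1 lon1 azi1 s12" per line and prints "lat2 lon2 azi2".

```rust
use std::io::{BufRead, BufReader, Write};
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // Assumes a GeodSolve binary is on the PATH.
    let mut child = Command::new("GeodSolve")
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    // Write one direct-problem case, then drop stdin so GeodSolve sees EOF.
    child
        .stdin
        .take()
        .unwrap()
        .write_all(b"40.64 -73.78 45.0 10000000\n")?;

    // Each output line is "lat2 lon2 azi2" for the corresponding input line.
    for line in BufReader::new(child.stdout.take().unwrap()).lines() {
        println!("GeodSolve => {}", line?);
    }
    child.wait()?;
    Ok(())
}
```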

Going further afield, we could also find or create tests for the various geographiclib ports (starting with C++ and rust) that log results and times in the same format, and compare results that way. Or further afield yet, write a set of C wrappers for various geographiclib C++ functions, and call those via rust's foreign function interface. Both of these approaches would likely allow access to a wider variety of functionality than is available through GeodSolve, and would likely involve less I/O overhead than going through GeodSolve. Despite these advantages, it feels like they would require a deeper dive into C and C++ code than I'm interested in at the moment. But I figured I'd mention them for completeness.
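
And a sketch of the FFI variant, for completeness. Everything here is hypothetical: the C shim `geod_direct_c` and the library name `geographiclib_shim` would have to be written and built separately; neither is part of geographiclib.

```rust
// Hypothetical C shim, compiled separately, with a signature like:
//   void geod_direct_c(double lat1, double lon1, double azi1, double s12,
//                      double* lat2, double* lon2, double* azi2);
#[link(name = "geographiclib_shim")]
extern "C" {
    fn geod_direct_c(lat1: f64, lon1: f64, azi1: f64, s12: f64,
                     lat2: *mut f64, lon2: *mut f64, azi2: *mut f64);
}

/// Safe wrapper: solve the direct problem via the C++ implementation.
fn cpp_direct(lat1: f64, lon1: f64, azi1: f64, s12: f64) -> (f64, f64, f64) {
    let (mut lat2, mut lon2, mut azi2) = (0.0, 0.0, 0.0);
    // Safety: the shim only writes through the three out-pointers.
    unsafe { geod_direct_c(lat1, lon1, azi1, s12, &mut lat2, &mut lon2, &mut azi2) };
    (lat2, lon2, azi2)
}
```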

michaelkirk commented 4 years ago

Going further afield, we could also find or create tests for the various geographiclib ports (starting with C++ and rust) that log results and times in the same format, and compare results that way.

This is sort of the route I went initially. See the bin/short-check and bin/long-check scripts, which run the same test cases through the cpp lib, the python lib, and our rust implementation (it's very crude, and there's nothing that can just be plugged into CI yet).

I did it this way because, at least initially, it seemed like the simplest way to ensure I was getting the correct results. In particular, there is some subtlety to the error computation, so I wanted to write something that was easily comparable to something written by Karney.

I based the error computations on Karney's own testing tool "GeodTest": https://sourceforge.net/p/geographiclib/code/ci/master/tree/tests/GeodTest.cpp

I've annotated the output of the bin/long-check script in a way that should help explain the output:

$ bin/long-check                                                                                                                                      

# The first output is from the GeodTest tool, written by Karney.
# I have just assumed it's correct, because I haven't written any of the code. 😅 
# Each row of output is in the format: component, max error encountered, test case which generated max error
# where component is like: 0 (error when computing position), 1 (error in computed azimuth), 2 (error in m12), etc.
# IIRC the error is in nanometers (when compiled with double precision).

0 10.00 111872 # <-- across all test cases, for the "direct geodesic position" scenario, test case 111872 produced the largest error - 10.00 nanometers.
1 9.16 274130
2 8.75 106186
3 7.45 1930
4 3.60 5079
5 11.45 343135
6 9.80 32385

real    0m12.803s
user    0m12.559s
sys     0m0.159s

# The next output is from validate_geographiclib (our tool), validating Karney's GeodSolve binary. 
# This is intended as a sanity check to ensure validate_geographiclib is calculating the same error as Karney's GeodTest (validating the validation tool, if you will). Note that GeodTest tallies 7 different error scenarios, but we only test the first 4.
# The calculated errors *should* match GeodTest exactly, but somehow they're off by a few hundredths of a nm.
# This could be an issue with our error calculation or an artifact of how we're formatting the output.

args: ["./target/debug/validate_geographiclib", "mjk/GeodSolve"]
position error (direct)  0 10.03 111872
azi error (direct)       1 9.16 274130
m12 error (direct)       2 8.72 106186
distance error (inverse) 3 7.45 1930

real    2m56.130s # Note: 13x slower than GeodTest
user    2m3.725s
sys     0m21.366s

# The next output is from validate_geographiclib (our tool), testing Karney's Python lib,
# which I wrapped into a GeodSolve.py executable. I don't know that this is valuable. Maybe it's interesting?
# Note that the calculated errors are slightly different from the cpp lib's. The implementations are different,
# so this seems reasonable to me, so long as the errors remain reasonably small like this.

position error (direct)  0 12.66 365938
azi error (direct)       1 9.87 274130
m12 error (direct)       2 9.09 106186
distance error (inverse) 3 11.18 125538

real    5m57.831s
user    2m5.258s
sys     0m23.972s

# The next output is from validate_geographiclib (our tool), testing our rust lib, which I wrapped
# into a geodsolve executable. This is the main thing we're probably interested in.

args: ["./target/debug/validate_geographiclib", "../geographiclib-rs/target/debug/geodsolve"]
position error (direct)  0 12.66 365938
azi error (direct)       1 8.91 255921
m12 error (direct)       2 8.75 106186
distance error (inverse) 3 7.45 1930

real    2m25.823s # <-- note that though this is a debug build, it's a little faster than Karney's. I assume that's due to I/O and the test harness, because as far as the actual geographiclib algorithms go, Karney's lib is still substantially faster than the rust code.
user    1m50.898s
sys     0m20.440s

Building a validation tool based on the existing GeodTest/GeodSolve tools was confidence-building, but it's much slower. At this point, I would be happy doing validation with direct calls rather than using stdio across processes if it's significantly faster.

michaelkirk commented 4 years ago

I guess I didn't really answer your question very clearly, but basically, I don't have a certain vision of what the best way forward would be, and it seems like you have a good sense of the tradeoffs, so follow what path seems best.

If you want more direction than that, personally my first approach would be:

  1. copy the error calculation code into the geographiclib-rs project.
  2. write a unit test that reads in the test cases, runs them through direct calls to the rust geographiclib code, and keeps track of the "largest error so far".
  3. hard-code the expected max error details into the test suite, and verify that it matches what was computed.

The full set of test cases is pretty large and probably not suitable to run by default when someone runs cargo test. Maybe a quick version using the small test set could run by default, with another test (gated by a feature or something) that is run by CI but not in the default test suite (akin to bin/short-check vs. bin/long-check). See the sketch below.
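
A minimal sketch of steps 2 and 3, assuming geographiclib-rs's `Geodesic::wgs84()` constructor and its `DirectGeodesic`/`InverseGeodesic` traits, a hypothetical fixture path, and a placeholder tolerance. The error metric here (geodesic distance between the computed and reference point 2) follows the spirit of GeodTest's position error rather than its exact computation:

```rust
use geographiclib_rs::{DirectGeodesic, Geodesic, InverseGeodesic};

// Placeholder: per step 3, this would be hard-coded from an initial run.
const MAX_ERR_METERS: f64 = 1e-8; // ~10 nm

#[test]
#[ignore] // the full data set is large; CI can run it via `cargo test -- --ignored`
fn direct_position_regression() {
    let geod = Geodesic::wgs84();
    // Hypothetical location for Karney's GeodTest.dat test data.
    let data = std::fs::read_to_string("test_fixtures/GeodTest.dat").unwrap();
    let (mut max_err, mut worst_case) = (0.0_f64, 0_usize);
    for (i, line) in data.lines().enumerate() {
        // GeodTest.dat columns: lat1 lon1 azi1 lat2 lon2 azi2 s12 a12 m12 S12
        let f: Vec<f64> = line
            .split_whitespace()
            .map(|tok| tok.parse().unwrap())
            .collect();
        let (lat2, lon2) = geod.direct(f[0], f[1], f[2], f[6]);
        // Error: geodesic distance from the computed point 2 to the reference.
        let err: f64 = geod.inverse(lat2, lon2, f[3], f[4]);
        if err > max_err {
            max_err = err;
            worst_case = i;
        }
    }
    assert!(
        max_err <= MAX_ERR_METERS,
        "max direct-position error {max_err} m at case {worst_case}"
    );
}
```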

michaelkirk commented 4 years ago

Hi @stonylohr - I was curious if you were still working on this and if you had any questions I could answer.

stonylohr commented 4 years ago

I am still working on it, though gradually. Despite my earlier comment that I wasn't feeling up to digging into the C++ much, that's roughly the approach I've taken. My thinking here is that the C++ version is pretty much the gold standard, especially in terms of performance, but I'm not convinced that using GeodTest makes for an ideal baseline, since timing it also ends up counting time spent processing inputs and writing outputs, and it only directly tests the high-level direct and inverse functions. I think that direct testing of the lower-level functions would be very helpful.

I modified a copy of the C++ source so that every function of interest logs its inputs and outputs when it runs. I then ran that through a set of tests that mostly run direct and inverse scenarios based on GeodTest-short.dat, but with a mix of ellipsoids rather than all wgs84, plus a set of scenarios adapted from the java and python test cases. I'm currently working on a C++ test harness that runs through the resulting files to produce modified copies to be used as a baseline for validating ported geographiclib versions. The original files aren't suitable because their times include the time spent logging (and my logging is pretty inefficient).

A secondary reason to reprocess the original set of files is that, if there's any change in value from logging numbers out and reading them back in, the baseline should go through that same process, rather than working from the original. I'm guardedly optimistic that this won't be an issue, since I'm logging values out to 17 decimal places, which I think should avoid a loss of precision.
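
As a quick check on that optimism: what matters for an exact round-trip is significant digits rather than decimal places, and 17 significant decimal digits are always enough to recover an f64 bit-for-bit. A few lines of Rust demonstrate it:

```rust
// {:.16e} prints 1 digit before the decimal point plus 16 after,
// i.e. 17 significant digits, which round-trips any f64 exactly.
fn main() {
    let x = std::f64::consts::PI / 7.0;
    let s = format!("{:.16e}", x);
    let y: f64 = s.parse().unwrap();
    assert_eq!(x.to_bits(), y.to_bits(), "round-trip changed the value");
    println!("{s} round-trips exactly");
}
```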

At my current rate of progress, I'm guessing it will be a couple of weeks before I wrap up my work on the C++ side. At that point, I'll switch to creating a rust counterpart to the reprocessing code, and build that into the CI.

michaelkirk commented 4 years ago

For validation purposes, sticking close to Karney's code seems reasonable. In fact, if you can get something working well, it may be worth seeing if it's something he'd like to incorporate into geographiclib proper (or the testing branch thereof).

stonylohr commented 4 years ago

By way of update, the week before last went roughly according to plan, but I wasn't able to spend much time on it this past week. I'm guardedly optimistic that I'll be able to make more time for it this coming week.

Once I wrap up my current draft on the C++ side, it might make sense for me to send you the C++ and data files. I'm not entirely sure where we'll want those to go. The data files are likely to be fairly large. The last run I did came out to about 100MB zipped. That size is likely to change a bit as I go, and I could do things like cut down my set of test cases to reduce the volume. Do you have thoughts on maximum reasonable file sizes, or whether you have a preferred way for me to get large files to you?

michaelkirk commented 4 years ago

Glad to hear you're still making progress. =)

The data files are likely to be fairly large.

For my understanding - are the data files only test cases? Or is there something more in the data files?

I also wanted to make sure you were aware of the official test cases, which were designed by Karney to cover edge cases in the algorithm: https://sourceforge.net/projects/geographiclib/files/testdata/

Though I don't recall if he covered any variety in ellipsoid shape or if they all assume wgs84.

stonylohr commented 4 years ago

The data files are test inputs and outputs from the C++ version of geographiclib.

I am familiar with the test cases you mentioned. While those do cover some edge cases, the GeodTest*.dat files there only include values for WGS84. My current test cases are based on calls generated by the following...

  1. Run all GeodTest-short.dat cases in direct form (point 1 to point 2).
  2. Run all GeodTest-short.dat cases in inverse form (point 1 to point 2).
  3. Run a sampling of GeodTest-short.dat cases using a mix of direct and inverse on a selection of non-WGS84 ellipsoids. Note that the values in GeodTest-short.dat can't be used to validate these results on their own, because those assume a WGS84 ellipsoid, but that shouldn't be problematic, since my primary goal is to validate rust results vs C++ results.
  4. Run all cases from Karney's tests in GeodesicTest.java. On cursory inspection, the python test cases appeared to be very similar, and possibly identical. I suspect he also has a C++ counterpart, but I haven't located it. Fortunately, it was easy to port to C++. These tests include some non-WGS84 cases, and even a handful of prolate spheroid cases.
  5. Run a set of edge cases on most 1-argument and 2-argument Math functions (called GeoMath in some versions, including rust).

I capture the underlying calls from each of these higher-level calls to generate a test data file for each function in Math, Geodesic, GeodesicLine, PolygonArea and Accumulator, with a few exceptions.

Karney also gives an example of running the direct form from point 2 to point 1, but I'm not running that variation or the inverse from point 2 to point 1. I don't know whether these would add any particular edge cases not covered by the point 1 to point 2 formulations. It wouldn't be difficult to add them, but it would increase data file size by roughly a factor of 2.

stonylohr commented 4 years ago

The good news since my last update is that I located the geographiclib C++ test routines. In more mixed news, I decided to go ahead and incorporate them now, rather than coming back for them in a later pass. My reasoning is that it will be quicker for me now, while the C++ side of things is still fresh for me, but it's a questionable decision in terms of scope creep.

I'm also adding an option to limit the number of tests per section, as a way of creating manageable-sized sample output files to share. Files created using this option will end up much less heavily weighted toward WGS84 tests.

stonylohr commented 3 years ago

I've had several delays in the past month or so, but I think I've finally got a reasonably good baseline for the C++ code. I'm attaching a zip file with the details. My next step will be to move on to creating rust counterparts, and building them into the CI system. I also plan to examine individual file contents more closely during this next stage and clean up bugs in my C++ baseline code that turn up as a result.

I'm not sure how, or even whether the various C++ baseline files belong in either this repo or the validate_geographiclib repo. I'm fine with whatever you think is appropriate. In their current form, they feel like too much of a hack to be good candidates for anywhere in the core geographiclib repo itself.

ProgressReport_2020-12-01.zip

michaelkirk commented 3 years ago

I'm not sure how, or even whether the various C++ baseline files belong in either this repo or the validate_geographiclib repo. I'm fine with whatever you think is appropriate. In their current form, they feel like too much of a hack to be good candidates for anywhere in the core geographiclib repo itself.

Since this tool isn't really specific to the rust project, it might make more sense in its own repository and we could pull it down as part of CI or something, but those details seem flexible and not super important to the direction of your work.

A diff of some kind would make it easier for me to comprehend the changes you've made vs. the files in that zip archive - most of that is Karney's code, right? Based on the readme, it seems like you've essentially logged the inputs/outputs of the intermediate functions, and presumably you'll have a way of comparing them across implementations. Right?

Looking forward to seeing what a more fleshed out integration looks like to get a better idea of how we could use it in CI.

stonylohr commented 3 years ago

Here's a fresh copy of the update, replacing the code with a diff against the latest Karney code for my formulation that aims to produce baseline values for results and performance. That formulation includes my baseline utility files in the examples folder, and makes everything public in the geographiclib hpp files it cares about, so that it can report values from usually-private variables.

This copy doesn't include the variation that adds logging in the core geographiclib files. I expect that I'll refresh that at some point as I work through the rust baseline, but I don't have the energy for it tonight.

You're exactly right about the comparison. My thought is basically to create rust code that uses the same format as the *.dat files in these zip files, and then logic to compare the rust values against the C++ values, and reject if they differ by too much. Initially, I'll probably just set the tolerances based on whatever the difference is on the first run. The idea would then be that if a change to the rust code increases the difference for anything, it would be flagged as a problem.

Similarly, I'll see about putting together something to at least report a summary of time differences between the rust and C++ implementations. I have a little more figuring out to do on how much to automate having this complain, since the timings depend on the specific machine being used, and on whatever other load the machine is dealing with at runtime.
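
A minimal sketch of that value comparison, assuming whitespace-separated f64 fields with one call per line (the file layout, the paths, and the absolute-difference tolerance are placeholders):

```rust
use std::fs;

/// Compare a rust-produced .dat file against the C++ baseline, rejecting if
/// any field differs by more than `tol` (absolute difference). For brevity
/// this ignores mismatched line/field counts, which a real check should flag.
fn compare_baselines(rust_path: &str, cpp_path: &str, tol: f64) -> Result<(), String> {
    let rust = fs::read_to_string(rust_path).map_err(|e| e.to_string())?;
    let cpp = fs::read_to_string(cpp_path).map_err(|e| e.to_string())?;
    for (lineno, (r_line, c_line)) in rust.lines().zip(cpp.lines()).enumerate() {
        for (r_tok, c_tok) in r_line.split_whitespace().zip(c_line.split_whitespace()) {
            let r: f64 = r_tok.parse().map_err(|_| format!("bad value, line {lineno}"))?;
            let c: f64 = c_tok.parse().map_err(|_| format!("bad value, line {lineno}"))?;
            if (r - c).abs() > tol {
                return Err(format!("line {lineno}: |{r} - {c}| exceeds {tol}"));
            }
        }
    }
    Ok(())
}

fn main() {
    // Paths are placeholders for the .dat files described above.
    match compare_baselines("rust_baseline/Geodesic_Direct.dat",
                            "cpp_baseline/Geodesic_Direct.dat", 1e-8) {
        Ok(()) => println!("within tolerance"),
        Err(msg) => eprintln!("regression: {msg}"),
    }
}
```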

ProgressReport_2020-12-02.zip

michaelkirk commented 3 years ago

Ah, thanks for the patch file.

Holy smokes - 10k lines! I'm not really sure where to start with it.

From the patch, it seems like the only change you've made to Karney's code is making methods public.

And some of the remainder seems pretty rote - just instrumentation to dump the fields from the various structures corresponding to all the methods.

To clarify - it seems like you're instrumenting more than just the functionality that we currently have in this project - e.g "CassiniSoldner" and "AlbersEqualArea" aren't anything we currently support in this lib (maybe one day though!).

There's a lot of code that looks like it's probably from the Karney test suite - but then you've left some comments too. Can you provide any more structure to the code? Breaking things into logical files/modules might add some clarity to the pieces. And if you put it into its own repo on GitHub rather than passing a zip archive around, it might be easier to reference specific parts.

You're exactly right about the comparison. My thought is basically to create rust code that uses the same format as the *.dat files in these zip files, and then logic to compare the rust values against the C++ values, and reject if they differ by too much. Initially, I'll probably just set the tolerances based on whatever the difference is on the first run. The idea would then be that if a change to the rust code increases the difference for anything, it would be flagged as a problem.

That makes sense to me - just a regression check for starters.

stonylohr commented 3 years ago

Right: Lots of volume, but really nothing very fancy.

I decided to go ahead and include support for everything that's covered (directly or indirectly) by the C++ tests, except for things that rely on the external gravity and magnetic data files for their tests. You're right that a lot of it isn't currently included in geographiclib-rs.

Your suggestion of setting up a separate repo for the instrumented geographiclib C++ sounds like a reasonable one. I'll look into that within the next week or two, but may allow myself a little time on the rust side first.

stonylohr commented 3 years ago

I've created a fork of the main geographiclib repo, with instrumentation and baseline code in a branch: https://sourceforge.net/u/alephknot/geographiclib/ci/instrumented-crude/tree/

It includes many of the format changes we discussed, and removes tests if there's an identical test within 100 lines. I expect those cases to have no value in terms of unit or integration testing, and limited value in terms of benchmarking. Between these and removal of tests for now-deprecated geographiclib operations (it switched from custom implementations to std library calls for some), the zipped output is down to about 130MB. Sadly, that's still too large for sourceforge's taste.
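
For illustration, here's a sketch of that dedup rule under one reading of it (drop a line when an identical line occurred among the previous 100); the exact rule in the branch may differ:

```rust
use std::collections::VecDeque;

/// Keep a test line only if no identical line appeared in the previous 100.
fn dedup_within_window<'a>(lines: impl Iterator<Item = &'a str>) -> Vec<&'a str> {
    let mut window: VecDeque<&str> = VecDeque::with_capacity(100);
    let mut kept = Vec::new();
    for line in lines {
        if !window.contains(&line) {
            kept.push(line);
        }
        if window.len() == 100 {
            window.pop_front(); // slide the window forward
        }
        window.push_back(line);
    }
    kept
}

fn main() {
    let input = "a\nb\na\nc";
    let kept = dedup_within_window(input.lines());
    assert_eq!(kept, vec!["a", "b", "c"]); // the second "a" is dropped
}
```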

Next I'll take another stab at the rust tests. I'm not sure yet whether I'll try to have it generate the data files itself, or whether I'll try to come up with a work-around for sourceforge's rejection of them, like maybe zipping each file individually.

stonylohr commented 3 years ago

@michaelkirk I think we can close this issue. Would you agree?

michaelkirk commented 3 years ago

Agree, thanks for seeing this through!