upgrade to proj 8.2 - Githubissues

michaelkirk commented 2 years ago

Just a draft for now as I work through some outstanding issues...

I've made the update, and rebuilt containers, but I'm still seeing some mysterious failures...

✅ proj-sys on rust-1.57
✅ proj-sys on rust-1.52
✅ proj on rust-1.57
- ✅ proj w/ --features=bundle
- ✅ proj using prebuilt proj
⁉️ proj on rust-1.52
- ✅ proj w/ --features=bundle
- 🔥 proj using prebuilt proj

For some reason using the pre-built libproj seems to be causing some smallish discrepencies on older containers. TBH I suspect the base system rather than the actual rust version.

next step is to try to see if just doing an apt upgrade fixes anything...

https://github.com/georust/proj/actions/runs/1580354293

failures:
---- proj::test::test_inverse_projection stdout ----
thread 'proj::test::test_inverse_projection' panicked at 'assert_relative_eq!(t.x(), 0.43633200013698786)

    left  = 0.43636512759931073
    right = 0.43633200013698786

', src/proj.rs:1069:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- proj::test::test_london_inverse stdout ----
thread 'proj::test::test_london_inverse' panicked at 'assert_relative_eq!(t.x(), 0.0023755864830313977)

    left  = 0.002407322568220306
    right = 0.0023755864830313977
', src/proj.rs:1086:9

---- proj::test::test_projection stdout ----
thread 'proj::test::test_projection' panicked at 'assert_relative_eq!(t.x(), 500119.7035366755, epsilon = 1e-5)

    left  = 499972.70746607287
    right = 500119.7035366755

', src/proj.rs:1054:9

failures:
    proj::test::test_inverse_projection
    proj::test::test_london_inverse
    proj::test::test_projection

urschrei commented 2 years ago

Just to make sure I'm understanding this: There are currently two possible sources of difference:

Container libraries
Rust version

And you're going to try to eliminate (1). That's quite a surprising source!

michaelkirk commented 2 years ago

Ha, your instincts are better than mine. I'm pretty sure that's not the issue after poking at it a bit.

michaelkirk commented 2 years ago

My kingdom for even a wild theory as to how this could be happening...

The "bundled_proj" script builds from: https://github.com/georust/proj/pull/97/files#diff-27d669a056374d86fef7e0627a9011d3ef7388e4c16e67a981b1bfb6cc83413cR84

And that seems to be working all around. 👍

The containers which are not using bundled_proj are using the system proj. But it's not installed from apt or anything. It's built from this (seemingly identical) source code:

built here: https://github.com/georust/docker-images/blob/mkirk/update-proj/rust-1.52/libproj-builder.Dockerfile

and copied here: https://github.com/georust/docker-images/blob/mkirk/update-proj/rust-1.52/proj-ci.Dockerfile#L10

How would this be different?

urschrei commented 2 years ago

(Again, just so I'm clear here) the staged build copies the binary? Where on earth could a difference be coming from?

michaelkirk commented 2 years ago

(Again, just so I'm clear here) the staged build copies the binary? Where on earth could a difference be coming from?

That's right. In a little more detail:

The libproj-builder container compiles libproj (from the 8.2.0 source), but doesn't install it.

Containers which want libproj pre-installed (like proj-ci and geo-ci) copy it from the libproj-builder container into their own system path.

I can't figure out what the difference would be in the behavior of the proj compiled by the bundled_proj flag vs copied from the libproj-builder container.

michaelkirk commented 2 years ago

even wilder:

So when running all tests, three fail.

rust-1.52> cargo test
failures:
    proj::test::test_inverse_projection
    proj::test::test_london_inverse
    proj::test::test_projection

But when I run the failed tests indivudally:

rust-1.52> cargo test proj::test::test_inverse_projection
running 1 test
test proj::test::test_inverse_projection ... ok

passes!

rust-1.52> cargo test proj::test::test_london_inverse
running 1 test
test proj::test::test_london_inverse ... ok

passes!

rust-1.52> cargo test test_projection
test proj::test::test_projection ... ok

passes!

michaelkirk commented 2 years ago

Don't feel like you have to spend time on this, I'll keep plugging away, just documenting my findings as I go.

One more wild one... when running these two tests repeatedly, sometimes one fails, sometimes the other. It seems like there is some weird external effect at play, like some kind of race condition.

rust-1.52> cargo test projection
    Finished test [unoptimized + debuginfo] target(s) in 0.06s
     Running unittests (/tmp/cargo-target/debug/deps/proj-ec6d6ab38686495b)

running 2 tests
test proj::test::test_inverse_projection ... ok
test proj::test::test_projection ... FAILED

failures:

---- proj::test::test_projection stdout ----
thread 'proj::test::test_projection' panicked at 'assert_relative_eq!(t.x(), 500119.7035366755, epsilon = 1e-5)

    left  = 499972.70746607287
    right = 500119.7035366755

', src/proj.rs:1054:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

failures:
    proj::test::test_projection

test result: FAILED. 1 passed; 1 failed; 0 ignored; 0 measured; 13 filtered out; finished in 0.01s

root@6009bf331ef0:/tmp/georust/proj# /usr/local/cargo/bin/cargo test projection
    Finished test [unoptimized + debuginfo] target(s) in 0.06s
     Running unittests (/tmp/cargo-target/debug/deps/proj-ec6d6ab38686495b)

running 2 tests
test proj::test::test_projection ... ok
test proj::test::test_inverse_projection ... FAILED

failures:

---- proj::test::test_inverse_projection stdout ----
thread 'proj::test::test_inverse_projection' panicked at 'assert_relative_eq!(t.x(), 0.43633200013698786)

    left  = 0.43636512759931073
    right = 0.43633200013698786

', src/proj.rs:1069:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

failures:
    proj::test::test_inverse_projection

test result: FAILED. 1 passed; 1 failed; 0 ignored; 0 measured; 13 filtered out; finished in 0.01s

Also maybe/maybe not interesting, when I specify test-threads=1, test still fail, but they seem to follow a slightly different pattern.

vague theories:

some side effect caching bug?
maybe something in the network grid?

There's also the risk of some kind of memory corruption thing, but if that were the case, I might expect the incorrect values to be less close and/or less consistent.

urschrei commented 2 years ago

Oh no. What happens when you run two of them together, or add another conversion test? Because that sounds like memory might be getting clobbered by an FFI error somewhere…

urschrei commented 2 years ago

The network grid seems less likely, if only because we throw errors if it doesn't work. I guess an easy way to confirm that is to add a test that doesn't rely on the network and see if it fails in the same way?

urschrei commented 2 years ago

Re FFI: the issue causing https://github.com/georust/proj/pull/76 has been in the back of my mind.

michaelkirk commented 2 years ago

Re FFI: the issue causing #76 has been in the back of my mind.

Can you refresh my memory - what was the issue that lead to #76?

urschrei commented 2 years ago

Ugh sorry, I was thinking of https://github.com/georust/proj/issues/75, though they are related.

michaelkirk commented 2 years ago

Continuing to noodle on this...

Just as a sanity check...

    // Carry out a projection from geodetic coordinates
    fn test_projection() {
        let stereo70 = Proj::new(
            "+proj=sterea +lat_0=46 +lon_0=25 +k=0.99975 +x_0=500000 +y_0=500000
            +ellps=krass +towgs84=33.4,-146.6,-76.3,-0.359,-0.053,0.844,-0.84 +units=m +no_defs",
        )
        .unwrap();
        // Geodetic -> Pulkovo 1942(58) / Stereo70 (EPSG 3844)
        let t = stereo70
            .project(MyPoint::new(0.436332, 0.802851), false)
            .unwrap();
        assert_relative_eq!(t.x(), 500119.7035366755, epsilon = 1e-5);
        assert_relative_eq!(t.y(), 500027.77901023754, epsilon = 1e-5);
    }

This is an invalid coordinate, right?

MyPoint::new(0.436332, 0.802851)

Since it's a geodetic coordinate, shouldn't y always be less than 0.5 radians?

urschrei commented 2 years ago

I don't think so…you can readily convert it to degrees, which is a valid point in Romania (which is expected, given the Stereo70 projection)

michaelkirk commented 2 years ago

I don't think so…you can readily convert it to degrees, which is a valid point in Romania (which is expected, given the Stereo70 projection)

Ugh, of course. Sorry just confusing myself with really basic stuff.

On the plus side, I seem to have gotten tests passing locally with https://github.com/georust/proj/pull/97/commits/fb89caf3894945be2936ad1190bdf7c8a6295ba4

I'm not 100% sure on exactly what's happening, but I'm continuing to investigate.

Let's see what CI says...

michaelkirk commented 2 years ago

Ok, it seems to be passing CI now.

Honestly, I think the change is somewhat "intuitive", in that if we're passing in degrees to project we use PJ_LP, and if we're doing an inverse projection, then we'd use PJ_XY to denote a projected Point (e.g. in meters).

But I confess I have no idea why this fixes anything! It's just a regular ole dumb C-union, so the representation of the PJ_XY vs PJ_LP variants of PJ_COORD should be identical.

... right?

michaelkirk commented 2 years ago

(disclaimer: I'm not well versed on c memory layout)

urschrei commented 2 years ago

🟢! How did you decide to try that particular refactor?

urschrei commented 2 years ago

Ok, it seems to be passing CI now.

Honestly, I think the change is somewhat "intuitive", in that if we're passing in degrees to project we use PJ_LP, and if we're doing an inverse projection, then we'd use PJ_XY to denote a projected Point (e.g. in meters).

But I confess I have no idea why this fixes anything! It's just a regular ole dumb C-union, so the representation of the PJ_XY vs PJ_LP variants of PJ_COORD should be identical.

... right?

That's also my understanding – when I originally implemented it it looked like a simple union so I just shrugged and moved on, but I didn't look at the proj source to see whether it was doing anything different based on whether it was PJ_XY or PJ_LP…

michaelkirk commented 2 years ago

How did you decide to try that particular refactor?

Only intuition after reading PJ_COORD and its various flavors.

My read was that PJ_LP was for angular coords and PJ_XY was for projected coords - it seemed like we were using PJ_LP always.

But as I said above, I don't actually understand why this fixes anything, because my understanding is that their memory layout should be identical.

I'm going to read some proj source and try to learn something more about c unions...

michaelkirk commented 2 years ago

on whether it was PJ_XY or PJ_LP…

But as I understand it, it's equally both of those things. A basic union (as opposed to a tagged union) has no notion of what variant it is it just relies on the programmer to handle that.

urschrei commented 2 years ago

on whether it was PJ_XY or PJ_LP…

But as I understand it, it's equally both of those things. A basic union (as opposed to a tagged union) has no notion of what variant it is it just relies on the programmer to handle that.

I am out over my skis if that isn't the case, but given that it's just fixed what we suspected was a threading-related bug…

michaelkirk commented 2 years ago

Just by way of an update:

The tests are still occasionally failing for me, but seemingly much less with the changes around LP vs. XY. (I intermittently get failing results)

I feel pretty confident that this change is inducing different behavior, but not in any way that makes sense. I assume there is still some yet to be understood bug (presumably in our code) and that this change has just jostled the machine code around in such a way that we're avoiding the bug more often.

I'll keep looking into this, but maybe not for a couple days.

michaelkirk commented 2 years ago

I wonder if maybe it's static vs. dynamic linking that's causing the bug to present vs. not.

That could explain why --features bundled_proj (which statically links libproj) could succeed while linking to the system build (which links dynamically if available) of the exactly same build of proj (e.g. on proj-ci) could intermittently fail.

It doesn't help fix the bug, but at least it's a theory for how the two scenarios could be behaving differently.

michaelkirk commented 2 years ago

bors try

bors[bot] commented 2 years ago

try

Build failed:

ci result

michaelkirk commented 2 years ago

superseded by #118

georust / proj

upgrade to proj 8.2 #97

try