Closed michaelkirk closed 2 years ago
Just to make sure I'm understanding this: There are currently two possible sources of difference:
And you're going to try to eliminate (1). That's quite a surprising source!
Ha, your instincts are better than mine. I'm pretty sure that's not the issue after poking at it a bit.
My kingdom for even a wild theory as to how this could be happening...
The "bundled_proj" script builds from: https://github.com/georust/proj/pull/97/files#diff-27d669a056374d86fef7e0627a9011d3ef7388e4c16e67a981b1bfb6cc83413cR84
And that seems to be working all around. 👍
The containers which are not using bundled_proj are using the system proj. But it's not installed from apt or anything. It's built from this (seemingly identical) source code:
built here: https://github.com/georust/docker-images/blob/mkirk/update-proj/rust-1.52/libproj-builder.Dockerfile
and copied here: https://github.com/georust/docker-images/blob/mkirk/update-proj/rust-1.52/proj-ci.Dockerfile#L10
How would this be different?
(Again, just so I'm clear here) the staged build copies the binary? Where on earth could a difference be coming from?
(Again, just so I'm clear here) the staged build copies the binary? Where on earth could a difference be coming from?
That's right. In a little more detail:
The libproj-builder container compiles libproj (from the 8.2.0 source), but doesn't install it.
Containers which want libproj pre-installed (like proj-ci and geo-ci) copy it from the libproj-builder container into their own system path.
I can't figure out what the difference would be in the behavior of the proj compiled by the bundled_proj flag vs copied from the libproj-builder container.
even wilder:
So when running all tests, three fail.
rust-1.52> cargo test
failures:
proj::test::test_inverse_projection
proj::test::test_london_inverse
proj::test::test_projection
But when I run the failed tests indivudally:
rust-1.52> cargo test proj::test::test_inverse_projection
running 1 test
test proj::test::test_inverse_projection ... ok
passes!
rust-1.52> cargo test proj::test::test_london_inverse
running 1 test
test proj::test::test_london_inverse ... ok
passes!
rust-1.52> cargo test test_projection
test proj::test::test_projection ... ok
passes!
Don't feel like you have to spend time on this, I'll keep plugging away, just documenting my findings as I go.
One more wild one... when running these two tests repeatedly, sometimes one fails, sometimes the other. It seems like there is some weird external effect at play, like some kind of race condition.
rust-1.52> cargo test projection
Finished test [unoptimized + debuginfo] target(s) in 0.06s
Running unittests (/tmp/cargo-target/debug/deps/proj-ec6d6ab38686495b)
running 2 tests
test proj::test::test_inverse_projection ... ok
test proj::test::test_projection ... FAILED
failures:
---- proj::test::test_projection stdout ----
thread 'proj::test::test_projection' panicked at 'assert_relative_eq!(t.x(), 500119.7035366755, epsilon = 1e-5)
left = 499972.70746607287
right = 500119.7035366755
', src/proj.rs:1054:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
proj::test::test_projection
test result: FAILED. 1 passed; 1 failed; 0 ignored; 0 measured; 13 filtered out; finished in 0.01s
root@6009bf331ef0:/tmp/georust/proj# /usr/local/cargo/bin/cargo test projection
Finished test [unoptimized + debuginfo] target(s) in 0.06s
Running unittests (/tmp/cargo-target/debug/deps/proj-ec6d6ab38686495b)
running 2 tests
test proj::test::test_projection ... ok
test proj::test::test_inverse_projection ... FAILED
failures:
---- proj::test::test_inverse_projection stdout ----
thread 'proj::test::test_inverse_projection' panicked at 'assert_relative_eq!(t.x(), 0.43633200013698786)
left = 0.43636512759931073
right = 0.43633200013698786
', src/proj.rs:1069:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
proj::test::test_inverse_projection
test result: FAILED. 1 passed; 1 failed; 0 ignored; 0 measured; 13 filtered out; finished in 0.01s
Also maybe/maybe not interesting, when I specify test-threads=1, test still fail, but they seem to follow a slightly different pattern.
vague theories:
There's also the risk of some kind of memory corruption thing, but if that were the case, I might expect the incorrect values to be less close and/or less consistent.
Oh no. What happens when you run two of them together, or add another conversion test? Because that sounds like memory might be getting clobbered by an FFI error somewhere…
The network grid seems less likely, if only because we throw errors if it doesn't work. I guess an easy way to confirm that is to add a test that doesn't rely on the network and see if it fails in the same way?
Re FFI: the issue causing https://github.com/georust/proj/pull/76 has been in the back of my mind.
Re FFI: the issue causing #76 has been in the back of my mind.
Can you refresh my memory - what was the issue that lead to #76?
Ugh sorry, I was thinking of https://github.com/georust/proj/issues/75, though they are related.
Continuing to noodle on this...
Just as a sanity check...
// Carry out a projection from geodetic coordinates
fn test_projection() {
let stereo70 = Proj::new(
"+proj=sterea +lat_0=46 +lon_0=25 +k=0.99975 +x_0=500000 +y_0=500000
+ellps=krass +towgs84=33.4,-146.6,-76.3,-0.359,-0.053,0.844,-0.84 +units=m +no_defs",
)
.unwrap();
// Geodetic -> Pulkovo 1942(58) / Stereo70 (EPSG 3844)
let t = stereo70
.project(MyPoint::new(0.436332, 0.802851), false)
.unwrap();
assert_relative_eq!(t.x(), 500119.7035366755, epsilon = 1e-5);
assert_relative_eq!(t.y(), 500027.77901023754, epsilon = 1e-5);
}
This is an invalid coordinate, right?
MyPoint::new(0.436332, 0.802851)
Since it's a geodetic coordinate, shouldn't y always be less than 0.5 radians?
I don't think so…you can readily convert it to degrees, which is a valid point in Romania (which is expected, given the Stereo70 projection)
I don't think so…you can readily convert it to degrees, which is a valid point in Romania (which is expected, given the Stereo70 projection)
Ugh, of course. Sorry just confusing myself with really basic stuff.
On the plus side, I seem to have gotten tests passing locally with https://github.com/georust/proj/pull/97/commits/fb89caf3894945be2936ad1190bdf7c8a6295ba4
I'm not 100% sure on exactly what's happening, but I'm continuing to investigate.
Let's see what CI says...
Ok, it seems to be passing CI now.
Honestly, I think the change is somewhat "intuitive", in that if we're passing in degrees to project we use PJ_LP, and if we're doing an inverse projection, then we'd use PJ_XY to denote a projected Point (e.g. in meters).
But I confess I have no idea why this fixes anything! It's just a regular ole dumb C-union, so the representation of the PJ_XY vs PJ_LP variants of PJ_COORD should be identical.
... right?
(disclaimer: I'm not well versed on c memory layout)
🟢! How did you decide to try that particular refactor?
Ok, it seems to be passing CI now.
Honestly, I think the change is somewhat "intuitive", in that if we're passing in degrees to project we use PJ_LP, and if we're doing an inverse projection, then we'd use PJ_XY to denote a projected Point (e.g. in meters).
But I confess I have no idea why this fixes anything! It's just a regular ole dumb C-union, so the representation of the PJ_XY vs PJ_LP variants of PJ_COORD should be identical.
... right?
That's also my understanding – when I originally implemented it it looked like a simple union so I just shrugged and moved on, but I didn't look at the proj source to see whether it was doing anything different based on whether it was PJ_XY or PJ_LP…
How did you decide to try that particular refactor?
Only intuition after reading PJ_COORD and its various flavors.
My read was that PJ_LP was for angular coords and PJ_XY was for projected coords - it seemed like we were using PJ_LP always.
But as I said above, I don't actually understand why this fixes anything, because my understanding is that their memory layout should be identical.
I'm going to read some proj source and try to learn something more about c unions...
on whether it was PJ_XY or PJ_LP…
But as I understand it, it's equally both of those things. A basic union (as opposed to a tagged union) has no notion of what variant it is it just relies on the programmer to handle that.
on whether it was PJ_XY or PJ_LP…
But as I understand it, it's equally both of those things. A basic union (as opposed to a tagged union) has no notion of what variant it is it just relies on the programmer to handle that.
I am out over my skis if that isn't the case, but given that it's just fixed what we suspected was a threading-related bug…
Just by way of an update:
The tests are still occasionally failing for me, but seemingly much less with the changes around LP vs. XY. (I intermittently get failing results)
I feel pretty confident that this change is inducing different behavior, but not in any way that makes sense. I assume there is still some yet to be understood bug (presumably in our code) and that this change has just jostled the machine code around in such a way that we're avoiding the bug more often.
I'll keep looking into this, but maybe not for a couple days.
I wonder if maybe it's static vs. dynamic linking that's causing the bug to present vs. not.
That could explain why --features bundled_proj
(which statically links libproj) could succeed while linking to the system build (which links dynamically if available) of the exactly same build of proj (e.g. on proj-ci) could intermittently fail.
It doesn't help fix the bug, but at least it's a theory for how the two scenarios could be behaving differently.
bors try
superseded by #118
Just a draft for now as I work through some outstanding issues...
I've made the update, and rebuilt containers, but I'm still seeing some mysterious failures...
For some reason using the pre-built libproj seems to be causing some smallish discrepencies on older containers. TBH I suspect the base system rather than the actual rust version.
next step is to try to see if just doing an
apt upgrade
fixes anything...https://github.com/georust/proj/actions/runs/1580354293