borglab / gtsam

GTSAM is a library of C++ classes that implement smoothing and mapping (SAM) in robotics and vision, using factor graphs and Bayes networks as the underlying computing paradigm rather than sparse matrices.
http://gtsam.org
Other
2.47k stars 743 forks source link

aarch64 Nonlinear Optimizer Issues #803

Open jaelrod opened 3 years ago

jaelrod commented 3 years ago

Description

I have some unit tests that chain together some BetweenFactor using an odometry mock, inject some slightly contradictory supplemental evidence in the form of other factors, such as a unary factor akin to the "GPS-like" factor from LocalizationExample.cpp, and finally use LevenbergMarquardtOptimizer to optimize() the factor graph and assert the result.

For example, a test might start with some rather noisy mocked odometry moving forward at a consistent speed with a heading of pi/4, qualify the noisy estimates with some very not-noisy GPS-like factors that suggest movement is actually happening in the pi/2 direction, rather than the pi/4 direction, then optimize the graph and make sure that the optimized poses were along the trajectory suggested by the more reliable GPS-like factors.

Such a test and several others work fine on my desktop (x86_64, Ubuntu 18.04) but fail on my Jetson Xavier NX (aarch64, Ubuntu 18.04).

Checking out borglab gtsam source from github and running make check to invoke unit tests on the aarch64 device results in many failures, which I expect are related to my own, the following example of which I think is similar to one of the more basic included examples in gtsam :

gtsam/tests/testNonlinearOptimizer.cpp:484: Failure: "assert_equal(expected, dl_result, tol)" 
Not equal:
expected:
Values with 3 values:
Value x1: (gtsam::Pose2)
(0, 0, 0)
Value x2: (gtsam::Pose2)
(1.5, 0, 0)
Value x3: (gtsam::Pose2)
(3, 0, 0)

actual:
Values with 3 values:
Value x1: (gtsam::Pose2)
(-3.74995e-06, 0, 0)
Value x2: (gtsam::Pose2)
(1.49999, 0, 0)
Value x3: (gtsam::Pose2)
(3, 0, 0)

I am in the process of debugging gtsam source through verbose optimizer output and will provide updates as I make progress, but I am suspicious of Eigen anti-aliasing or use of auto keyword and type resolution on aarch64.

One thing I have observed so far only in aarch64 is:

Partial Cholesky on HessianFactor failed.

Frontal keys Position 0: 0
HessianFactor:
 keys: 0(3) 1(3) 
Augmented information matrix: [
    1.00453889587e+81, -8.863578493e+80, 1.35333966459e+82, -1.22992576855e+42, -7.68703605343e+40, 0, -1.78961791309e+41;
    -8.863578493e+80, 1.00453889587e+81, -1.19233688982e+82, 1.22992576855e+42, -7.68703605343e+40, 0, 3.06711287414e+41;
    1.35333966459e+82, -1.19233688982e+82, 1.82326703797e+83, -1.65582451852e+43, -1.04723572027e+42, -400, -2.39906858576e+42;
    -1.22992576855e+42, 1.22992576855e+42, -1.65582451852e+43, 1600, 0, 0, 315.903994301;
    -7.68703605343e+40, -7.68703605343e+40, -1.04723572027e+42, 0, 100, 0, -83.0941179516;
    0, 0, -400, 0, 0, 400, 628.318530718;
    -1.78961791309e+41, 3.06711287414e+41, -2.39906858576e+42, 315.903994301, -83.0941179516, 628.318530718, 304968.261602
]

Another thing I have noticed in aarch64 is that the initial total error of the same system seems much lower than that of amd64, possibly resulting in less/no optimization iterations.

Update 1: Seems like the calculation of error of BetweenFactor<Pose2> is indeed incorrect on aarch64.

Debug output from amd64 of BetweenFactor::evaluateError for the very first error calculation based on estimates:

p1: (11, 10.5, 0.392699081699), p2: (20, 23, 0.785398163397), err: [13.1184182938, 7.63899870558, 0.352566699052]

Debug output from aarch64 of BetweenFactor::evaluateError for almost the exact same poses:

p1: (11, 10.5, 0.392699081699), p2: (20, 23, 0.785398163397), err: [-0.261580542738, -0.309698692039, -1.57079632679]

Playing around inside of BetweenFactor and LieGroup but haven't had much luck so far.

Update 2: I haven't 100% figured out why, but Rot2 inverse() in geometry/Rot2.h was returning an inverse with incorrect heading angle when inverting Pose2 as a part of factorization of BetweenFactor<Pose2>.

Changing this line, https://github.com/borglab/gtsam/blob/aec2cf06a5501840f0302344bca03c197522a335/gtsam/geometry/Rot2.h#L110 to read the following fixes at least that issue (seems like there are more still to find): Rot2 inverse() const { return fromCosSin(c_, -s_);}

Like I said, not yet sure why this makes a difference, as all fromCosSin appears to be doing is normalizing c & s, and those should have already been normalized...

Update 3: Also, Rot2::operator*(const Rot2& R) does not behave as expected in aarch64. The following seems to relieve the problem: return fromAngle(theta() + R.theta());

Although, again, I am not sure why this works, as making use of the trig identity for cos(th1+th2) and sin(th1+th2) as is done in master branch should be identical in terms of resulting behavior...

Steps to reproduce

  1. Using an aarch64 device
  2. Create a factor graph with some BetweenFactor<Pose2> and some initial estimates
  3. Optimize the factor graph
  4. Optimization does not work as expected

Expected behavior

Factor graph optimizes in the same way it does on amd64.

Environment

Jetson Xavier NX, Ubuntu 18.04

jaelrod commented 3 years ago

Somehow (I think it is maybe happening when the LM optimizer is initialized) the left operand of Rot2*Rot2 has an uninitialized c_ and s_ during an invocation before the optimizer takes its first iteration. I think what's happening is that when this uninitialized rotation gets inverted and multiplied during the optimization iterations, the cosine and/or sine of the rotation component of pose estimates becomes extremely positive or negative, the result of which is atan2() for the pose heading angle returning one of its inflection points, as documented in cppreference. Have not figured out from where this uninitialized rotation is coming, but given that reproducing it repeatedly yields wildly different figures for c_ and s_ for that rotation that shows up before the optimization, that does appear to be what is happening.

Debug output from Rot2 before optimizer's first linearization:

Rot2::operator*(), c_: 3.03428333399e-86, s_: 1.14371993161e-71

Initial values
Values with 1 values:
Value 0: (gtsam::Pose2)
(10.0808679401, 9.99346051488, 0.428400412304)
ProfFan commented 3 years ago

Yes I agree this could be uninitialized memory. Could you run the unit tests with clang sanitizer on?

jaelrod commented 3 years ago

Despite having clang installed on my Jetson, it looks like GCC is the compiler in use both when building my entire project as well as when building only gtsam with make check.

Switched to use clang and have: clang: error: the clang compiler does not support '-march=native'

Is disabling that going to cause other issues in aarch64..?

ProfFan commented 3 years ago

Interesting that -march=native is unsupported, let me try on my Jetson Nano.

jaelrod commented 3 years ago

(Currently building it on my Jetson with GTSAM_BUILD_WITH_MARCH_NATIVE OFF. Build seems to be proceeding, but it takes long enough to complete that I thought I would go ahead and mention that it is currently underway over here with

--  C compilation flags                              :  -fsanitize=memory -O3 -DNDEBUG
--  C++ compilation flags                            :  -fsanitize=memory -O3 -DNDEBUG

)

ProfFan commented 3 years ago

Maybe you can try cross compiling on PC. The Jetson CPU cores are very slow.

ProfFan commented 3 years ago

*also compiling on my Jetson Nano here

jlblancoc commented 3 years ago

Any hints out of running make testXXX.valgrind on a native aarch64 system?

jaelrod commented 3 years ago

Things seem to have successfully compiled on my Jetson using clang-6 with -fsanitize=memory on, but I don't really see anything different when running gtsam tests. Any luck on your side, @ProfFan ?

@jiblancoc thanks for the suggestion! I am still getting re-acquainted with C++ so did not know about valgrind. Running make testNonlinearOptimizer.valgrind on amd64 looks pretty clean (some of the tests fail but no memory errors reported). Running it in aarch64 reports loads of issues looking like out-of-bounds access (both writes & reads), some of which seem to be fairly pertinent (these are in the order in which they appear):

==2475== Invalid write of size 8
==2475==    at 0x4A5227C: gtsam::Pose2::equals(gtsam::Pose2 const&, double) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x4BB4507: gtsam::Values::equals(gtsam::Values const&, double) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x4342C7: bool gtsam::assert_equal<gtsam::Values>(gtsam::Values const&, gtsam::Values const&, double) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x42168B: NonlinearOptimizerFactorizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x444E43: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==  Address 0x1ffefffa60 is on thread 1's stack
==2475==  48 bytes below stack pointer
==2475== Invalid write of size 8
==2475==    at 0x4B64450: gtsam::VectorValues::equals(gtsam::VectorValues const&, double) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x43561B: bool gtsam::assert_equal<gtsam::VectorValues>(gtsam::VectorValues const&, gtsam::VectorValues const&, double) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x424633: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==  Address 0x1ffeffef40 is on thread 1's stack
==2475==  48 bytes below stack pointer
==2475== 
==2475== Conditional jump or move depends on uninitialised value(s)
==2475==    at 0x4A3A18C: gtsam::equal_with_abs_tol(Eigen::Matrix<double, -1, 1, 0, -1, 1> const&, Eigen::Matrix<double, -1, 1, 0, -1, 1> const&, double) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x4B644C7: gtsam::VectorValues::equals(gtsam::VectorValues const&, double) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x43561B: bool gtsam::assert_equal<gtsam::VectorValues>(gtsam::VectorValues const&, gtsam::VectorValues const&, double) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x424633: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== 
==2475== Conditional jump or move depends on uninitialised value(s)
==2475==    at 0x4BB0D2C: gtsam::NonlinearOptimizer::defaultOptimize() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x424937: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== 
==2475== Conditional jump or move depends on uninitialised value(s)
==2475==    at 0x4BB0D2C: gtsam::NonlinearOptimizer::defaultOptimize() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x424937: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== 
==2475== Conditional jump or move depends on uninitialised value(s)
==2475==    at 0x4BB0D30: gtsam::NonlinearOptimizer::defaultOptimize() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x424937: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== 
==2475== Conditional jump or move depends on uninitialised value(s)
==2475==    at 0x4BB0D3C: gtsam::NonlinearOptimizer::defaultOptimize() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x424937: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== 
==2475== Conditional jump or move depends on uninitialised value(s)
==2475==    at 0x511ED5C: cos (s_sin.c:559)
==2475==    by 0x4A52887: gtsam::Pose2::ChartAtOrigin::Retract(Eigen::Matrix<double, 3, 1, 0, 3, 1> const&, gtsam::OptionalJacobian<3, 3>) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x43C84F: gtsam::LieGroup<gtsam::Pose2, 3>::retract(Eigen::Matrix<double, 3, 1, 0, 3, 1> const&, gtsam::OptionalJacobian<3, 3>, gtsam::OptionalJacobian<3, 3>) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x43C39F: gtsam::GenericValue<gtsam::Pose2>::retract_(Eigen::Matrix<double, -1, 1, 0, -1, 1> const&) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4BB3E03: gtsam::Values::Values(gtsam::Values const&, gtsam::VectorValues const&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x4B6D3E3: gtsam::DoglegOptimizerImpl::IterationResult gtsam::DoglegOptimizerImpl::Iterate<gtsam::GaussianBayesTree, gtsam::NonlinearFactorGraph, gtsam::Values>(double, gtsam::DoglegOptimizerImpl::TrustRegionAdaptationMode, gtsam::VectorValues const&, gtsam::VectorValues const&, gtsam::GaussianBayesTree const&, gtsam::NonlinearFactorGraph const&, gtsam::Values const&, double, bool) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x4B6C193: gtsam::DoglegOptimizer::iterate() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x4BB0B9B: gtsam::NonlinearOptimizer::defaultOptimize() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x424937: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== 
==2475== Conditional jump or move depends on uninitialised value(s)
==2475==    at 0x511EDA8: cos (s_sin.c:562)
==2475==    by 0x4A52887: gtsam::Pose2::ChartAtOrigin::Retract(Eigen::Matrix<double, 3, 1, 0, 3, 1> const&, gtsam::OptionalJacobian<3, 3>) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x43C84F: gtsam::LieGroup<gtsam::Pose2, 3>::retract(Eigen::Matrix<double, 3, 1, 0, 3, 1> const&, gtsam::OptionalJacobian<3, 3>, gtsam::OptionalJacobian<3, 3>) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x43C39F: gtsam::GenericValue<gtsam::Pose2>::retract_(Eigen::Matrix<double, -1, 1, 0, -1, 1> const&) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4BB3E03: gtsam::Values::Values(gtsam::Values const&, gtsam::VectorValues const&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x4B6D3E3: gtsam::DoglegOptimizerImpl::IterationResult gtsam::DoglegOptimizerImpl::Iterate<gtsam::GaussianBayesTree, gtsam::NonlinearFactorGraph, gtsam::Values>(double, gtsam::DoglegOptimizerImpl::TrustRegionAdaptationMode, gtsam::VectorValues const&, gtsam::VectorValues const&, gtsam::GaussianBayesTree const&, gtsam::NonlinearFactorGraph const&, gtsam::Values const&, double, bool) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x4B6C193: gtsam::DoglegOptimizer::iterate() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x4BB0B9B: gtsam::NonlinearOptimizer::defaultOptimize() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x424937: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== 
==2475== Use of uninitialised value of size 8
==2475==    at 0x511EE28: do_cos (s_sin.c:165)
==2475==    by 0x511EE28: cos (s_sin.c:564)
==2475==    by 0x4A52887: gtsam::Pose2::ChartAtOrigin::Retract(Eigen::Matrix<double, 3, 1, 0, 3, 1> const&, gtsam::OptionalJacobian<3, 3>) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x43C84F: gtsam::LieGroup<gtsam::Pose2, 3>::retract(Eigen::Matrix<double, 3, 1, 0, 3, 1> const&, gtsam::OptionalJacobian<3, 3>, gtsam::OptionalJacobian<3, 3>) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x43C39F: gtsam::GenericValue<gtsam::Pose2>::retract_(Eigen::Matrix<double, -1, 1, 0, -1, 1> const&) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4BB3E03: gtsam::Values::Values(gtsam::Values const&, gtsam::VectorValues const&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x4B6D3E3: gtsam::DoglegOptimizerImpl::IterationResult gtsam::DoglegOptimizerImpl::Iterate<gtsam::GaussianBayesTree, gtsam::NonlinearFactorGraph, gtsam::Values>(double, gtsam::DoglegOptimizerImpl::TrustRegionAdaptationMode, gtsam::VectorValues const&, gtsam::VectorValues const&, gtsam::GaussianBayesTree const&, gtsam::NonlinearFactorGraph const&, gtsam::Values const&, double, bool) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x4B6C193: gtsam::DoglegOptimizer::iterate() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x4BB0B9B: gtsam::NonlinearOptimizer::defaultOptimize() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475==    by 0x424937: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)

Even this very very basic test TEST( NonlinearOptimizer, paramsEquals ) experiences an out-of-bounds error:

==2475== Invalid write of size 8
==2475==    at 0x41B85C: NonlinearOptimizerparamsEqualsTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x444E43: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==    by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==  Address 0x1ffeffff40 is on thread 1's stack
==2475==  96 bytes below stack pointer
ProfFan commented 3 years ago

Thanks for the report! Could you also run valgrind with this: https://www.valgrind.org/docs/manual/mc-manual.html#opt.track-origins

BTW, are you using JetPack? I think JetPack is still on Ubuntu Bionic whose Boost version has some bugs that prevents ubsan and asan from working. (I tried and failed on my Jetson Nano).

jaelrod commented 3 years ago

Excellent to know about track-origins, thanks for highlighting that option! I will definitely have to try that out.

After much tedious debugging, I found that the functions returning by-value instances of Rot2 were the culprit of the garbage values I was seeing. Values of cosine & sine of rotation looked how they should during construction of the instances to be returned, but when the value was actually returned, those good values disappeared and the calling code was left with junk. Adding a trivial Rot2 copy constructor as a user-defined constructor, Rot2(const Rot2& rot) : c_(rot.c_), s_(rot.s_) {}, does result in it getting called when returning Rot2 by value from a function, and it does actually resolve the uninitialized values issue (without needing my cosine & sine normalization band-aid fix from above). The same test make testNonlinearOptimizer.valgrind drops from loads of uninitialized memory errors to only the following two:

==23437== Invalid read of size 8
==23437==    at 0x144C4C: gtsam::BetweenFactor<Eigen::Matrix<double, 2, 1, 0, 2, 1> >::evaluateError(Eigen::Matrix<double, 2, 1, 0, 2, 1> const&, Eigen::Matrix<double, 2, 1, 0, 2, 1> const&, boost::optional<Eigen::Matrix<double, -1, -1, 0, -1, -1>&>, boost::optional<Eigen::Matrix<double, -1, -1, 0, -1, -1>&>) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437==    by 0x143B77: gtsam::NoiseModelFactor2<Eigen::Matrix<double, 2, 1, 0, 2, 1>, Eigen::Matrix<double, 2, 1, 0, 2, 1> >::unwhitenedError(gtsam::Values const&, boost::optional<std::vector<Eigen::Matrix<double, -1, -1, 0, -1, -1>, std::allocator<Eigen::Matrix<double, -1, -1, 0, -1, -1> > >&>) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437==    by 0x4BB260F: gtsam::NoiseModelFactor::error(gtsam::Values const&) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437==    by 0x4BB4C07: gtsam::NonlinearFactorGraph::error(gtsam::Values const&) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437==    by 0x4B7B807: gtsam::GaussNewtonOptimizer::GaussNewtonOptimizer(gtsam::NonlinearFactorGraph const&, gtsam::Values const&, gtsam::GaussNewtonParams const&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437==    by 0x13534F: NonlinearOptimizerPoint2LinearOptimizationWithHuberTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437==    by 0x145FB7: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437==    by 0x12A5B3: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437==  Address 0x5656ce0 is 0 bytes after a block of size 16 alloc'd
==23437==    at 0x4845BFC: malloc (in /usr/lib/valgrind/vgpreload_memcheck-arm64-linux.so)
==23437== 
==23437== Invalid read of size 8
==23437==    at 0x144C4C: gtsam::BetweenFactor<Eigen::Matrix<double, 2, 1, 0, 2, 1> >::evaluateError(Eigen::Matrix<double, 2, 1, 0, 2, 1> const&, Eigen::Matrix<double, 2, 1, 0, 2, 1> const&, boost::optional<Eigen::Matrix<double, -1, -1, 0, -1, -1>&>, boost::optional<Eigen::Matrix<double, -1, -1, 0, -1, -1>&>) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437==    by 0x143BDB: gtsam::NoiseModelFactor2<Eigen::Matrix<double, 2, 1, 0, 2, 1>, Eigen::Matrix<double, 2, 1, 0, 2, 1> >::unwhitenedError(gtsam::Values const&, boost::optional<std::vector<Eigen::Matrix<double, -1, -1, 0, -1, -1>, std::allocator<Eigen::Matrix<double, -1, -1, 0, -1, -1> > >&>) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437==    by 0x4BB2893: gtsam::NoiseModelFactor::linearize(gtsam::Values const&) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437==    by 0x4BB420F: gtsam::(anonymous namespace)::_LinearizeOneFactor::operator()(tbb::blocked_range<unsigned long> const&) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437==    by 0x4BB440B: tbb::interface9::internal::start_for<tbb::blocked_range<unsigned long>, gtsam::(anonymous namespace)::_LinearizeOneFactor, tbb::auto_partitioner const>::execute() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437==    by 0x4DA0557: ??? (in /usr/lib/aarch64-linux-gnu/libtbb.so.2)
==23437==    by 0x4D9DA9F: ??? (in /usr/lib/aarch64-linux-gnu/libtbb.so.2)
==23437==    by 0x4BB573B: gtsam::NonlinearFactorGraph::linearize(gtsam::Values const&) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437==    by 0x4B7A873: gtsam::GaussNewtonOptimizer::iterate() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437==    by 0x4BB9C77: gtsam::NonlinearOptimizer::defaultOptimize() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437==    by 0x135357: NonlinearOptimizerPoint2LinearOptimizationWithHuberTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437==    by 0x145FB7: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437==  Address 0x5657840 is 0 bytes after a block of size 16 alloc'd
==23437==    at 0x4845BFC: malloc (in /usr/lib/valgrind/vgpreload_memcheck-arm64-linux.so)
==23437== 

As for why the compiler-generated copy constructor is not sufficient for the simple return-by-value functions inside of Rot2, of that I am not yet completely sure, and I am surprised that such a trivial constructor would be problematic. Because compiling with both gcc and clang result in the same misbehavior on aarch64, it seems like the root cause of it may be something in the ARM64 implementation. Seems sort of like copy elision in return value optimization might be the culprit, and that user definition of the copy constructor makes it "non-trivial", avoiding such elision as described in https://developer.arm.com/documentation/ihi0059/latest , section GC++ABI 3.1.4 Return Values:

GC++ABI 3.1.4 Return Values

When a return value has a non-trivial copy constructor or destructor, the address of the caller-allocated temporary is passed in the Indirect Result Location Register (r8) rather than as an implicit first parameter before the this parameter and user parameters.
jaelrod commented 3 years ago

And yes, this is on JetPack, boost version is 1.65.1

ProfFan commented 3 years ago

Interesting find! I think this deserves a more formal investigation, but could you put up a quick PR to fix this first? Thanks a lot!

dellaert commented 3 years ago

@ProfFan I think a PR is pre-mature. We first need to find the root cause, as per the discussion on the user group (please read up on it :-))

jaelrod commented 3 years ago

@dellaert, @ProfFan

I am not sure how much more there is to uncover here. I reverted all local changes with the exception of the addition of the copy constructor I added, Rot2(const Rot2& r), and all of the unit tests on my end are now passing on my Jetson (the ones described as having some BetweenFactor<Pose2> chains constrained by some custom derived unary factors along the way).

I am rebuilding everything from scratch again to re-run make check on my Jetson to see if the rest of the gtsam tests also pass. My plan is to capture the disassembly of a simple function returning a Rot2 by-value (Rot2::atan2(double y, double x) looks promising), remove the user-defined copy constructor, recompile the code, and then compare the resulting assembly instructions for the same function, in hopes to identify how exactly the implicit compiler-generated copy constructor is misbehaving for the return-by-value functions in Rot2, but I think ultimately that misbehavior is not the responsibility of gtsam.

I will report back with the assembly instructions with & without the user-defined copy constructor for Rot2, but probably it makes sense to go ahead and send a PR for this and then just keep an eye out for similar issues in aarch64 (and other platforms).

jaelrod commented 3 years ago

Jetson is taking forever to build the rest of the tests as usual, but here are the resulting disassembly in amd64. It has been some time since I have worked with assembly and have never actually worked with aarch64 instruction set yet, but you can see in the objdump from amd64 a hint that I am on the right track with respect to return value optimization: you can see that without the user-defined copy constructor, the function Rot2::atan2 does not push/pop the stack, yet when the user-defined constructor is there, it does push %rbx and then pop %rbx before ret, suggesting that copy elision is being performed as a part of return value optimization in the absence of a non-trivial copy constructor. My thought is that probably this copy elision is happening in both platforms is not being handled correctly in the aarch64 implementation of the standard.

AMD64 -- implicit `Rot2(const Rot2& r)`
00000000000001f0 <_ZN5gtsam4Rot25atan2Edd>:
 1f0:   48 83 ec 28             sub    $0x28,%rsp
 1f4:   c5 f1 14 c0             vunpcklpd %xmm0,%xmm1,%xmm0
 1f8:   64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
 1ff:   00 00 
 201:   48 89 44 24 18          mov    %rax,0x18(%rsp)
 206:   31 c0                   xor    %eax,%eax
 208:   48 89 e7                mov    %rsp,%rdi
 20b:   c5 f8 29 04 24          vmovaps %xmm0,(%rsp)
 210:   e8 00 00 00 00          callq  215 <_ZN5gtsam4Rot25atan2Edd+0x25>
 215:   48 8b 10                mov    (%rax),%rdx
 218:   48 8b 40 08             mov    0x8(%rax),%rax
 21c:   c4 e1 f9 6e c8          vmovq  %rax,%xmm1
 221:   48 8b 44 24 18          mov    0x18(%rsp),%rax
 226:   64 48 33 04 25 28 00    xor    %fs:0x28,%rax
 22d:   00 00 
 22f:   75 0a                   jne    23b <_ZN5gtsam4Rot25atan2Edd+0x4b>
 231:   c4 e1 f9 6e c2          vmovq  %rdx,%xmm0
 236:   48 83 c4 28             add    $0x28,%rsp
 23a:   c3                      retq   
 23b:   e8 00 00 00 00          callq  240 <_ZN5gtsam4Rot26ExpmapERKN5Eigen6MatrixIdLi1ELi1ELi0ELi1ELi1EEENS_16OptionalJacobianILi1ELi1EEE>

AMD64 -- user-defined `Rot2(const Rot2& r)`
0000000000000210 <_ZN5gtsam4Rot25atan2Edd>:
 210:   53                      push   %rbx
 211:   c5 f1 14 c0             vunpcklpd %xmm0,%xmm1,%xmm0
 215:   48 89 fb                mov    %rdi,%rbx
 218:   48 83 ec 20             sub    $0x20,%rsp
 21c:   64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
 223:   00 00 
 225:   48 89 44 24 18          mov    %rax,0x18(%rsp)
 22a:   31 c0                   xor    %eax,%eax
 22c:   48 89 e7                mov    %rsp,%rdi
 22f:   c5 f8 29 04 24          vmovaps %xmm0,(%rsp)
 234:   e8 00 00 00 00          callq  239 <_ZN5gtsam4Rot25atan2Edd+0x29>
 239:   48 8b 54 24 18          mov    0x18(%rsp),%rdx
 23e:   64 48 33 14 25 28 00    xor    %fs:0x28,%rdx
 245:   00 00 
 247:   c5 f9 10 00             vmovupd (%rax),%xmm0
 24b:   c5 f8 11 03             vmovups %xmm0,(%rbx)
 24f:   75 09                   jne    25a <_ZN5gtsam4Rot25atan2Edd+0x4a>
 251:   48 83 c4 20             add    $0x20,%rsp
 255:   48 89 d8                mov    %rbx,%rax
 258:   5b                      pop    %rbx
 259:   c3                      retq   
 25a:   e8 00 00 00 00          callq  25f <_ZN5gtsam4Rot25atan2Edd+0x4f>
 25f:   90                      nop
jaelrod commented 3 years ago

100% of make check passes on Jetson.

I also just re-ran make testNonlinearOptimizer.valgrind and it has nothing interesting to report. Something I was playing around with before finding the real issue must have been somehow cached in a build output somewhere, resulting in those last two dangling memory errors.

Will get that additional disassembly for completeness and then send a PR.

dellaert commented 3 years ago

wow, awesome!

jaelrod commented 3 years ago

@dellaert Thanks!

Below are the resulting variations of aarch64 instructions for the same function. I mentioned above that I am no expert in assembly, and certainly not in aarch64 flavor, so anyone feel free to correct me, but it does seem to at least share a similar difference pointed out above between the amd64 instruction variations that hint at copy elision: when implicit copy constructor is in use, the instructions stp & str push x29, x30, and x19 onto the stack, but when user-defined copy constructor is there, that str is instead a stp resulting in an additional pushing of x20 onto the stack; the same can be seen of popping using a second ldp instead of ldr.

At this point, I don't think a super deep dive into where the copy eliding version is going wrong is warranted, given that tests are passing in both gtsam and my own project.

ARM64 -- implicit `Rot2(const Rot2& r)`
00000000000001b8 <_ZN5gtsam4Rot25atan2Edd>:
 1b8:   a9bc7bfd    stp x29, x30, [sp, #-64]!
 1bc:   910003fd    mov x29, sp
 1c0:   f9000bf3    str x19, [sp, #16]
 1c4:   90000013    adrp    x19, 0 <__stack_chk_guard>
 1c8:   9100a3a0    add x0, x29, #0x28
 1cc:   f9400273    ldr x19, [x19]
 1d0:   6d0283a1    stp d1, d0, [x29, #40]
 1d4:   f9400261    ldr x1, [x19]
 1d8:   f9001fa1    str x1, [x29, #56]
 1dc:   d2800001    mov x1, #0x0                    // #0
 1e0:   94000000    bl  f0 <_ZN5gtsam4Rot29normalizeEv>
 1e4:   a9400001    ldp x1, x0, [x0]
 1e8:   9e670020    fmov    d0, x1
 1ec:   9e670001    fmov    d1, x0
 1f0:   f9401fa1    ldr x1, [x29, #56]
 1f4:   f9400260    ldr x0, [x19]
 1f8:   ca000020    eor x0, x1, x0
 1fc:   b5000080    cbnz    x0, 20c <_ZN5gtsam4Rot25atan2Edd+0x54>
 200:   f9400bf3    ldr x19, [sp, #16]
 204:   a8c47bfd    ldp x29, x30, [sp], #64
 208:   d65f03c0    ret
 20c:   94000000    bl  0 <__stack_chk_fail>

ARM64 -- user-defined `Rot2(const Rot2& r)`
00000000000001c0 <_ZN5gtsam4Rot25atan2Edd>:
 1c0:   a9bc7bfd    stp x29, x30, [sp, #-64]!
 1c4:   910003fd    mov x29, sp
 1c8:   a90153f3    stp x19, x20, [sp, #16]
 1cc:   90000013    adrp    x19, 0 <__stack_chk_guard>
 1d0:   aa0803f4    mov x20, x8
 1d4:   9100a3a0    add x0, x29, #0x28
 1d8:   f9400273    ldr x19, [x19]
 1dc:   6d0283a1    stp d1, d0, [x29, #40]
 1e0:   f9400261    ldr x1, [x19]
 1e4:   f9001fa1    str x1, [x29, #56]
 1e8:   d2800001    mov x1, #0x0                    // #0
 1ec:   94000000    bl  f0 <_ZN5gtsam4Rot29normalizeEv>
 1f0:   3dc00000    ldr q0, [x0]
 1f4:   f9401fa0    ldr x0, [x29, #56]
 1f8:   f9400261    ldr x1, [x19]
 1fc:   ca010001    eor x1, x0, x1
 200:   3d800280    str q0, [x20]
 204:   b50000a1    cbnz    x1, 218 <_ZN5gtsam4Rot25atan2Edd+0x58>
 208:   aa1403e0    mov x0, x20
 20c:   a94153f3    ldp x19, x20, [sp, #16]
 210:   a8c47bfd    ldp x29, x30, [sp], #64
 214:   d65f03c0    ret
 218:   94000000    bl  0 <__stack_chk_fail>
 21c:   d503201f    nop
jaelrod commented 3 years ago

https://github.com/borglab/gtsam/pull/806

(seems I am unable to create the link between issue & PR)

varunagrawal commented 2 years ago

@jaelrod you can link from the drop down in the right side. Closing since this has been resolved!

asasine commented 7 months ago

@jaelrod pointed out something interesting in their suggestion to update this line,

https://github.com/borglab/gtsam/blob/aec2cf06a5501840f0302344bca03c197522a335/gtsam/geometry/Rot2.h#L110

to:

Rot2 inverse() const { return fromCosSin(c_, -s_);}

The suggested form calls fromCosSin, which normalizes the value (the header indicates it shouldn't, created #1672 for that). The current form calls the Rot2(double, double) constructor, which does not perform normalization.

Given this comment:

I haven't 100% figured out why, but Rot2 inverse() in geometry/Rot2.h was returning an inverse with incorrect heading angle when inverting Pose2 as a part of factorization of BetweenFactor<Pose2>.

There might be some assumptions in BetweenFactor that rely on Rot2::inverse returning a normalized Rot2. Could that be the root cause for this?

I'm unsure how the introduction of a user-defined copy ctor made a difference here. My best guess is similar to jaelrod's: before, the implementation before the PR was performing URVO copy elision (mandatory as of C++17). Now, it no longer qualifies for copy elision per the spec (the return type of Rot2::normalize is Rot2&, which is not the same class type as the Rot2::fromCosSin return value), so the user-defined copy constructor is called.