jaelrod opened this issue 3 years ago
Somehow (I suspect it happens when the LM optimizer is initialized) the left operand of Rot2*Rot2 has uninitialized c_ and s_ during an invocation before the optimizer takes its first iteration. I think that when this uninitialized rotation gets inverted and multiplied during the optimization iterations, the cosine and/or sine of the rotation component of the pose estimates becomes extremely positive or negative, so that atan2() for the pose heading angle returns one of its limiting values, as documented on cppreference. I have not figured out where this uninitialized rotation comes from, but reproducing the problem repeatedly yields wildly different values of c_ and s_ for the rotation that shows up before the optimization, so that does appear to be what is happening.
Debug output from Rot2 before optimizer's first linearization:
Rot2::operator*(), c_: 3.03428333399e-86, s_: 1.14371993161e-71
Initial values
Values with 1 values:
Value 0: (gtsam::Pose2)
(10.0808679401, 9.99346051488, 0.428400412304)
Yes, I agree this could be uninitialized memory. Could you run the unit tests with the clang sanitizer enabled?
Despite having clang installed on my Jetson, it looks like GCC is the compiler in use both when building my entire project and when building only gtsam with make check.
Switched to clang and got:
clang: error: the clang compiler does not support '-march=native'
Is disabling that going to cause other issues on aarch64?
Interesting that -march=native is unsupported; let me try it on my Jetson Nano.
(Currently building it on my Jetson with GTSAM_BUILD_WITH_MARCH_NATIVE set to OFF. The build seems to be proceeding, but it takes long enough to complete that I thought I would mention it is currently underway over here with:
-- C compilation flags : -fsanitize=memory -O3 -DNDEBUG
-- C++ compilation flags : -fsanitize=memory -O3 -DNDEBUG
)
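For reference, a configure line matching the flags shown above might look like the following. This is a hypothetical sketch: the clang binary names, build directory, and generator are assumptions; only GTSAM_BUILD_WITH_MARCH_NATIVE and the sanitizer flags come from the discussion.

```shell
# Hypothetical configure for building gtsam with clang + MemorySanitizer
# on aarch64, with -march=native disabled (unsupported by this clang).
cmake .. \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DGTSAM_BUILD_WITH_MARCH_NATIVE=OFF \
  -DCMAKE_C_FLAGS="-fsanitize=memory" \
  -DCMAKE_CXX_FLAGS="-fsanitize=memory"
make check   # build and run the unit tests
```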
Maybe you can try cross-compiling on a PC. The Jetson CPU cores are very slow.
*also compiling on my Jetson Nano here
Any hints from running make testXXX.valgrind
on a native aarch64 system?
Things seem to have compiled successfully on my Jetson using clang-6 with -fsanitize=memory enabled, but I don't really see anything different when running the gtsam tests. Any luck on your side, @ProfFan ?
@jiblancoc thanks for the suggestion! I am still getting re-acquainted with C++, so I did not know about valgrind. Running make testNonlinearOptimizer.valgrind on amd64 looks pretty clean (some of the tests fail, but no memory errors are reported). Running it on aarch64 reports loads of issues that look like out-of-bounds accesses (both writes & reads), some of which seem fairly pertinent (listed in the order in which they appear):
==2475== Invalid write of size 8
==2475== at 0x4A5227C: gtsam::Pose2::equals(gtsam::Pose2 const&, double) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x4BB4507: gtsam::Values::equals(gtsam::Values const&, double) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x4342C7: bool gtsam::assert_equal<gtsam::Values>(gtsam::Values const&, gtsam::Values const&, double) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x42168B: NonlinearOptimizerFactorizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x444E43: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== Address 0x1ffefffa60 is on thread 1's stack
==2475== 48 bytes below stack pointer
==2475== Invalid write of size 8
==2475== at 0x4B64450: gtsam::VectorValues::equals(gtsam::VectorValues const&, double) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x43561B: bool gtsam::assert_equal<gtsam::VectorValues>(gtsam::VectorValues const&, gtsam::VectorValues const&, double) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x424633: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== Address 0x1ffeffef40 is on thread 1's stack
==2475== 48 bytes below stack pointer
==2475==
==2475== Conditional jump or move depends on uninitialised value(s)
==2475== at 0x4A3A18C: gtsam::equal_with_abs_tol(Eigen::Matrix<double, -1, 1, 0, -1, 1> const&, Eigen::Matrix<double, -1, 1, 0, -1, 1> const&, double) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x4B644C7: gtsam::VectorValues::equals(gtsam::VectorValues const&, double) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x43561B: bool gtsam::assert_equal<gtsam::VectorValues>(gtsam::VectorValues const&, gtsam::VectorValues const&, double) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x424633: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==
==2475== Conditional jump or move depends on uninitialised value(s)
==2475== at 0x4BB0D2C: gtsam::NonlinearOptimizer::defaultOptimize() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x424937: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==
==2475== Conditional jump or move depends on uninitialised value(s)
==2475== at 0x4BB0D2C: gtsam::NonlinearOptimizer::defaultOptimize() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x424937: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==
==2475== Conditional jump or move depends on uninitialised value(s)
==2475== at 0x4BB0D30: gtsam::NonlinearOptimizer::defaultOptimize() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x424937: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==
==2475== Conditional jump or move depends on uninitialised value(s)
==2475== at 0x4BB0D3C: gtsam::NonlinearOptimizer::defaultOptimize() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x424937: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==
==2475== Conditional jump or move depends on uninitialised value(s)
==2475== at 0x511ED5C: cos (s_sin.c:559)
==2475== by 0x4A52887: gtsam::Pose2::ChartAtOrigin::Retract(Eigen::Matrix<double, 3, 1, 0, 3, 1> const&, gtsam::OptionalJacobian<3, 3>) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x43C84F: gtsam::LieGroup<gtsam::Pose2, 3>::retract(Eigen::Matrix<double, 3, 1, 0, 3, 1> const&, gtsam::OptionalJacobian<3, 3>, gtsam::OptionalJacobian<3, 3>) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x43C39F: gtsam::GenericValue<gtsam::Pose2>::retract_(Eigen::Matrix<double, -1, 1, 0, -1, 1> const&) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4BB3E03: gtsam::Values::Values(gtsam::Values const&, gtsam::VectorValues const&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x4B6D3E3: gtsam::DoglegOptimizerImpl::IterationResult gtsam::DoglegOptimizerImpl::Iterate<gtsam::GaussianBayesTree, gtsam::NonlinearFactorGraph, gtsam::Values>(double, gtsam::DoglegOptimizerImpl::TrustRegionAdaptationMode, gtsam::VectorValues const&, gtsam::VectorValues const&, gtsam::GaussianBayesTree const&, gtsam::NonlinearFactorGraph const&, gtsam::Values const&, double, bool) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x4B6C193: gtsam::DoglegOptimizer::iterate() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x4BB0B9B: gtsam::NonlinearOptimizer::defaultOptimize() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x424937: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==
==2475== Conditional jump or move depends on uninitialised value(s)
==2475== at 0x511EDA8: cos (s_sin.c:562)
==2475== by 0x4A52887: gtsam::Pose2::ChartAtOrigin::Retract(Eigen::Matrix<double, 3, 1, 0, 3, 1> const&, gtsam::OptionalJacobian<3, 3>) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x43C84F: gtsam::LieGroup<gtsam::Pose2, 3>::retract(Eigen::Matrix<double, 3, 1, 0, 3, 1> const&, gtsam::OptionalJacobian<3, 3>, gtsam::OptionalJacobian<3, 3>) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x43C39F: gtsam::GenericValue<gtsam::Pose2>::retract_(Eigen::Matrix<double, -1, 1, 0, -1, 1> const&) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4BB3E03: gtsam::Values::Values(gtsam::Values const&, gtsam::VectorValues const&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x4B6D3E3: gtsam::DoglegOptimizerImpl::IterationResult gtsam::DoglegOptimizerImpl::Iterate<gtsam::GaussianBayesTree, gtsam::NonlinearFactorGraph, gtsam::Values>(double, gtsam::DoglegOptimizerImpl::TrustRegionAdaptationMode, gtsam::VectorValues const&, gtsam::VectorValues const&, gtsam::GaussianBayesTree const&, gtsam::NonlinearFactorGraph const&, gtsam::Values const&, double, bool) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x4B6C193: gtsam::DoglegOptimizer::iterate() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x4BB0B9B: gtsam::NonlinearOptimizer::defaultOptimize() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x424937: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475==
==2475== Use of uninitialised value of size 8
==2475== at 0x511EE28: do_cos (s_sin.c:165)
==2475== by 0x511EE28: cos (s_sin.c:564)
==2475== by 0x4A52887: gtsam::Pose2::ChartAtOrigin::Retract(Eigen::Matrix<double, 3, 1, 0, 3, 1> const&, gtsam::OptionalJacobian<3, 3>) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x43C84F: gtsam::LieGroup<gtsam::Pose2, 3>::retract(Eigen::Matrix<double, 3, 1, 0, 3, 1> const&, gtsam::OptionalJacobian<3, 3>, gtsam::OptionalJacobian<3, 3>) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x43C39F: gtsam::GenericValue<gtsam::Pose2>::retract_(Eigen::Matrix<double, -1, 1, 0, -1, 1> const&) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4BB3E03: gtsam::Values::Values(gtsam::Values const&, gtsam::VectorValues const&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x4B6D3E3: gtsam::DoglegOptimizerImpl::IterationResult gtsam::DoglegOptimizerImpl::Iterate<gtsam::GaussianBayesTree, gtsam::NonlinearFactorGraph, gtsam::Values>(double, gtsam::DoglegOptimizerImpl::TrustRegionAdaptationMode, gtsam::VectorValues const&, gtsam::VectorValues const&, gtsam::GaussianBayesTree const&, gtsam::NonlinearFactorGraph const&, gtsam::Values const&, double, bool) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x4B6C193: gtsam::DoglegOptimizer::iterate() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x4BB0B9B: gtsam::NonlinearOptimizer::defaultOptimize() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==2475== by 0x424937: NonlinearOptimizerMoreOptimizationTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4448FB: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
Even this very basic test, TEST( NonlinearOptimizer, paramsEquals ), experiences an out-of-bounds error:
==2475== Invalid write of size 8
==2475== at 0x41B85C: NonlinearOptimizerparamsEqualsTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x444E43: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== by 0x4330BF: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==2475== Address 0x1ffeffff40 is on thread 1's stack
==2475== 96 bytes below stack pointer
Thanks for the report! Could you also run valgrind with this: https://www.valgrind.org/docs/manual/mc-manual.html#opt.track-origins
BTW, are you using JetPack? I think JetPack is still on Ubuntu Bionic, whose Boost version has some bugs that prevent ubsan and asan from working. (I tried and failed on my Jetson Nano.)
Excellent to know about track-origins, thanks for highlighting that option! I will definitely have to try that out.
After much tedious debugging, I found that the functions returning by-value instances of Rot2 were the source of the garbage values I was seeing. The cosine & sine of the rotation looked correct during construction of the instances to be returned, but once the value was actually returned, those good values disappeared and the calling code was left with junk. Adding a trivial user-defined copy constructor, Rot2(const Rot2& rot) : c_(rot.c_), s_(rot.s_) {}, results in it being called when returning Rot2 by value from a function, and it actually resolves the uninitialized-values issue (without needing my cosine & sine normalization band-aid fix from above). The same test make testNonlinearOptimizer.valgrind drops from loads of uninitialized-memory errors to only the following two:
==23437== Invalid read of size 8
==23437== at 0x144C4C: gtsam::BetweenFactor<Eigen::Matrix<double, 2, 1, 0, 2, 1> >::evaluateError(Eigen::Matrix<double, 2, 1, 0, 2, 1> const&, Eigen::Matrix<double, 2, 1, 0, 2, 1> const&, boost::optional<Eigen::Matrix<double, -1, -1, 0, -1, -1>&>, boost::optional<Eigen::Matrix<double, -1, -1, 0, -1, -1>&>) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437== by 0x143B77: gtsam::NoiseModelFactor2<Eigen::Matrix<double, 2, 1, 0, 2, 1>, Eigen::Matrix<double, 2, 1, 0, 2, 1> >::unwhitenedError(gtsam::Values const&, boost::optional<std::vector<Eigen::Matrix<double, -1, -1, 0, -1, -1>, std::allocator<Eigen::Matrix<double, -1, -1, 0, -1, -1> > >&>) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437== by 0x4BB260F: gtsam::NoiseModelFactor::error(gtsam::Values const&) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437== by 0x4BB4C07: gtsam::NonlinearFactorGraph::error(gtsam::Values const&) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437== by 0x4B7B807: gtsam::GaussNewtonOptimizer::GaussNewtonOptimizer(gtsam::NonlinearFactorGraph const&, gtsam::Values const&, gtsam::GaussNewtonParams const&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437== by 0x13534F: NonlinearOptimizerPoint2LinearOptimizationWithHuberTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437== by 0x145FB7: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437== by 0x12A5B3: main (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437== Address 0x5656ce0 is 0 bytes after a block of size 16 alloc'd
==23437== at 0x4845BFC: malloc (in /usr/lib/valgrind/vgpreload_memcheck-arm64-linux.so)
==23437==
==23437== Invalid read of size 8
==23437== at 0x144C4C: gtsam::BetweenFactor<Eigen::Matrix<double, 2, 1, 0, 2, 1> >::evaluateError(Eigen::Matrix<double, 2, 1, 0, 2, 1> const&, Eigen::Matrix<double, 2, 1, 0, 2, 1> const&, boost::optional<Eigen::Matrix<double, -1, -1, 0, -1, -1>&>, boost::optional<Eigen::Matrix<double, -1, -1, 0, -1, -1>&>) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437== by 0x143BDB: gtsam::NoiseModelFactor2<Eigen::Matrix<double, 2, 1, 0, 2, 1>, Eigen::Matrix<double, 2, 1, 0, 2, 1> >::unwhitenedError(gtsam::Values const&, boost::optional<std::vector<Eigen::Matrix<double, -1, -1, 0, -1, -1>, std::allocator<Eigen::Matrix<double, -1, -1, 0, -1, -1> > >&>) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437== by 0x4BB2893: gtsam::NoiseModelFactor::linearize(gtsam::Values const&) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437== by 0x4BB420F: gtsam::(anonymous namespace)::_LinearizeOneFactor::operator()(tbb::blocked_range<unsigned long> const&) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437== by 0x4BB440B: tbb::interface9::internal::start_for<tbb::blocked_range<unsigned long>, gtsam::(anonymous namespace)::_LinearizeOneFactor, tbb::auto_partitioner const>::execute() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437== by 0x4DA0557: ??? (in /usr/lib/aarch64-linux-gnu/libtbb.so.2)
==23437== by 0x4D9DA9F: ??? (in /usr/lib/aarch64-linux-gnu/libtbb.so.2)
==23437== by 0x4BB573B: gtsam::NonlinearFactorGraph::linearize(gtsam::Values const&) const (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437== by 0x4B7A873: gtsam::GaussNewtonOptimizer::iterate() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437== by 0x4BB9C77: gtsam::NonlinearOptimizer::defaultOptimize() (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/gtsam/libgtsam.so.4.1.0)
==23437== by 0x135357: NonlinearOptimizerPoint2LinearOptimizationWithHuberTest::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437== by 0x145FB7: TestRegistry::run(TestResult&) (in /home/adsasine/code/Azure-Compute-Robotics/src/robot/common/gtsam/build/tests/testNonlinearOptimizer)
==23437== Address 0x5657840 is 0 bytes after a block of size 16 alloc'd
==23437== at 0x4845BFC: malloc (in /usr/lib/valgrind/vgpreload_memcheck-arm64-linux.so)
==23437==
As for why the compiler-generated copy constructor is not sufficient for the simple return-by-value functions inside of Rot2, I am not yet completely sure, and I am surprised that such a trivial constructor would be problematic. Because compiling with both gcc and clang results in the same misbehavior on aarch64, the root cause may be something in the ARM64 implementation. It seems like copy elision during return value optimization might be the culprit, and defining the copy constructor makes the type "non-trivial", avoiding such elision as described in https://developer.arm.com/documentation/ihi0059/latest , section GC++ABI 3.1.4 Return Values:
GC++ABI 3.1.4 Return Values
When a return value has a non-trivial copy constructor or destructor, the address of the caller-allocated temporary is passed in the Indirect Result Location Register (r8) rather than as an implicit first parameter before the this parameter and user parameters.
And yes, this is on JetPack; the Boost version is 1.65.1.
Interesting find! I think this deserves a more formal investigation, but could you put up a quick PR to fix this first? Thanks a lot!
@ProfFan I think a PR is premature. We first need to find the root cause, as per the discussion on the user group (please read up on it :-))
@dellaert, @ProfFan
I am not sure how much more there is to uncover here. I reverted all local changes except for the copy constructor I added, Rot2(const Rot2& r), and all of the unit tests on my end now pass on my Jetson (the ones described as having some BetweenFactor<Pose2> chains constrained by some custom derived unary factors along the way).
I am rebuilding everything from scratch to re-run make check on my Jetson and see whether the rest of the gtsam tests also pass. My plan is to capture the disassembly of a simple function returning a Rot2 by value (Rot2::atan2(double y, double x) looks promising), remove the user-defined copy constructor, recompile, and then compare the resulting assembly for the same function, in hopes of identifying how exactly the implicit compiler-generated copy constructor misbehaves for the return-by-value functions in Rot2. Ultimately, though, I think that misbehavior is not the responsibility of gtsam.
I will report back with the assembly instructions with & without the user-defined copy constructor for Rot2, but it probably makes sense to go ahead and send a PR for this and then just keep an eye out for similar issues on aarch64 (and other platforms).
Jetson is taking forever to build the rest of the tests as usual, but here is the resulting disassembly on amd64. It has been some time since I have worked with assembly, and I have never worked with the aarch64 instruction set, but you can see in the objdump from amd64 a hint that I am on the right track with respect to return value optimization: without the user-defined copy constructor, the function Rot2::atan2 does not push/pop the stack, yet when the user-defined constructor is present, it does push %rbx and then pop %rbx before ret, suggesting that copy elision is performed as part of return value optimization in the absence of a non-trivial copy constructor. My guess is that this copy elision happens on both platforms but is not handled correctly in the aarch64 implementation of the standard.
AMD64 -- implicit `Rot2(const Rot2& r)`
00000000000001f0 <_ZN5gtsam4Rot25atan2Edd>:
1f0: 48 83 ec 28 sub $0x28,%rsp
1f4: c5 f1 14 c0 vunpcklpd %xmm0,%xmm1,%xmm0
1f8: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
1ff: 00 00
201: 48 89 44 24 18 mov %rax,0x18(%rsp)
206: 31 c0 xor %eax,%eax
208: 48 89 e7 mov %rsp,%rdi
20b: c5 f8 29 04 24 vmovaps %xmm0,(%rsp)
210: e8 00 00 00 00 callq 215 <_ZN5gtsam4Rot25atan2Edd+0x25>
215: 48 8b 10 mov (%rax),%rdx
218: 48 8b 40 08 mov 0x8(%rax),%rax
21c: c4 e1 f9 6e c8 vmovq %rax,%xmm1
221: 48 8b 44 24 18 mov 0x18(%rsp),%rax
226: 64 48 33 04 25 28 00 xor %fs:0x28,%rax
22d: 00 00
22f: 75 0a jne 23b <_ZN5gtsam4Rot25atan2Edd+0x4b>
231: c4 e1 f9 6e c2 vmovq %rdx,%xmm0
236: 48 83 c4 28 add $0x28,%rsp
23a: c3 retq
23b: e8 00 00 00 00 callq 240 <_ZN5gtsam4Rot26ExpmapERKN5Eigen6MatrixIdLi1ELi1ELi0ELi1ELi1EEENS_16OptionalJacobianILi1ELi1EEE>
AMD64 -- user-defined `Rot2(const Rot2& r)`
0000000000000210 <_ZN5gtsam4Rot25atan2Edd>:
210: 53 push %rbx
211: c5 f1 14 c0 vunpcklpd %xmm0,%xmm1,%xmm0
215: 48 89 fb mov %rdi,%rbx
218: 48 83 ec 20 sub $0x20,%rsp
21c: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
223: 00 00
225: 48 89 44 24 18 mov %rax,0x18(%rsp)
22a: 31 c0 xor %eax,%eax
22c: 48 89 e7 mov %rsp,%rdi
22f: c5 f8 29 04 24 vmovaps %xmm0,(%rsp)
234: e8 00 00 00 00 callq 239 <_ZN5gtsam4Rot25atan2Edd+0x29>
239: 48 8b 54 24 18 mov 0x18(%rsp),%rdx
23e: 64 48 33 14 25 28 00 xor %fs:0x28,%rdx
245: 00 00
247: c5 f9 10 00 vmovupd (%rax),%xmm0
24b: c5 f8 11 03 vmovups %xmm0,(%rbx)
24f: 75 09 jne 25a <_ZN5gtsam4Rot25atan2Edd+0x4a>
251: 48 83 c4 20 add $0x20,%rsp
255: 48 89 d8 mov %rbx,%rax
258: 5b pop %rbx
259: c3 retq
25a: e8 00 00 00 00 callq 25f <_ZN5gtsam4Rot25atan2Edd+0x4f>
25f: 90 nop
100% of make check
passes on Jetson.
I also just re-ran make testNonlinearOptimizer.valgrind and it has nothing interesting to report. Something I was playing around with before finding the real issue must have been cached in a build output somewhere, producing those last two dangling memory errors.
Will get that additional disassembly for completeness and then send a PR.
wow, awesome!
@dellaert Thanks!
Below are the resulting variations of the aarch64 instructions for the same function. As I mentioned above, I am no expert in assembly, and certainly not in the aarch64 flavor, so anyone feel free to correct me, but the listings do seem to share a difference similar to the one pointed out between the amd64 variants, hinting at copy elision: when the implicit copy constructor is in use, the stp & str instructions push x29, x30, and x19 onto the stack, but when the user-defined copy constructor is present, that str becomes a stp, additionally pushing x20 onto the stack; the same can be seen in the epilogue, which pops with a second ldp instead of an ldr.
At this point, I don't think a deep dive into where the copy-eliding version goes wrong is warranted, given that tests pass in both gtsam and my own project.
ARM64 -- implicit `Rot2(const Rot2& r)`
00000000000001b8 <_ZN5gtsam4Rot25atan2Edd>:
1b8: a9bc7bfd stp x29, x30, [sp, #-64]!
1bc: 910003fd mov x29, sp
1c0: f9000bf3 str x19, [sp, #16]
1c4: 90000013 adrp x19, 0 <__stack_chk_guard>
1c8: 9100a3a0 add x0, x29, #0x28
1cc: f9400273 ldr x19, [x19]
1d0: 6d0283a1 stp d1, d0, [x29, #40]
1d4: f9400261 ldr x1, [x19]
1d8: f9001fa1 str x1, [x29, #56]
1dc: d2800001 mov x1, #0x0 // #0
1e0: 94000000 bl f0 <_ZN5gtsam4Rot29normalizeEv>
1e4: a9400001 ldp x1, x0, [x0]
1e8: 9e670020 fmov d0, x1
1ec: 9e670001 fmov d1, x0
1f0: f9401fa1 ldr x1, [x29, #56]
1f4: f9400260 ldr x0, [x19]
1f8: ca000020 eor x0, x1, x0
1fc: b5000080 cbnz x0, 20c <_ZN5gtsam4Rot25atan2Edd+0x54>
200: f9400bf3 ldr x19, [sp, #16]
204: a8c47bfd ldp x29, x30, [sp], #64
208: d65f03c0 ret
20c: 94000000 bl 0 <__stack_chk_fail>
ARM64 -- user-defined `Rot2(const Rot2& r)`
00000000000001c0 <_ZN5gtsam4Rot25atan2Edd>:
1c0: a9bc7bfd stp x29, x30, [sp, #-64]!
1c4: 910003fd mov x29, sp
1c8: a90153f3 stp x19, x20, [sp, #16]
1cc: 90000013 adrp x19, 0 <__stack_chk_guard>
1d0: aa0803f4 mov x20, x8
1d4: 9100a3a0 add x0, x29, #0x28
1d8: f9400273 ldr x19, [x19]
1dc: 6d0283a1 stp d1, d0, [x29, #40]
1e0: f9400261 ldr x1, [x19]
1e4: f9001fa1 str x1, [x29, #56]
1e8: d2800001 mov x1, #0x0 // #0
1ec: 94000000 bl f0 <_ZN5gtsam4Rot29normalizeEv>
1f0: 3dc00000 ldr q0, [x0]
1f4: f9401fa0 ldr x0, [x29, #56]
1f8: f9400261 ldr x1, [x19]
1fc: ca010001 eor x1, x0, x1
200: 3d800280 str q0, [x20]
204: b50000a1 cbnz x1, 218 <_ZN5gtsam4Rot25atan2Edd+0x58>
208: aa1403e0 mov x0, x20
20c: a94153f3 ldp x19, x20, [sp, #16]
210: a8c47bfd ldp x29, x30, [sp], #64
214: d65f03c0 ret
218: 94000000 bl 0 <__stack_chk_fail>
21c: d503201f nop
https://github.com/borglab/gtsam/pull/806
(seems I am unable to create the link between issue & PR)
@jaelrod you can link from the drop-down on the right side. Closing since this has been resolved!
@jaelrod pointed out something interesting in their suggestion to update this line,
to:
Rot2 inverse() const { return fromCosSin(c_, -s_);}
The suggested form calls fromCosSin, which normalizes the value (the header indicates it shouldn't; created #1672 for that). The current form calls the Rot2(double, double) constructor, which does not perform normalization.
Given this comment:
I haven't 100% figured out why, but Rot2 inverse() in geometry/Rot2.h was returning an inverse with an incorrect heading angle when inverting Pose2 as part of the factorization of BetweenFactor<Pose2>.
There might be some assumptions in BetweenFactor that rely on Rot2::inverse returning a normalized Rot2. Could that be the root cause for this?
I'm unsure how the introduction of a user-defined copy ctor made a difference here. My best guess is similar to jaelrod's: the implementation before the PR was performing URVO copy elision (mandatory as of C++17). Now it no longer qualifies for copy elision per the spec (the return type of Rot2::normalize is Rot2&, which is not the same class type as the Rot2::fromCosSin return value), so the user-defined copy constructor is called.
Description
I have some unit tests that chain together some BetweenFactor using an odometry mock, inject some slightly contradictory supplemental evidence in the form of other factors, such as a unary factor akin to the "GPS-like" factor from LocalizationExample.cpp, and finally use LevenbergMarquardtOptimizer to optimize() the factor graph and assert on the result. For example, a test might start with rather noisy mocked odometry moving forward at a consistent speed with a heading of pi/4, qualify the noisy estimates with some very-low-noise GPS-like factors that suggest movement is actually happening in the pi/2 direction rather than the pi/4 direction, then optimize the graph and make sure the optimized poses lie along the trajectory suggested by the more reliable GPS-like factors.
Such a test and several others work fine on my desktop (x86_64, Ubuntu 18.04) but fail on my Jetson Xavier NX (aarch64, Ubuntu 18.04).
Checking out the borglab gtsam source from GitHub and running make check to invoke the unit tests on the aarch64 device results in many failures, which I expect are related to my own; the following example, I think, is similar to one of the more basic examples included in gtsam. I am in the process of debugging the gtsam source through verbose optimizer output and will provide updates as I make progress, but I am suspicious of Eigen aliasing or of the auto keyword and type resolution on aarch64. One thing I have observed so far only on aarch64 is:
Another thing I have noticed on aarch64 is that the initial total error of the same system seems much lower than that on amd64, possibly resulting in fewer or no optimization iterations.
Update 1: It seems the calculation of the error of BetweenFactor<Pose2> is indeed incorrect on aarch64.
Debug output from amd64 of BetweenFactor::evaluateError for the very first error calculation based on the estimates:
Debug output from aarch64 of BetweenFactor::evaluateError for almost exactly the same poses:
I have been playing around inside of BetweenFactor and LieGroup but haven't had much luck so far.
Update 2: I haven't 100% figured out why, but Rot2 inverse() in geometry/Rot2.h was returning an inverse with an incorrect heading angle when inverting Pose2 as part of the factorization of BetweenFactor<Pose2>.
Changing this line, https://github.com/borglab/gtsam/blob/aec2cf06a5501840f0302344bca03c197522a335/gtsam/geometry/Rot2.h#L110 , to read the following fixes at least that issue (it seems there are more still to find):
Rot2 inverse() const { return fromCosSin(c_, -s_); }
Like I said, I am not yet sure why this makes a difference, as all fromCosSin appears to be doing is normalizing c & s, and those should have already been normalized...
Update 3: Also, Rot2::operator*(const Rot2& R) does not behave as expected on aarch64. The following seems to relieve the problem: return fromAngle(theta() + R.theta());
Although, again, I am not sure why this works, as using the trig identities for cos(th1+th2) and sin(th1+th2), as is done on the master branch, should be identical in resulting behavior...
Steps to reproduce
BetweenFactor<Pose2>
and some initial estimates
Expected behavior
Factor graph optimizes in the same way it does on amd64.
Environment
Jetson Xavier NX, Ubuntu 18.04