borglab / gtsam

GTSAM is a library of C++ classes that implement smoothing and mapping (SAM) in robotics and vision, using factor graphs and Bayes networks as the underlying computing paradigm rather than sparse matrices.
http://gtsam.org
Other
2.63k stars 767 forks source link

gtsam doesn't build on i386 or armhf #1605

Open dkogan opened 1 year ago

dkogan commented 1 year ago

Hi. I pushed gtsam into Debian, and I see (among other things) that it doesn't build on armhf and i386. Can one of you please take a look? It'll take yall much less time to detangle this than it would take me.

Here's the i386 build log: https://buildd.debian.org/status/fetch.php?pkg=gtsam&arch=i386&ver=4.2%7E9%2Bdfsg-4&stamp=1692035308&raw=0

The punchline is that when building gtsam/linear/tests/testSparseEigen.cpp this happens:

In file included from ./gtsam/base/FastSet.h:28,
                 from ./gtsam/inference/Key.h:22,
                 from ./gtsam/inference/Ordering.h:23,
                 from ./gtsam/inference/EliminateableFactorGraph.h:26,
                 from ./gtsam/linear/GaussianFactorGraph.h:24,
                 from gtsam/linear/tests/testSparseEigen.cpp:21:
./gtsam/base/Testable.h: In instantiation of 'bool gtsam::assert_equal(const V&, const V&, double) [with V = int]':
gtsam/linear/tests/testSparseEigen.cpp:44:3:   required from here
./gtsam/base/Testable.h:99:26: error: incomplete type 'gtsam::traits<int>' used in nested name specifier
   99 |     if (traits<V>::Equals(actual,expected, tol))
      |         ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
./gtsam/base/Testable.h:102:21: error: incomplete type 'gtsam::traits<int>' used in nested name specifier
  102 |     traits<V>::Print(expected,"expected:\n");
      |     ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
./gtsam/base/Testable.h:103:21: error: incomplete type 'gtsam::traits<int>' used in nested name specifier
  103 |     traits<V>::Print(actual,"actual:\n");

It's reproducible standalone, without cmake like this:

< cmake/dllexport.h.in sed \
' s/#cmakedefine/#define/g;
  s/@library_name@/GTSAM/g;
' > gtsam/dllexport.h

< gtsam/config.h.in sed \
' s/#cmakedefine/#define/g;
  s/@GTSAM_VERSION_MAJOR@/4/g;
  s/@GTSAM_VERSION_MINOR@/2/g;
  s/@GTSAM_VERSION_PATCH@/0/g;
  s/@GTSAM_VERSION_NUMERIC@/40200/g;
  s/@GTSAM_VERSION_STRING@/4.2a9/g;
  s/@GTSAM_EIGEN_VERSION_WORLD@/3/g;
  s/@GTSAM_EIGEN_VERSION_MAJOR@/4/g;
  s/.*define .*_USE.*_MKL.*//g;
  s/.*define GTSAM_EIGEN_VERSION_MINOR.*//g;
  s/.*define GTSAM_ALLOCATOR_BOOSTPOOL.*//g;
  s/.*define GTSAM_ALLOCATOR_STL.*//g;
  s/.*define GTSAM_SLOW_BUT_CORRECT_BETWEENFACTOR.*//g;
  s/.*define GTSAM_USE_QUATERNIONS.*//g;
' > gtsam/config.h

g++ \
  -DBOOST_ALL_NO_LIB \
  -DBOOST_ATOMIC_DYN_LINK \
  -DBOOST_CHRONO_DYN_LINK \
  -DBOOST_DATE_TIME_DYN_LINK \
  -DBOOST_FILESYSTEM_DYN_LINK \
  -DBOOST_PROGRAM_OPTIONS_DYN_LINK \
  -DBOOST_REGEX_DYN_LINK \
  -DBOOST_SERIALIZATION_DYN_LINK \
  -DBOOST_SYSTEM_DYN_LINK \
  -DBOOST_THREAD_DYN_LINK \
  -DBOOST_TIMER_DYN_LINK \
  -DNDEBUG \
  -I"." \
  -I"CppUnitLite" \
  -isystem /usr/include/eigen3 \
  -fstack-protector-strong \
  -Wformat \
  -Werror=format-security \
  -Wno-deprecated-declarations \
  -Wdate-time \
  -D_FORTIFY_SOURCE=2 \
  -DNDEBUG \
  -Wall \
  -fPIC \
  -Wreturn-local-addr \
  -Werror=return-local-addr \
  -Wreturn-type \
  -Werror=return-type \
  -Wformat \
  -Werror=format-security \
  -Wsuggest-override \
  -Wno-unused-local-typedefs \
  -o /tmp/tst.o \
  -c "gtsam/linear/tests/testSparseEigen.cpp"

Thanks much

ProfFan commented 1 year ago

I need to spin up an armhf box first, thanks for reporting! Also, do you also build the Python wrapper? Currently I think they need to be built together.

dkogan commented 1 year ago

Hi. Thanks for replying

I need to spin up an armhf box first, thanks for reporting!

I see the same behavior on i686 and armhf, and since you probably have an amd64 box, you can more easily reproduce the i686 failure. If you have an amd64 install of debian, install the i686 cross-compiler:

sudo apt install g++-i686-linux-gnu

And then you can compile the offending file as noted in the bug report. But tell it to use the i686 compiler. I see this:

@.***:~/debianstuff/gtsam$ i686-linux-gnu-g++ \
-DBOOST_ALL_NO_LIB \
-DBOOST_ATOMIC_DYN_LINK \ -DBOOST_CHRONO_DYN_LINK \ -DBOOST_DATE_TIME_DYN_LINK \ -DBOOST_FILESYSTEM_DYN_LINK \ -DBOOST_PROGRAM_OPTIONS_DYN_LINK \ -DBOOST_REGEX_DYN_LINK \
-DBOOST_SERIALIZATION_DYN_LINK \ -DBOOST_SYSTEM_DYN_LINK \
-DBOOST_THREAD_DYN_LINK \
-DBOOST_TIMER_DYN_LINK \
-DNDEBUG \
-I"." \
-I"CppUnitLite" \
-isystem /usr/include/eigen3 \
-fstack-protector-strong \
-Wformat \
-Werror=format-security \
-Wno-deprecated-declarations \
-Wdate-time \
-D_FORTIFY_SOURCE=2 \ -DNDEBUG \ -Wall \ -fPIC \ -Wreturn-local-addr \ -Werror=return-local-addr \ -Wreturn-type \ -Werror=return-type \ -Wformat \ -Werror=format-security \ -Wsuggest-override \ -Wno-unused-local-typedefs \ -o /tmp/tst.o \ -c "gtsam/linear/tests/testSparseEigen.cpp" In file included from ./gtsam/base/FastSet.h:28, from ./gtsam/inference/Key.h:22, from ./gtsam/inference/Ordering.h:23, from ./gtsam/inference/EliminateableFactorGraph.h:26, from ./gtsam/linear/GaussianFactorGraph.h:24, from gtsam/linear/tests/testSparseEigen.cpp:21: ./gtsam/base/Testable.h: In instantiation of 'bool gtsam::assert_equal(const V&, const V&, double) [with V = int]': gtsam/linear/tests/testSparseEigen.cpp:44:3: required from here ./gtsam/base/Testable.h:99:26: error: incomplete type 'gtsam::traits' used in nested name specifier 99 | if (traits::Equals(actual,expected, tol)) | ~~~^~~~~~~~ ./gtsam/base/Testable.h:102:21: error: incomplete type 'gtsam::traits' used in nested name specifier 102 | traits::Print(expected,"expected:\n"); | ~~~~^~~~~~~~ ./gtsam/base/Testable.h:103:21: error: incomplete type 'gtsam::traits' used in nested name specifier 103 | traits::Print(actual,"actual:\n"); | ~~~~^~~~~~~~

Also, do you also build the Python wrapper? Currently I think they need to be built together.

I do, yes. This creates other problems, but I have workarounds. Once all the things build on all the arches, I'll submit more bug reports to hopefully ingest the fixes into your tree.

ProfFan commented 1 year ago

@dkogan I think you only need to change the EXPECT(assert_equal()) to EXPECT_LONGS_EQUAL()

dkogan commented 1 year ago

Thanks much. Appears to work. Doing another upload; let's see how it does...

dkogan commented 1 year ago

OK, this issue appears to be fixed, but now the tests fail on armhf: https://buildd.debian.org/status/fetch.php?pkg=gtsam&arch=armhf&ver=4.2%7E9%2Bdfsg-5&stamp=1692186176&raw=0

The other arches still have problems too: https://buildd.debian.org/status/package.php?p=gtsam

I'm going to have to come back to this later. If you want to work on some of these issues in the meantime, that would be great. The package tree is here: https://buildd.debian.org/status/package.php?p=gtsam

ProfFan commented 1 year ago

@dkogan I think we currently only support 64-bit platforms, is this supported on Debian?

dkogan commented 1 year ago

Fan Jiang @.***> writes:

@dkogan I think we currently only support 64-bit platforms, is this supported on Debian?

What is "support"? There's no reason gtsam should be arch-specific: it's a tool to solve a math problem.

The 10 arches on top of the buildd page are the ones that must be functional for this to be included in Debian:

https://buildd.debian.org/status/package.php?p=gtsam

The build issues are varied. armel needs wants to be linked with -latomic. I asked for that in the build, but apparently only some parts of the build respected that setting. This is almost certainly a bug in the gtsam build system:

https://buildd.debian.org/status/fetch.php?pkg=gtsam&arch=armel&ver=4.2%7E9%2Bdfsg-5&stamp=1692172353&raw=0

armhf has failing tests:

https://buildd.debian.org/status/fetch.php?pkg=gtsam&arch=armhf&ver=4.2%7E9%2Bdfsg-5&stamp=1692186176&raw=0

mips64el has some linker problem I haven't debugged yet:

https://buildd.debian.org/status/fetch.php?pkg=gtsam&arch=mips64el&ver=4.2%7E9%2Bdfsg-5&stamp=1692184823&raw=0

mipsel also wants -latomic and apparently the debug symbols are too large to fit into its ELF segments:

https://buildd.debian.org/status/fetch.php?pkg=gtsam&arch=mipsel&ver=4.2%7E9%2Bdfsg-5&stamp=1692192262&raw=0

And so on. None of these are inherent limitations in gtsam. It's all fixable, but will take some time. I'll get back to it eventually, but if you want to work on it in the meantime, you can.

dkogan commented 1 year ago

Hi. It looks like the failure mode on i686 and armhf is the same one: the build completes, but the tests fail: https://buildd.debian.org/status/fetch.php?pkg=gtsam&arch=i386&ver=4.2%7E9%2Bdfsg-5&stamp=1692266835&raw=0

Since you don't need any new hardware on i686, can I get you to look into that? You'll be far more effective at debugging this than I. Thanks.

ProfFan commented 1 year ago

@dkogan I can look at it this week

ProfFan commented 1 year ago

Tagging @jlblancoc for this is a Debian-related issue

jlblancoc commented 1 year ago

First, thanks a lot @dkogan for finally being brave enough to move this forward! :-)

All these errors are caused by assumptions and/or never-happened-yet situations about the different sizes of "int", "long", "size_t" in those "uncommon" archs, it's not related directly with Debian packaging or flags per se...

jlblancoc commented 1 year ago

PS: The comment above was for build errors.

For failing tests, I cannot see a clear reason from the buildd logs. In the past, reasons for tests to fail in other projects on "uncommon" archs have been:

It's still a long track ahead, but it's worth it! :+1:

dkogan commented 1 year ago

Hi. Yes. You could run valgrind or use the Debian build infrastructure and so on. But let's start with i686. This doesn't require anything that we all don't have already, and should be very simple to reproduce. I don't have the cycles currently. Can one of you please look at it?

Thanks.

ProfFan commented 1 year ago

I think I know what is the issue, it's because we expect Key to be the same as Eigen::Index which is 32 bit in some archs. I would say now we have a failure case for #1522

dkogan commented 1 year ago

Fan Jiang @.***> writes:

I think I know what is the issue, it's because we expect Key to be the same as Eigen::Index which is 32 bit in some archs. I would say now we have a failure case for #1522

That looks like a good candidate. Are the patches in that PR more or less done? Should I try applying them to the Debian builds?

ProfFan commented 1 year ago

Fan Jiang @.***> writes: I think I know what is the issue, it's because we expect Key to be the same as Eigen::Index which is 32 bit in some archs. I would say now we have a failure case for #1522 That looks like a good candidate. Are the patches in that PR more or less done? Should I try applying them to the Debian builds?

@dkogan Yes that would be a good test, let's apply and see what happens

dkogan commented 1 year ago

The patches in #1522 fixed most of the test failures but not all. I just tried on armhf, and there's one failure still:

 69/221 Test  #69: testSymbolicFactorGraph ............***Failed    0.03 sec
Not equal:
expected:
: cliques: 4, variables: 6
expected:
- P( e0 l0 b0)
expected:
| - P( s0 | b0 l0)
expected:
| - P( t0 | e0 l0)
expected:
| - P( x0 | e0)
actual:
: cliques: 4, variables: 6
actual:
- P( e0 l0 b0)
actual:
| - P( t0 | e0 l0)
actual:
| - P( x0 | e0)
actual:
| - P( s0 | b0 l0)
./gtsam/symbolic/tests/testSymbolicFactorGraph.cpp:93: Failure: "assert_equal(asiaBayesTree, actual2)" 
There were 1 failures

Yall should test this all yourselves I think. I would start with i686. I suggest using a Debian install on your native arch (presumably amd64), crossing to i686. This is trivial to set up, and works without emulation. Let me know if you need help.

dellaert commented 1 year ago

If you can root-cause it and suggest a fix/PR that I think would be the most helpful. Might be some cross-platform non-determinism that we have not yet encountered.

dkogan commented 1 year ago

This has a lot of gtsam-specific complexity, and you can debug and fix this much faster than I can. Building on i686 is trivial for you to do (as described above), and you should start there. Let me know if you need help setting this up.

ProfFan commented 1 year ago

This seems to be just an order issue in the comparison of Bayes tree, these two trees are equivalent. @dkogan I think you can mask this test in i386 first and we can work on a fix to the comparison

dkogan commented 1 year ago

Fan Jiang @.***> writes:

This seems to be just an order issue in the comparison of Bayes tree, these two trees are equivalent. @dkogan I think you can mask this test in i386 first and we can work on a fix to the comparison

OK. So to be clear I should:

Yes? I'm certain there will be other issues, but this should get us closer.

ProfFan commented 1 year ago

Yes :)

dkogan commented 1 year ago

Alright, I tried it. Build logs are here: https://buildd.debian.org/status/package.php?p=gtsam

Click on the red "Build-Attempted" to get the log for each particular arch. Notes:

The other arches either succeeded, or haven't been built yet.

Can somebody please take a look? Thanks

jlblancoc commented 1 year ago
ProfFan commented 1 year ago

Yep, seems like abuse of size_t when it should be uint64_t

dkogan commented 1 year ago

OK. Once some form of #1625 is merged, I'll pull that into my patch stack.

And now that all the builds are done, I can see that some other arches failed in different ways. I think I fixed what mips64el is complaining about. Can somebody please look at the test failures for hppa and x32:

ProfFan commented 1 year ago

@dkogan It's near conference deadline recently so no bandwidth for me, but after ICRA sub deadline I can look into the Python failures.

dkogan commented 1 year ago

Fan Jiang @.***> writes:

@dkogan It's near conference deadline recently so no bandwidth for me, but after ICRA sub deadline I can look into the Python failures.

I'm not in a rush at all. Is the ICRA deadline coming? I should do something too...

dkogan commented 1 year ago

1625 was just merged, so this round of i386 issues should be fixed. Thanks! The armel, armhf ones would be unaffected though.

berndpfrommer commented 10 months ago

Today our automated packaging broke because of the introduction of libcephes-gtsam.so into the development branch. As the name suggests, it's a private build of libcephes since no debian package exists. Didn't we get rid of a privately built libmetis because it would preclude gtsam from inclusion into debian?

dkogan commented 10 months ago

Hi. If you're including more dependencies you didn't write, it would be great if you had a clear switch in the code to avoid using them.

dkogan commented 10 months ago

Hi. I pushed updated the Debian package to the new 4.2.0 release. And the 32-bit architectures still fail tests. For i386: https://buildd.debian.org/status/fetch.php?pkg=gtsam&arch=i386&ver=4.2.0%2Bdfsg-1&stamp=1704797861&raw=0

The first failure:

======================================================================
ERROR: test_VisualISAMExample (testEssentialMatrixConstraint.TestVisualISAMExample.test_VisualISAMExample)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/obj-i686-linux-gnu/python/gtsam/tests/testEssentialMatrixConstraint.py", line 32, in test_VisualISAMExample
    factor = EssentialMatrixConstraint(poseKey1, poseKey2, E, model)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. gtsam.gtsam.EssentialMatrixConstraint(key1: int, key2: int, measuredE: gtsam.gtsam.EssentialMatrix, model: gtsam.gtsam.noiseModel.Base)

Invoked with: 8646911284551352321, 8646911284551352322, R:
 [
    0.969061, 0.218325, 0.115126;
    -0.196438, 0.964625, -0.175815;
    -0.149438, 0.14776, 0.977668
]
d: : 0.333333
-0.666667
 0.666667
, isotropic dim=5 sigma=0.25

This looks to me like it couldn't run the test code, not that it failed some threshold. Can one of you please take a look?

Thanks.

ProfFan commented 10 months ago

@varunagrawal Is libcephes a header-only library? If so maybe it does not need to be built at all. We can make all function inline and remove the lib.

@dkogan You are right, this error is potentially a wrong dynamic_cast. There is a recent issue #1685 about similar stuff. I can look into this error this week.

varunagrawal commented 10 months ago

@varunagrawal Is libcephes a header-only library? If so maybe it does not need to be built at all. We can make all function inline and remove the lib.

Unfortunately it isn't and I don't have the bandwidth to go deep into updating it to be header-only (especially since there is no test suite).

ProfFan commented 10 months ago

@varunagrawal

/**
 * @brief Compute the quantile function of the Chi-Squared distribution.
 *
 * The quantile function of the Chi-squared distribution is the quantile of
 * the specific (inverse) incomplete Gamma distribution.
 *
 * @param dofs Degrees of freedom
 * @param alpha Quantile value
 * @return double
 */
double chi_squared_quantile(const double dofs, const double alpha) {
  return 2 * igami(dofs / 2, alpha);
}

From a search this is the only place using cephes, is it correct?

varunagrawal commented 10 months ago

Yes, at least for now.

varunagrawal commented 10 months ago

I am not sure I understand what the issue with Cephes is here. @dkogan's latest comment seems to be referring to something else.

ProfFan commented 10 months ago

Just thinking about how can we make packaging easier :)

berndpfrommer commented 10 months ago

Sorry I think I somewhat hijacked this issue, which was about gtsam not building on i386/armhf. The point I'm raising about the introduction of libcephes is in so far relevant that it may go counter to @dkogan 's efforts to make gtsam an acceptable Debian package (which the present issue is also concerned with). It was my understanding that shipping binaries of other libraries along with gtsam is less than ideal, see the discussion related to libmetis. But again, sorry for taking this thread off-topic.

varunagrawal commented 10 months ago

Just thinking about how can we make packaging easier :)

The two seem orthogonal? The reasoning for including cephes is to enable functionality (specifically of the GNC optimizer), so sacrificing functionality for the sake of packaging seems to be akin to getting a poor gift but really expensive wrapping paper. There is a single function call but behind the scenes, multiple files and functions are required to make it work.

That being said, if there was an easier way to enable GNC without Cephes, I would do it. Unfortunately, re-implementing code provided by Cephes in a manner that keeps up with the quality of GTSAM is out of the question for me since I have been working on it by myself without any help for many months, and I have other things on my plate. Unless someone is willing to step up and help with development, it feels a bit disingenuous to ask for changes to make others lives easier while simultaneously making mine harder.