borglab / gtsam

GTSAM is a library of C++ classes that implement smoothing and mapping (SAM) in robotics and vision, using factor graphs and Bayes networks as the underlying computing paradigm rather than sparse matrices.
http://gtsam.org
Other
2.52k stars 751 forks source link

ISAM large graph serialization crash #1434

Open ScottMcMichael opened 1 year ago

ScottMcMichael commented 1 year ago

Description

Serialization of an ISAM2 solver object reliably fails after the graph reaches a certain size.

Steps to reproduce

Download data file here: https://drive.google.com/file/d/1bVyEQc3ogHmfuGTL9fHVptIqakV_15Vs/view?usp=sharing

Compile and run the new gtest added here:

https://github.com/ScottMcMichael/gtsam/blob/isam_serialization_bug/gtsam/nonlinear/tests/testSerializationNonlinear.cpp

Either put the data file in the execution directory or manually set the path to the data file in the serialization test.

(executable is testSerializationNonlinear)

Expected behavior

The test will produce the following output:

Write: .//gt0.bin Segmentation fault (core dumped)

Environment

Ubuntu 20.04.5, default CMake build configuration

Additional information

The crash appears to be caused by a stack overflow, either from an infinite recursion or a very expansive recursion that needs to be replaced with another kind of loop. According to the stack trace the two lines of GTSAM code that the program keeps bouncing between are:

542768 gtsam::ISAM2Clique::serialize (ar=..., this=0x7ffa7c001dd0) at .../install/gtsam/include/gtsam/nonlinear/ISAM2Clique.h:154

and

542739 gtsam::BayesTreeCliqueBase<gtsam::ISAM2Clique, gtsam::GaussianFactorGraph>::serialize (ar=..., this=0x7ffa7c001

dd0) at .../install/gtsam/include/gtsam/inference/BayesTreeCliqueBase.h:212

ProfFan commented 1 year ago

Thank you for the report! Is this reproducible with a simple graph that is just super long (like a very long chain of odometry)?

ScottMcMichael commented 1 year ago

It can also be reproduced by a long chain of BetweenFactors. I updated the test file to set that up without using the data file.

ScottMcMichael commented 1 year ago

I have a fix for this but it is much more complicated than the existing code: https://github.com/ScottMcMichael/gtsam/tree/fix_large_graph_serialization_stack_overflow

I have to replace the serialize code in BayesTree.h plus add some support code in the Clique classes. The new serialization code has to replicate some functionality from Boost but it is able to parse the trees non-recursively, eliminating the stack overflow when dealing with large trees. I can clean it up and create a PR but I would like someone to check my approach to make sure I am not overlooking something since I am not entirely sure how the cliques work.