borglab / gtsam

GTSAM is a library of C++ classes that implement smoothing and mapping (SAM) in robotics and vision, using factor graphs and Bayes networks as the underlying computing paradigm rather than sparse matrices.
http://gtsam.org

Python Segfault on MacOS #1720

Closed drpjm closed 6 days ago

drpjm commented 7 months ago

Description

I was working through the robotics textbook section on performing MAP estimation with multiple sensors. When computing the unnormalized posterior from a DiscreteConditional likelihood, I get a segfault.

This is running in Python 3.11.6, Mac OSX 14.3. macOS reports the following:


Thread 0 Crashed::  Dispatch queue: com.apple.main-thread
0   libgtsam.4.2.0.dylib                   0x106085dd8 gtsam::DecisionTree<unsigned long long, double>::Choice::choose(unsigned long long const&, unsigned long) const + 96
1   libgtsam.4.2.0.dylib                   0x10609f210 gtsam::DiscreteConditional::likelihood(gtsam::DiscreteValues const&) const + 220
2   libgtsam.4.2.0.dylib                   0x10609f6d4 gtsam::DiscreteConditional::likelihood(unsigned long) const + 136
3   gtsam.cpython-311-darwin.so            0x106accfa8 0x106980000 + 1363880
4   gtsam.cpython-311-darwin.so            0x106996400 0x106980000 + 91136
5   Python                                 0x1056425b8 cfunction_call + 60
6   Python                                 0x1055f78f4 _PyObject_MakeTpCall + 128
7   Python                                 0x1056d5984 _PyEval_EvalFrameDefault + 42108
8   Python                                 0x1056ca8c4 PyEval_EvalCode + 168
9   Python                                 0x1057213f0 run_eval_code_obj + 84
10  Python                                 0x105721354 run_mod + 112
11  Python                                 0x105721194 pyrun_file + 148
12  Python                                 0x105720be4 _PyRun_SimpleFileObject + 268
13  Python                                 0x10572057c _PyRun_AnyFileObject + 216
14  Python                                 0x10573d164 pymain_run_file_obj + 220
15  Python                                 0x10573caa4 pymain_run_file + 72
16  Python                                 0x10573c384 Py_RunMain + 704
17  Python                                 0x10573d4c0 Py_BytesMain + 40
18  dyld                                   0x1864f10e0 start + 2360

Steps to reproduce

I am running this in a Python script, not a Jupyter notebook. I have a conductivity sensor based on the DiscreteConditional in the robotics textbook in Chapter 2.4.4.

The segfault occurs when I run something similar to the example in Chapter 2.4.10.

posterior = conductivity_factor * detector_factor * weight_factor * category_prior

Expected behavior

I would expect the posterior to be computed without crashing when multiplying the likelihood factors and the prior. When I use a DecisionTreeFactor to represent a continuous sensor model, this crash does not occur, so there appears to be a problem with the DiscreteConditional Python object when using the * operator. It looks like it happens for any combination of DiscreteConditional and DecisionTreeFactor.
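
For readers unfamiliar with the computation being described, the following is a minimal plain-Python sketch of what "multiplying out the likelihood factors and prior" means. It deliberately does not use GTSAM, so it cannot reproduce the crash. The category names, the two likelihood tables, and every number are hypothetical placeholders, as are the helper names `unnormalized_posterior` and `normalize`.

```python
# Plain-Python sketch of an unnormalized posterior. This is NOT the GTSAM API.
# All category names and numbers below are hypothetical placeholders.
from math import prod

CATEGORIES = ["cardboard", "paper", "can", "scrap metal", "bottle"]


def unnormalized_posterior(prior, likelihoods):
    """Multiply the prior by every likelihood table, category by category."""
    return {c: prior[c] * prod(lik[c] for lik in likelihoods) for c in prior}


def normalize(table):
    """Rescale a table of non-negative weights so that it sums to one."""
    total = sum(table.values())
    if total == 0:
        raise ValueError("all weights are zero; cannot normalize")
    return {c: v / total for c, v in table.items()}


# Hypothetical prior P(category) and per-sensor likelihoods L(category; z).
prior = dict(zip(CATEGORIES, [0.2, 0.3, 0.25, 0.2, 0.05]))
conductivity_likelihood = dict(zip(CATEGORIES, [0.99, 0.99, 0.1, 0.15, 0.95]))
detector_likelihood = dict(zip(CATEGORIES, [0.02, 0.02, 0.33, 0.33, 0.95]))

posterior = normalize(
    unnormalized_posterior(prior, [conductivity_likelihood, detector_likelihood])
)
map_category = max(posterior, key=posterior.get)  # MAP estimate
```

In GTSAM the same product is expressed with the `*` operator on factor objects, which is the call that segfaults in this report.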

Environment

Python 3.11.6, Mac OSX 14.3 with Apple silicon (M2)

ProfFan commented 5 months ago

Hi @drpjm is this from PyPI or compiled from main?

drpjm commented 4 months ago

@ProfFan I tried to build from source and also use PyPI.

dellaert commented 3 weeks ago

Coming very late to this conversation. I did most of the book using Python 3.9, and there all tests succeed. But I am seeing segfaults with 3.12. I will try 3.10 and then 3.11 to see whether I can track down the issue.

dellaert commented 3 weeks ago

Python 3.10 works (at least all tests pass without segfault)

dellaert commented 3 weeks ago

OK, repro with Python 3.11.9:

(py311) (gtbook) FranksVrdantMac:build dellaert$ make python-test
[ 17%] Built target cephes-gtsam
[ 32%] Built target metis-gtsam
[ 76%] Built target gtsam
[ 76%] Built target gtsam_unstable_header
[ 76%] Built target pybind_wrap_gtsam_unstable
[ 85%] Built target gtsam_unstable
[ 85%] Built target gtsam_unstable_py
[ 85%] Built target gtsam_header
[ 91%] Built target pybind_wrap_gtsam
[100%] Built target gtsam_py
Segmentation fault

dellaert commented 3 weeks ago

@ProfFan @varunagrawal any ideas? Maybe we need to upgrade pybind?

ProfFan commented 3 weeks ago

Might need to run the thing within LLDB and see what is happening
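
As a lighter-weight complement to LLDB, CPython's standard `faulthandler` module can dump the Python-level traceback when the interpreter receives a fatal signal such as SIGSEGV. The sketch below is an assumption about how one might wire it into a repro script; the helper name `enable_crash_traces` and the log path are hypothetical. It shows which Python line triggered the crash, not the C++ frames, so it narrows the search rather than replacing a debugger.

```python
import faulthandler


def enable_crash_traces(log_path):
    """Dump Python tracebacks of all threads to log_path on a fatal signal.

    The returned file object must stay open for as long as the handler is
    enabled, because faulthandler writes to its file descriptor at crash time.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

A script would call `enable_crash_traces("crash.log")` before importing the extension module under suspicion. Running `python -X faulthandler script.py` enables the same handler with output on stderr and needs no code change.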

dellaert commented 3 weeks ago

Would you be willing to upgrade pybind and give it a try?

dellaert commented 3 weeks ago

I forget exactly where to do it. Please put me on the review so I can do it next time.

varunagrawal commented 3 weeks ago

I had upgraded Pybind11 2 months ago

https://github.com/borglab/wrap/pull/166

I'll take a closer look later today.

varunagrawal commented 3 weeks ago

My quick recommendation would be to try upgrading to numpy 2.0.0. IIRC there is backwards compatibility with numpy v1, but the symptoms described suggest that numpy 2.0.0 may already be in use, in which case it's the latest gtsam Python build that needs to be used.

@drpjm can you please report your numpy version here? You can get it with pip show numpy
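
Since the thread keeps coming back to which interpreter and package versions are in play, a small script that collects them in one place may be handy when reporting. This is a sketch using only the standard library; the helper name `report_versions` is hypothetical, and it assumes the installed distributions are registered under the names `numpy` and `gtsam`, which may not hold for every source build.

```python
import platform
from importlib import metadata


def report_versions(packages=("numpy", "gtsam")):
    """Collect interpreter, platform, and installed package versions."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not registered in this environment.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in report_versions().items():
        print(f"{key}: {value}")
```

Because it reads package metadata rather than importing the modules, it still works in an environment where `import gtsam` itself crashes.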

dellaert commented 3 weeks ago

Cool, thanks @varunagrawal . could you also tell me the PR where this version of wrap was then included into GTSAM? (Submodule or subtree? I forget)

varunagrawal commented 3 weeks ago

Here you go: https://github.com/borglab/gtsam/pull/1773

varunagrawal commented 3 weeks ago

@drpjm I re-ran the current version of S24_sorter_perception.ipynb of the book with the latest version of GTSAM and I am unable to reproduce the issue. You mention you are running this in a script. Can you please share the script?

varunagrawal commented 3 weeks ago

Haven't heard back from @drpjm so I will close this for now since I can't reproduce this. If you're still having issues, please feel free to reopen.

drpjm commented 3 weeks ago

@varunagrawal I've been very busy and had to track down the code that segfaults. I can add you as a collaborator to try it out. I was using numpy 1.26.2 when the script was written; I just tested it again now and it still produces a segfault.

dellaert commented 2 weeks ago

Wait, @varunagrawal - I have reproduced the segfaults with Python 3.11.9, so I'm re-opening.

dellaert commented 2 weeks ago

I am running with numpy 2.0.1, still segfaults:

(py311) $ /Users/dellaert/mambaforge/envs/py311/bin/python /Users/dellaert/git/github/python/gtsam/tests/test_Factors.py
.Segmentation fault: 11
(py311) $ pip show numpy | grep Version
Version: 2.0.1

dellaert commented 2 weeks ago

@ProfFan or @varunagrawal, with lldb I get the output below, which is mildly useless. I get unnamed symbols even when compiling GTSAM with Debug. Is that flag propagated correctly to wrap?

(lldb) run
Process 53926 launched: '/Users/dellaert/mambaforge/envs/py311/bin/python' (arm64)
.Process 53926 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000000000000
error: memory read failed for 0x0
Target 0: (python) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000000000000
    frame #1: 0x0000000101b95ac8 gtsam.cpython-311-darwin.so`___lldb_unnamed_symbol11562 + 396
    frame #2: 0x0000000101f2bb9c gtsam.cpython-311-darwin.so`___lldb_unnamed_symbol31238 + 128
    frame #3: 0x000000010199c6ec gtsam.cpython-311-darwin.so`___lldb_unnamed_symbol1632 + 4756
    frame #4: 0x00000001000b807c python`cfunction_call + 124
    frame #5: 0x0000000100060788 python`_PyObject_MakeTpCall + 332
    frame #6: 0x0000000100162930 python`_PyEval_EvalFrameDefault + 46152
    frame #7: 0x0000000100166f0c python`_PyEval_Vector + 184
    frame #8: 0x00000001000643a4 python`method_vectorcall + 520
    frame #9: 0x0000000100164c60 python`_PyEval_EvalFrameDefault + 55160
    frame #10: 0x0000000100166f0c python`_PyEval_Vector + 184
    frame #11: 0x00000001000609f4 python`_PyObject_FastCallDictTstate + 320
    frame #12: 0x0000000100061884 python`_PyObject_Call_Prepend + 176
    frame #13: 0x00000001000db2c0 python`slot_tp_call + 172
    frame #14: 0x0000000100060788 python`_PyObject_MakeTpCall + 332
    frame #15: 0x0000000100162930 python`_PyEval_EvalFrameDefault + 46152
    frame #16: 0x0000000100166f0c python`_PyEval_Vector + 184
    frame #17: 0x00000001000643a4 python`method_vectorcall + 520
    frame #18: 0x0000000100164c60 python`_PyEval_EvalFrameDefault + 55160
    frame #19: 0x0000000100166f0c python`_PyEval_Vector + 184
    frame #20: 0x00000001000609f4 python`_PyObject_FastCallDictTstate + 320
    frame #21: 0x0000000100061884 python`_PyObject_Call_Prepend + 176
    frame #22: 0x00000001000db2c0 python`slot_tp_call + 172
    frame #23: 0x0000000100060788 python`_PyObject_MakeTpCall + 332
    frame #24: 0x0000000100162930 python`_PyEval_EvalFrameDefault + 46152
    frame #25: 0x0000000100166f0c python`_PyEval_Vector + 184
    frame #26: 0x00000001000643a4 python`method_vectorcall + 520
    frame #27: 0x0000000100164c60 python`_PyEval_EvalFrameDefault + 55160
    frame #28: 0x0000000100166f0c python`_PyEval_Vector + 184
    frame #29: 0x00000001000609f4 python`_PyObject_FastCallDictTstate + 320
    frame #30: 0x0000000100061884 python`_PyObject_Call_Prepend + 176
    frame #31: 0x00000001000db2c0 python`slot_tp_call + 172
    frame #32: 0x0000000100060788 python`_PyObject_MakeTpCall + 332
    frame #33: 0x0000000100162930 python`_PyEval_EvalFrameDefault + 46152
    frame #34: 0x0000000100166f0c python`_PyEval_Vector + 184
    frame #35: 0x00000001000609f4 python`_PyObject_FastCallDictTstate + 320
    frame #36: 0x0000000100061884 python`_PyObject_Call_Prepend + 176
    frame #37: 0x00000001000dc8d8 python`slot_tp_init + 196
    frame #38: 0x00000001000d4ed0 python`type_call + 464
    frame #39: 0x0000000100060788 python`_PyObject_MakeTpCall + 332
    frame #40: 0x0000000100162930 python`_PyEval_EvalFrameDefault + 46152
    frame #41: 0x0000000100156518 python`PyEval_EvalCode + 220
    frame #42: 0x00000001001bc4fc python`run_mod + 144
    frame #43: 0x00000001001bbf5c python`_PyRun_SimpleFileObject + 1260
    frame #44: 0x00000001001bb01c python`_PyRun_AnyFileObject + 240
    frame #45: 0x00000001001e1b30 python`Py_RunMain + 3100
    frame #46: 0x00000001001e2988 python`pymain_main + 1252
    frame #47: 0x0000000100003958 python`main + 56
    frame #48: 0x000000018ee420e0 dyld`start + 2360

dellaert commented 2 weeks ago

OK, after blasting away all my libraries, I have symbols:

test_Factors fails with this, immediately:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000000000000
    frame #1: 0x000000010347fcc4 gtsam.cpython-311-darwin.so`void boost::archive::detail::common_oarchive<boost::archive::binary_oarchive>::save_override<gtsam::PinholePose<gtsam::Cal3Fisheye> const>(gtsam::PinholePose<gtsam::Cal3Fisheye> const&) + 13944
    frame #2: 0x0000000103e68a58 gtsam.cpython-311-darwin.so`gtsam::NonlinearISAM::getFactorsUnsafe() const + 919916

and test_Cal3Fisheye fails with this:

    frame #1: 0x0000000103cefcc4 gtsam.cpython-311-darwin.so`void boost::archive::detail::common_oarchive<boost::archive::binary_oarchive>::save_override<gtsam::PinholePose<gtsam::Cal3Fisheye> const>(gtsam::PinholePose<gtsam::Cal3Fisheye> const&) + 13944
    frame #2: 0x0000000103dab060 gtsam.cpython-311-darwin.so`gtsam::Pose2::wedge(double, double, double) + 13968
    frame #3: 0x0000000103daadd4 gtsam.cpython-311-darwin.so`gtsam::Pose2::wedge(double, double, double) + 13316
    frame #4: 0x0000000103daac5c gtsam.cpython-311-darwin.so`gtsam::Pose2::wedge(double, double, double) + 12940
    frame #5: 0x0000000103daac14 gtsam.cpython-311-darwin.so`gtsam::Pose2::wedge(double, double, double) + 12868
    frame #6: 0x00000001037bcba4 gtsam.cpython-311-darwin.so`pybind11::error_already_set::restore() + 56908

Both seem to be related to Boost serialization!

varunagrawal commented 2 weeks ago

I set up a 3.11.9 environment on my M1 mac and I am again not able to repro. :( All tests pass here. Could it be the way boost is installed? Mine is via homebrew.

ProfFan commented 2 weeks ago

Let me see what I can do, from what I see Python is from mambaforge, 3.11.

ProfFan commented 2 weeks ago

Can't reproduce the crash on develop. This is with boost 1.86 (Homebrew), Python 3.11 on conda-forge and numpy 2.0.

However the PyPI version does crash. @dellaert Did you reproduce the crash with develop?

dellaert commented 2 weeks ago

Yeah, this is on develop, and boost 1.86 from brew, and now latest numpy. It could be an installation problem: sometimes I get symbols, sometimes I don’t. But 3.9 and 3.10 just work. Still, let me try and completely blast out my build folder - I do notice that “clean” does not clean everything.

dellaert commented 2 weeks ago

Here is one possible issue. cmake says:

 pybind11_DIR                    */opt/homebrew/share/cmake/pybind11

so it does not seem to pick up on the pybind included with wrap...

ProfFan commented 2 weeks ago

Here is one possible issue. cmake says:

 pybind11_DIR                    */opt/homebrew/share/cmake/pybind11

so it does not seem to pick up on the pybind included with wrap...

brew info pybind11
==> pybind11: stable 2.13.5 (bottled)

Should be up-to-date enough? What does this say on your computer?

dellaert commented 2 weeks ago

==> pybind11: stable 2.13.5 (bottled)

But pybind11 is included in wrap, so the bigger issue is: why does our CMake not use that one? It should not pick up the brew one, right?

varunagrawal commented 2 weeks ago

Interesting. Mine says

//Value Computed by CMake
pybind11_BINARY_DIR:STATIC=/Users/varunagrawal/borglab/gtsam/build/python/pybind11

//The directory containing a CMake configuration file for pybind11.
pybind11_DIR:PATH=pybind11_DIR-NOTFOUND

//Value Computed by CMake
pybind11_IS_TOP_LEVEL:STATIC=OFF

//Value Computed by CMake
pybind11_SOURCE_DIR:STATIC=/Users/varunagrawal/borglab/gtsam/wrap/pybind11

which possibly explains the issue.
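
The comparison above can be automated by reading the `pybind11_*` entries out of a build directory's `CMakeCache.txt`. The sketch below is one way to do that under stated assumptions: the helper names are hypothetical, and the rule that a `pybind11_DIR` outside the source tree indicates an external pybind11 is a heuristic inferred from the two caches quoted in this thread, not a documented GTSAM check.

```python
from pathlib import Path


def read_cache_entries(cache_text, prefix="pybind11_"):
    """Parse NAME:TYPE=VALUE lines from CMakeCache.txt text into a dict."""
    entries = {}
    for raw in cache_text.splitlines():
        line = raw.strip()
        # Skip blank lines and the '//' and '#' comment lines CMake writes.
        if not line or line.startswith(("//", "#")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        name = key.split(":", 1)[0]
        if name.startswith(prefix):
            entries[name] = value
    return entries


def uses_external_pybind11(entries, source_root):
    """Heuristic: pybind11_DIR is set and points outside the source tree."""
    config_dir = entries.get("pybind11_DIR", "")
    if not config_dir or config_dir.endswith("-NOTFOUND"):
        return False
    # Path.is_relative_to requires Python 3.9 or newer.
    return not Path(config_dir).is_relative_to(source_root)
```

Typical use would be `read_cache_entries(Path("build/CMakeCache.txt").read_text())`, followed by `uses_external_pybind11(entries, "/path/to/gtsam")`, where both paths are placeholders for the local checkout.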

varunagrawal commented 2 weeks ago

I made a PR since this is easy to fix via CMake. @dellaert can you please try it out?

dellaert commented 2 weeks ago

I'll try. In the meantime I'm also trying to create an M1 CI run, to see if the issue is reproducible on github runners

dellaert commented 2 weeks ago

@drpjm that PR #1812 fixed segfaults on my system. Please check it out and/or close this issue?

dellaert commented 2 weeks ago

Thanks @varunagrawal !

drpjm commented 2 weeks ago

@dellaert Would I compile from source or install with pip?

dellaert commented 2 weeks ago

Build from source. PS: if you have a minimal repro script, I’d love to try it.

ProfFan commented 2 weeks ago

Interesting. Mine says

//Value Computed by CMake
pybind11_BINARY_DIR:STATIC=/Users/varunagrawal/borglab/gtsam/build/python/pybind11

//The directory containing a CMake configuration file for pybind11.
pybind11_DIR:PATH=pybind11_DIR-NOTFOUND

//Value Computed by CMake
pybind11_IS_TOP_LEVEL:STATIC=OFF

//Value Computed by CMake
pybind11_SOURCE_DIR:STATIC=/Users/varunagrawal/borglab/gtsam/wrap/pybind11

which possibly explains the issue.

I still wonder why this fixed the issue. pybind11 is header-only and these variables look totally legit to me... Also, I have the same config as @dellaert and cannot reproduce.

dellaert commented 2 weeks ago

I had another pybind installed using brew, and CMake picked that one up. When I ran make again with the changes, a lot of different flags appeared in the cmake settings as well, indicating it now hooked up to our version.

varunagrawal commented 6 days ago

Closing as complete.