Closed by drpjm 6 days ago
Hi @drpjm is this from PyPI or compiled from main?
@ProfFan I tried building from source and also installing from PyPI.
Coming very late to this conversation. I did most of the book using python 3.9, and there all tests succeed. But I am seeing segfaults with 3.12. I will try 3.10 and then 3.11, and see whether I can track down the issue.
Python 3.10 works (at least all tests pass without segfault)
OK, repro with Python 3.11.9:
(py311) (gtbook) FranksVrdantMac:build dellaert$ make python-test
[ 17%] Built target cephes-gtsam
[ 32%] Built target metis-gtsam
[ 76%] Built target gtsam
[ 76%] Built target gtsam_unstable_header
[ 76%] Built target pybind_wrap_gtsam_unstable
[ 85%] Built target gtsam_unstable
[ 85%] Built target gtsam_unstable_py
[ 85%] Built target gtsam_header
[ 91%] Built target pybind_wrap_gtsam
[100%] Built target gtsam_py
Segmentation fault
@ProfFan @varunagrawal any ideas? Maybe we need to upgrade pybind?
Might need to run the thing within LLDB and see what is happening
Would you be willing to upgrade pybind and give it a try?
I forget exactly where to do it; please put me on the review so I can do it myself next time.
I upgraded pybind11 two months ago:
https://github.com/borglab/wrap/pull/166
I'll take a closer look later today.
My quick recommendation would be to try upgrading to numpy 2.0.0. IIRC it is backwards compatible with numpy v1, but the symptoms described suggest that numpy 2.0.0 is already being used and that the latest gtsam Python build is what's needed.
@drpjm can you please report your numpy version here? You can get it with pip show numpy
Cool, thanks @varunagrawal. Could you also tell me the PR where this version of wrap was then included in GTSAM? (Submodule or subtree? I forget.)
Here you go: https://github.com/borglab/gtsam/pull/1773
@drpjm I re-ran the current version of S24_sorter_perception.ipynb of the book with the latest version of GTSAM and I am unable to reproduce the issue.
You mention you are running this in a script. Can you please share the script?
Haven't heard back from @drpjm so I will close this for now since I can't reproduce this. If you're still having issues, please feel free to reopen.
@varunagrawal Been very busy and had to track down the code that segfaults. I can add you as a collaborator so you can try it out. I was using numpy 1.26.2 at the time the script was written; I just tested it again now and the segfault still occurs.
Wait, @varunagrawal - I have reproduced the segfaults with Python 3.11.9, so I'm re-opening.
I am running with numpy 2.0.1, still segfaults:
(py311) $ /Users/dellaert/mambaforge/envs/py311/bin/python /Users/dellaert/git/github/python/gtsam/tests/test_Factors.py
.Segmentation fault: 11
(py311) $ pip show numpy | grep Version
Version: 2.0.1
@ProfFan or @varunagrawal, with lldb I get the backtrace below, which is mildly useless: I get unnamed symbols even when compiling GTSAM with Debug. Is that flag propagated correctly to wrap?
(lldb) run
Process 53926 launched: '/Users/dellaert/mambaforge/envs/py311/bin/python' (arm64)
.Process 53926 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
frame #0: 0x0000000000000000
error: memory read failed for 0x0
Target 0: (python) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
* frame #0: 0x0000000000000000
frame #1: 0x0000000101b95ac8 gtsam.cpython-311-darwin.so`___lldb_unnamed_symbol11562 + 396
frame #2: 0x0000000101f2bb9c gtsam.cpython-311-darwin.so`___lldb_unnamed_symbol31238 + 128
frame #3: 0x000000010199c6ec gtsam.cpython-311-darwin.so`___lldb_unnamed_symbol1632 + 4756
frame #4: 0x00000001000b807c python`cfunction_call + 124
frame #5: 0x0000000100060788 python`_PyObject_MakeTpCall + 332
frame #6: 0x0000000100162930 python`_PyEval_EvalFrameDefault + 46152
frame #7: 0x0000000100166f0c python`_PyEval_Vector + 184
frame #8: 0x00000001000643a4 python`method_vectorcall + 520
frame #9: 0x0000000100164c60 python`_PyEval_EvalFrameDefault + 55160
frame #10: 0x0000000100166f0c python`_PyEval_Vector + 184
frame #11: 0x00000001000609f4 python`_PyObject_FastCallDictTstate + 320
frame #12: 0x0000000100061884 python`_PyObject_Call_Prepend + 176
frame #13: 0x00000001000db2c0 python`slot_tp_call + 172
frame #14: 0x0000000100060788 python`_PyObject_MakeTpCall + 332
frame #15: 0x0000000100162930 python`_PyEval_EvalFrameDefault + 46152
frame #16: 0x0000000100166f0c python`_PyEval_Vector + 184
frame #17: 0x00000001000643a4 python`method_vectorcall + 520
frame #18: 0x0000000100164c60 python`_PyEval_EvalFrameDefault + 55160
frame #19: 0x0000000100166f0c python`_PyEval_Vector + 184
frame #20: 0x00000001000609f4 python`_PyObject_FastCallDictTstate + 320
frame #21: 0x0000000100061884 python`_PyObject_Call_Prepend + 176
frame #22: 0x00000001000db2c0 python`slot_tp_call + 172
frame #23: 0x0000000100060788 python`_PyObject_MakeTpCall + 332
frame #24: 0x0000000100162930 python`_PyEval_EvalFrameDefault + 46152
frame #25: 0x0000000100166f0c python`_PyEval_Vector + 184
frame #26: 0x00000001000643a4 python`method_vectorcall + 520
frame #27: 0x0000000100164c60 python`_PyEval_EvalFrameDefault + 55160
frame #28: 0x0000000100166f0c python`_PyEval_Vector + 184
frame #29: 0x00000001000609f4 python`_PyObject_FastCallDictTstate + 320
frame #30: 0x0000000100061884 python`_PyObject_Call_Prepend + 176
frame #31: 0x00000001000db2c0 python`slot_tp_call + 172
frame #32: 0x0000000100060788 python`_PyObject_MakeTpCall + 332
frame #33: 0x0000000100162930 python`_PyEval_EvalFrameDefault + 46152
frame #34: 0x0000000100166f0c python`_PyEval_Vector + 184
frame #35: 0x00000001000609f4 python`_PyObject_FastCallDictTstate + 320
frame #36: 0x0000000100061884 python`_PyObject_Call_Prepend + 176
frame #37: 0x00000001000dc8d8 python`slot_tp_init + 196
frame #38: 0x00000001000d4ed0 python`type_call + 464
frame #39: 0x0000000100060788 python`_PyObject_MakeTpCall + 332
frame #40: 0x0000000100162930 python`_PyEval_EvalFrameDefault + 46152
frame #41: 0x0000000100156518 python`PyEval_EvalCode + 220
frame #42: 0x00000001001bc4fc python`run_mod + 144
frame #43: 0x00000001001bbf5c python`_PyRun_SimpleFileObject + 1260
frame #44: 0x00000001001bb01c python`_PyRun_AnyFileObject + 240
frame #45: 0x00000001001e1b30 python`Py_RunMain + 3100
frame #46: 0x00000001001e2988 python`pymain_main + 1252
frame #47: 0x0000000100003958 python`main + 56
frame #48: 0x000000018ee420e0 dyld`start + 2360
OK, after blasting away all my libraries, I have symbols:
test_Factors fails with this, immediately:
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
* frame #0: 0x0000000000000000
frame #1: 0x000000010347fcc4 gtsam.cpython-311-darwin.so`void boost::archive::detail::common_oarchive<boost::archive::binary_oarchive>::save_override<gtsam::PinholePose<gtsam::Cal3Fisheye> const>(gtsam::PinholePose<gtsam::Cal3Fisheye> const&) + 13944
frame #2: 0x0000000103e68a58 gtsam.cpython-311-darwin.so`gtsam::NonlinearISAM::getFactorsUnsafe() const + 919916
and test_Cal3Fisheye fails with this:
frame #1: 0x0000000103cefcc4 gtsam.cpython-311-darwin.so`void boost::archive::detail::common_oarchive<boost::archive::binary_oarchive>::save_override<gtsam::PinholePose<gtsam::Cal3Fisheye> const>(gtsam::PinholePose<gtsam::Cal3Fisheye> const&) + 13944
frame #2: 0x0000000103dab060 gtsam.cpython-311-darwin.so`gtsam::Pose2::wedge(double, double, double) + 13968
frame #3: 0x0000000103daadd4 gtsam.cpython-311-darwin.so`gtsam::Pose2::wedge(double, double, double) + 13316
frame #4: 0x0000000103daac5c gtsam.cpython-311-darwin.so`gtsam::Pose2::wedge(double, double, double) + 12940
frame #5: 0x0000000103daac14 gtsam.cpython-311-darwin.so`gtsam::Pose2::wedge(double, double, double) + 12868
frame #6: 0x00000001037bcba4 gtsam.cpython-311-darwin.so`pybind11::error_already_set::restore() + 56908
Both seem boost serialization related!
I set up a 3.11.9 environment on my M1 mac and I am again not able to repro. :( All tests pass here. Could it be the way boost is installed? Mine is via homebrew.
Let me see what I can do; from what I see, Python is from mambaforge, 3.11.
Can't reproduce the crash on develop. This is with boost 1.86 (Homebrew), Python 3.11 from conda-forge, and numpy 2.0.
However, the PyPI version does crash. @dellaert Did you reproduce the crash with develop?
Yeah, this is on develop, and boost 1.86 from brew, and now latest numpy. It could be an installation problem: sometimes I get symbols, sometimes I don’t. But 3.9 and 3.10 just work. Still, let me try and completely blast out my build folder - I do notice that “clean” does not clean everything.
Here is one possible issue. cmake says:
pybind11_DIR */opt/homebrew/share/cmake/pybind11
so it does not seem to pick up on the pybind included with wrap...
brew info pybind11
==> pybind11: stable 2.13.5 (bottled)
Should be up-to-date enough? What does this say on your computer?
==> pybind11: stable 2.13.5 (bottled)
But pybind11 is included in wrap, so the bigger issue is: why does our CMake not use that one? It should not pick up the brew one, right?
Interesting. Mine says
//Value Computed by CMake
pybind11_BINARY_DIR:STATIC=/Users/varunagrawal/borglab/gtsam/build/python/pybind11
//The directory containing a CMake configuration file for pybind11.
pybind11_DIR:PATH=pybind11_DIR-NOTFOUND
//Value Computed by CMake
pybind11_IS_TOP_LEVEL:STATIC=OFF
//Value Computed by CMake
pybind11_SOURCE_DIR:STATIC=/Users/varunagrawal/borglab/gtsam/wrap/pybind11
which possibly explains the issue.
I made a PR since this is easy to fix via CMake. @dellaert can you please try it out?
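The PR's actual diff isn't reproduced in this thread, so the following is only a sketch of the general idea, not the real fix: make the vendored pybind11 win before CMake has a chance to discover a Homebrew installation.

```cmake
# Sketch only (not the actual PR diff): register the pybind11 bundled in
# wrap/ before any find_package(pybind11) call runs, so CMake never falls
# back to /opt/homebrew/share/cmake/pybind11.
add_subdirectory(${PROJECT_SOURCE_DIR}/wrap/pybind11)
```

With `add_subdirectory`, the `pybind11::*` targets are defined in-tree, so a later `find_package(pybind11)` becomes unnecessary and the Homebrew config file is never consulted.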
I'll try. In the meantime I'm also trying to create an M1 CI run, to see whether the issue is reproducible on GitHub runners.
@drpjm that PR (#1812) fixed the segfaults on my system. Please check it out and/or close this issue?
Thanks @varunagrawal !
@dellaert Would I compile from source or install with pip?
Build from source. ps if you have a minimal repro script I’d love to try it.
I still wonder why this fixed the issue. pybind11 is header-only and these variables look totally legit to me... Also, I have the same config as @dellaert and cannot reproduce.
I had another pybind installed using brew and it picked up on that. When I ran make again with the changes, a lot of different flags appeared in the cmake settings as well, indicating it now hooked up to our version.
Closing as complete.
Description
I was running through the robotics text on performing MAP with multiple sensors, and when computing the unnormalized posterior from a DiscreteConditional likelihood I get a segfault. This is running in Python 3.11.6, macOS 14.3. macOS reports the following:
Steps to reproduce
I am running this in a Python script, not a Jupyter notebook. I have a conductivity sensor based on the DiscreteConditional in the robotics textbook in Chapter 2.4.4.
The segfault occurs when I run something similar to the example in Chapter 2.4.10.
Expected behavior
I would expect that the posterior is computed without crashing when multiplying out the likelihood factors and prior.
When I use a DecisionTreeFactor to represent a continuous sensor model, this crash does not occur. So it appears that there is a problem with the DiscreteConditional Python object when using the * operator. It looks like it happens for any combination of DiscreteConditional or DecisionTreeFactor.
Environment
Python 3.11.6, Mac OSX 14.3 with Apple silicon (M2)
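Since the actual script isn't attached, here is a gtsam-free numpy sketch of the computation being performed (the sensor numbers and the binary category are hypothetical); in gtsam, the crash happens at the multiplication step this mimics:

```python
import numpy as np

# gtsam-free sketch of the MAP computation (hypothetical numbers):
# unnormalized posterior = prior * likelihood of the observed measurement,
# one likelihood factor per sensor; then normalize and take the argmax.
prior = np.array([0.4, 0.6])             # P(category), 2 categories
conductivity_lik = np.array([0.9, 0.2])  # P(z | category) for the observed z
unnormalized = prior * conductivity_lik  # the step that segfaults in gtsam
posterior = unnormalized / unnormalized.sum()
map_estimate = int(np.argmax(posterior))  # MAP category
```

In gtsam this same product is expressed with the `*` operator on DiscreteConditional / DecisionTreeFactor objects, which is where the segfault is triggered.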