coin-or / python-mip

Python-MIP: collection of Python tools for the modeling and solution of Mixed-Integer Linear programs
Eclipse Public License 2.0

More verbose errors? #357

Open BramVanroy opened 1 year ago

BramVanroy commented 1 year ago

Hello

Thank you for your work. I use mip in a neural network training pipeline, specifically inside an evaluation metric (smatchpp), in a multi-node, multi-threaded environment. My training sometimes crashes non-deterministically, and I can't tell whether the problem lies in my own code or in the smatchpp library because the error trace is so opaque. This is what I see:

ERROR while running Cbc. Signal SIGABRT caught. Getting stack trace.
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/mip/libraries/cbc-c-linux-x86-64.so(_Z15CbcCrashHandleri+0x119) [0x7f5f955c3459]
/lib64/libc.so.6(+0x54df0) [0x7f6697654df0]
/lib64/libc.so.6(+0xa154c) [0x7f66976a154c]
/lib64/libc.so.6(raise+0x16) [0x7f6697654d46]
/lib64/libc.so.6(abort+0xd3) [0x7f66976287f3]
/lib64/libstdc++.so.6(+0xa1a01) [0x7f66938a1a01]
/lib64/libstdc++.so.6(+0xad37c) [0x7f66938ad37c]
/lib64/libstdc++.so.6(+0xad3e7) [0x7f66938ad3e7]
/lib64/libstdc++.so.6(+0xad36f) [0x7f66938ad36f]
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCL8WorkNCCL15handleNCCLGuardENS_17ErrorHandlingModeE+0x278) [0x7f64d9cbd4d8]
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCL15workCleanupLoopEv+0x19f) [0x7f64d9cc102f]
/lib64/libstdc++.so.6(+0xdb9d4) [0x7f66938db9d4]
/lib64/libc.so.6(+0x9f802) [0x7f669769f802]
/lib64/libc.so.6(+0x3f450) [0x7f669763f450]

ERROR while running Cbc. Signal SIGABRT caught. Getting stack trace.

I have no idea how to read this (I am used to Python stack traces). I see references to both mip (at the top) and torch (near the end). So which one caused the error, mip or torch? And how can I pinpoint where the issue lies? Is it possible to get, or to implement, more verbose error traces for mip?
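
For anyone else trying to read a trace like this, here is a minimal sketch that splits each frame into its library, symbol and offset. It assumes the lines follow the glibc `backtrace_symbols` layout seen above (`path(symbol+0xoffset) [0xaddress]`); `parse_frames` and `FRAME_RE` are names made up for this example and are not part of mip or Cbc. glibc lists the innermost frame first, so the Cbc entry at the top is the signal handler that printed the trace, and the frames below it show what was running when `abort` was called.

```python
import re
from pathlib import PurePosixPath

# One glibc backtrace line: path(symbol+0xoffset) [0xaddress]
# The symbol part is empty when the library exports no name for that address.
FRAME_RE = re.compile(
    r"^(?P<lib>\S+)\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\s+\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


def parse_frames(trace: str) -> list[dict]:
    """Return one dict per stack frame, innermost frame first."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # skip headers such as "ERROR while running Cbc. ..."
        frames.append(
            {
                "lib": PurePosixPath(match["lib"]).name,
                "symbol": match["symbol"] or None,
                "offset": match["offset"],
            }
        )
    return frames


# A shortened copy of the trace from this issue.
TRACE = """
ERROR while running Cbc. Signal SIGABRT caught. Getting stack trace.
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/mip/libraries/cbc-c-linux-x86-64.so(_Z15CbcCrashHandleri+0x119) [0x7f5f955c3459]
/lib64/libc.so.6(+0x54df0) [0x7f6697654df0]
/lib64/libc.so.6(raise+0x16) [0x7f6697654d46]
/lib64/libc.so.6(abort+0xd3) [0x7f66976287f3]
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCL8WorkNCCL15handleNCCLGuardENS_17ErrorHandlingModeE+0x278) [0x7f64d9cbd4d8]
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCL15workCleanupLoopEv+0x19f) [0x7f64d9cc102f]
"""

FRAMES = parse_frames(TRACE)
for depth, frame in enumerate(FRAMES):
    print(depth, frame["lib"], frame["symbol"] or "<no symbol>", frame["offset"])
```

The `_Z...` names are mangled C++ symbols; piping them through a demangler such as `c++filt` from binutils turns them into readable signatures. In this trace the only named frames below `abort` are in `libtorch_cuda.so` (`ProcessGroupNCCL`), which suggests, but does not prove, that the abort started on the torch side and that Cbc's handler only reported it.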

rschwarz commented 1 year ago

I guess that the issue here lies within the Cbc solver (in its shared library), not in the Python code.
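
Since the crash happens in native code, one way to at least see which Python line was running at the time is the standard library's `faulthandler` module, which dumps the Python traceback of every thread when the process receives a fatal signal such as SIGABRT. The sketch below is an illustration under assumptions: `enable_crash_dumps` and the per-process log file name are invented for this example, and because the trace shows that Cbc installs its own crash handler, it is possible that Cbc's handler replaces the one from `faulthandler` if it is installed later. That has not been verified here.

```python
import faulthandler
import os
import tempfile


def enable_crash_dumps(log_dir: str):
    """Write Python tracebacks of all threads to a per-process file on fatal signals.

    Returns the open file object; it must stay open for as long as
    faulthandler is enabled, so the caller should keep a reference to it.
    """
    # One file per process, so that workers on the same node do not interleave output.
    path = os.path.join(log_dir, f"faulthandler-{os.getpid()}.log")
    log_file = open(path, "w")
    # Covers SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Call this once, early in the training script, before importing or using mip.
CRASH_LOG = enable_crash_dumps(tempfile.gettempdir())
```

The same handler can be switched on without code changes by starting the interpreter with `python -X faulthandler` or by setting the environment variable `PYTHONFAULTHANDLER=1`, in which case the traceback goes to stderr.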

BramVanroy commented 1 year ago

@rschwarz Thank you for the reply. Does that mean I should report it elsewhere? What would be the right place?

ckchow commented 1 year ago

https://github.com/coin-or/Cbc/issues (I believe I'm having a similar issue)