flipz357 / smatchpp

A package for handy processing of semantic graphs such as AMR, with a special focus on standardized evaluation
GNU General Public License v3.0

Instability of mip? #4

Closed · BramVanroy closed this issue 1 year ago

BramVanroy commented 1 year ago

Hello

I've just reimplemented my neural network training pipeline, and I am now using smatchpp instead of smatch. Overall this works great, so thank you for your work!

Unfortunately, I sometimes get a fatal error that disrupts the whole training loop and cannot be recovered from. I have also reported this here. I do not know how to debug it, so I am wondering/hoping that you ran into a similar issue with mip while testing your library.

This is the error trace, but I can't figure out how to read it. Is mip the trigger, or is torch? Does it have to do with distributed training? How can I debug this? A lot of questions... If you have any insights, they are very welcome, because this issue stops me from using smatchpp in my code: it completely destroys the training progress. As far as I know, smatch does not rely on mip.

ERROR while running Cbc. Signal SIGABRT caught. Getting stack trace.
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/mip/libraries/cbc-c-linux-x86-64.so(_Z15CbcCrashHandleri+0x119) [0x7f5f955c3459]
/lib64/libc.so.6(+0x54df0) [0x7f6697654df0]
/lib64/libc.so.6(+0xa154c) [0x7f66976a154c]
/lib64/libc.so.6(raise+0x16) [0x7f6697654d46]
/lib64/libc.so.6(abort+0xd3) [0x7f66976287f3]
/lib64/libstdc++.so.6(+0xa1a01) [0x7f66938a1a01]
/lib64/libstdc++.so.6(+0xad37c) [0x7f66938ad37c]
/lib64/libstdc++.so.6(+0xad3e7) [0x7f66938ad3e7]
/lib64/libstdc++.so.6(+0xad36f) [0x7f66938ad36f]
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCL8WorkNCCL15handleNCCLGuardENS_17ErrorHandlingModeE+0x278) [0x7f64d9cbd4d8]
/home/local/vanroy/multilingual-text-to-amr/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCL15workCleanupLoopEv+0x19f) [0x7f64d9cc102f]
/lib64/libstdc++.so.6(+0xdb9d4) [0x7f66938db9d4]
/lib64/libc.so.6(+0x9f802) [0x7f669769f802]
/lib64/libc.so.6(+0x3f450) [0x7f669763f450]
flipz357 commented 1 year ago

I have never seen this problem, but I think it probably comes from mip, yes. I found an issue there about a similar problem: https://github.com/coin-or/python-mip/issues/254

I see you also filed an issue there, that's good.

For now, to avoid any risk of stopping the training, you can use

solver = solvers.HillClimber()

which is also the default. For training, the feedback from the hill-climber may be sufficient; optimal (ILP) solving should only be applied for the final evaluation.
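
A minimal sketch of how the two solvers could be wired into a scoring setup. The Smatchpp class, the alignmentsolver argument, and the commented scoring call are assumed from the smatchpp README and may differ between versions; only solvers.HillClimber() and the idea of switching to an ILP solver for the final evaluation come from this thread.

from smatchpp import Smatchpp, solvers

# Training-time scoring: the hill-climber (the default) should not go through
# the mip/Cbc backend, so a Cbc abort cannot take the training process down.
# Its score may slightly underestimate the optimal Smatch score.
train_measure = Smatchpp(alignmentsolver=solvers.HillClimber())

# Final evaluation: optimal alignment via ILP (the code path that uses mip/Cbc).
eval_measure = Smatchpp(alignmentsolver=solvers.ILP())

# Hypothetical usage; the scoring method name may differ between versions:
# match, status, alignment = eval_measure.process_pair("(t / test)", "(t / test)")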

flipz357 commented 1 year ago

PS: which version of mip are you running? I see that they're now at 1.15.0. Maybe there was a fix for this bug.

BramVanroy commented 1 year ago

Thank you for the quick response! The error occurred with 1.15.0. I have now downgraded to 1.13.0, as recommended in your README, but I don't have results yet.
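
For reference, a quick way to double-check which mip version is actually installed in the training environment (a generic Python check, not part of smatchpp; 1.15.0 is where the crash was observed and 1.13.0 is the README recommendation):

# Print the installed python-mip version.
from importlib.metadata import version
print(version("mip"))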

It is a good idea to use the hill climber for training and ILP for final evaluation. That should be less problematic in my use case.

Because you suggest it is a mip-specific problem and not caused by smatchpp, I'll close this and hope that the mip developers can find a solution, although it seems that it is not that easy...

flipz357 commented 1 year ago

Great, feel free to re-open it anytime.

In case you manage to reproduce this bug, I'd be very interested to see what's going on; it might also be interesting for the folks who develop mip.