dmtcp / dmtcp

DMTCP: Distributed MultiThreaded CheckPointing
http://dmtcp.sourceforge.net/

Latest version fails to launch command with "ERROR 0 at threadsync.cpp:176 void dmtcp::ThreadSync::wrapperExecutionLockLock(): Failed to acquire lock" #1158

Open NeuralModder opened 1 week ago

NeuralModder commented 1 week ago

This program seems like it could be useful to me. I've tried multiple tagged versions and branches so far; each seems to generate a different error (whether during compilation or when trying to launch a program).

On the latest version (3.1.0), the program builds successfully. However, if I try to launch any program, I get this error:

[youser@hoastnayme ~/src/dmtcp/bin]$ ./dmtcp_launch cat
ERROR 0 at threadsync.cpp:176 void dmtcp::ThreadSync::wrapperExecutionLockLock(): Failed to acquire lock

Getting the coordinator logs by running the coordinator in one terminal and ./dmtcp_launch -j -p 7779 cat in another doesn't seem (to me) to provide particularly helpful information either:

[2024-10-02T17:01:03.452, 26674, 26674, Note] at dmtcp_coordinator.cpp:933 in initializeComputation; REASON='Resetting computation
[2024-10-02T17:01:03.452, 26674, 26674, Note] at dmtcp_coordinator.cpp:1042 in onConnect; REASON='worker connected
     hello_remote.from = 5092d13623c01989-26813-2fd28d405cefe
     client->progname() = cat
[2024-10-02T17:01:03.463, 26674, 26674, Note] at dmtcp_coordinator.cpp:891 in onDisconnect; REASON='client disconnected
     client->identity() = 5092d13623c01989-26813-2fd28d405cefe
     client->progname() = cat
dmtcp>

I've tried looking at the named code location in threadsync.cpp, but I'm not very familiar with C++ and am rather puzzled.

Here's some system information that might be relevant:

Could my glibc version be too recent? Is there anything I can try to solve this problem, or any more information I could provide?

gc00 commented 15 hours ago

Hi @NeuralModder,

Thanks for the feedback. I'm not able to reproduce this bug. For example, I'm using glibc-2.37 on Fedora 38, and I'm not seeing it. We are now also on DMTCP-3.1.2 (a recent fix for a regression).

If you have the chance, could you try:

./configure --enable-debug && make clean && make -j8

and then try:

gdb --args bin/dmtcp_launch cat
(gdb) run
(gdb) thread apply all where

And then paste the result here. Thanks.
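
If an interactive gdb session is inconvenient, the same backtrace can probably be captured in one shot with gdb's batch mode. The following is only a sketch: the helper name capture_backtrace, the GDB override variable, and the output file name backtrace.txt are hypothetical and not part of DMTCP, and it assumes the command is run from the top of the build tree so that bin/dmtcp_launch exists.

```shell
# Hypothetical helper: run a command under gdb in batch mode and save the
# backtrace of every thread (plus the program's own output) to a file.
# The GDB variable is an assumption of this sketch; it defaults to "gdb".
capture_backtrace() {
  out=$1   # file that receives gdb's output
  shift    # remaining arguments are the command to debug
  "${GDB:-gdb}" -batch -ex run -ex 'thread apply all where' --args "$@" >"$out" 2>&1
}

# Example invocation from the top of the DMTCP build tree:
#   capture_backtrace backtrace.txt bin/dmtcp_launch cat
```

The backtrace is only meaningful if the process stops under gdb (for example on a signal) rather than exiting normally; in that case the resulting file can be pasted here as-is.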