DMTCP: Distributed MultiThreaded CheckPointing
http://dmtcp.sourceforge.net/

shared-memory1 is flaky #69

Closed karya0 closed 9 years ago

karya0 commented 9 years ago

It sometimes fails with the following error (observed with ./test/autotest.py -v shared-memory1):

[14211] mtcp_restart.c:1085 read_shared_memory_area_from_file:
  error 2 opening mmap file /home/kapil/dmtcp/dmtcp-shared-memory.QExEDL

gc00 commented 9 years ago

On 32-bit Ubuntu 9.10 and 32-bit Red Hat 6, I've observed that about 10% of the time the restart fails under autotest and exits, but autotest then tries again and succeeds.

I then ran make -j AUTOTEST="-v --stress" check-shared-memory1 > tmp.log 2>&1

@karya0: In the current DMTCP, I see the bug that you reported with error 2 (but only on the slower 32-bit Ubuntu 9.10). In addition, I'm seeing this second form of the bug, also related to read_shared_memory_area_from_file:

[24034] mtcp_restart.c:1264 read_shared_memory_area_from_file:
 mapping current version of /tmp/gene/dmtcp-dmtcp-ee5cdb6/dmtcp-shared-memory.bJG3qS into memory;
 _not_ file as it existed at time of checkpoint.
 (Or this may be a file shared by multiple processes.)
 Change mtcp_restart.c:1264 and re-compile, if you want different behavior.

In both cases, the two processes appear to race with each other on restart. Perhaps one process finishes mtcp_restart early and enters DMTCP's Util::runMtcpRestore, while the other process has hardly begun mtcp_restart. By the way, I find it much easier to trigger the bug on 32-bit Linux, perhaps because the 32-bit machines are slower.
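
To make the suspected ordering problem concrete, here is a minimal Python sketch of how an "error 2" (ENOENT) can arise when one process gets ahead of another. It is not DMTCP code, and the mechanism is an assumption: the thread does not establish that the early process removes the backing file. The function names (restart_order, fast_process_finishes, slow_process_opens) are hypothetical, and the two "processes" are simulated by call order so the outcome is deterministic.

```python
import errno
import os
import tempfile


def restart_order(fast_first: bool) -> int:
    """Return the errno seen by the slow process, or 0 if its open succeeds."""
    # Stand-in for the shared-memory backing file (name prefix is illustrative).
    fd, path = tempfile.mkstemp(prefix="dmtcp-shared-memory.")
    os.close(fd)

    def fast_process_finishes() -> None:
        # ASSUMPTION: the process that finishes early removes the file.
        os.unlink(path)

    def slow_process_opens() -> int:
        # The late process tries to open the file it expects to map.
        try:
            with open(path, "rb"):
                return 0
        except OSError as exc:
            return exc.errno

    if fast_first:
        # Bad interleaving: the file is gone before the slow process opens it.
        fast_process_finishes()
        return slow_process_opens()

    # Good interleaving: the slow process opens the file before it is removed.
    result = slow_process_opens()
    fast_process_finishes()
    return result


if __name__ == "__main__":
    bad = restart_order(fast_first=True)
    print(bad, errno.errorcode.get(bad))
    print(restart_order(fast_first=False))
```

If the real failure has this shape, it would also fit the observation that slower machines hit it more often, since a larger gap between the two processes makes the bad interleaving more likely.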

gc00 commented 9 years ago

@karya0: After some repeated testing on x86 (32-bit) Red Hat 6 at batlab, I think I've pinned down the bug to:

commit f98431d70eb6adf30f6808ec3f88fef21df33d59
Author: Gene Cooperman <gene@ccs.neu.edu>
Date:   Thu Mar 12 12:12:11 2015 -0400

    In migration, MAP_SHARED areas may fail.  Fixed.

Because the bug hits randomly and even disappears at times, I can't be sure of it. But running autotest.py --stress with a screenful of tests points to this commit as the cause. That is plausible, since both the commit and the failing test involve shared memory.
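
Since the failure is intermittent, comparing failure rates before and after a suspect commit is more reliable than a single pass/fail run. Below is a rough Python sketch of such a measurement. The helper failure_rate is hypothetical, and it assumes the command reports failure through a nonzero exit status, which this thread does not confirm for autotest.py (especially given that autotest retries a failed restart).

```python
import subprocess
import sys


def failure_rate(cmd, runs):
    """Run cmd `runs` times and return the fraction of nonzero exits."""
    failures = 0
    for _ in range(runs):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # ASSUMPTION: a nonzero exit status means the test failed.
        if proc.returncode != 0:
            failures += 1
    return failures / runs


if __name__ == "__main__":
    # Stand-in command that always fails; replace with the real test command.
    always_fails = [sys.executable, "-c", "raise SystemExit(1)"]
    print(failure_rate(always_fails, runs=3))
```

For this issue, the call might look like failure_rate(["./test/autotest.py", "-v", "shared-memory1"], runs=50), run once on the suspect commit and once on its parent. With a failure rate near 10%, a few dozen runs per commit are needed before a difference means much.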