Closed karya0 closed 9 years ago
On 32-bit Ubuntu 9.10 and 32-bit Red Hat 6, I've observed that about 10% of the time the shared-memory1 test fails to restart under autotest and the restart exits, but autotest then tries again and succeeds.
I then ran make -j AUTOTEST="-v --stress" check-shared-memory1 > tmp.log 2>&1
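For anyone who wants to put a number on the "about 10%" figure, here is a minimal sketch of a repeat-and-count helper. It is not part of DMTCP or autotest; the function name failure_rate is made up for illustration, and it assumes the command's exit status is non-zero when the test fails, which may not hold if autotest retries internally and then reports success.

```python
import subprocess


def failure_rate(cmd, runs):
    """Run cmd `runs` times and return the fraction of runs that exit non-zero.

    Hypothetical helper: assumes a non-zero exit status means the run failed.
    """
    failures = 0
    for _ in range(runs):
        # Discard output; only the exit status is inspected here.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures / runs
```

For example, failure_rate(["./test/autotest.py", "-v", "shared-memory1"], 50) would run the test 50 times from the DMTCP source directory and return the observed failure fraction.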
@karya0: In the current DMTCP, I see the bug that you reported with error 2 (but only on the slower 32-bit Ubuntu 9.10). In addition, I'm seeing this second form of the bug, also related to read_shared_memory_area_from_file:
[24034] mtcp_restart.c:1264 read_shared_memory_area_from_file:
mapping current version of /tmp/gene/dmtcp-dmtcp-ee5cdb6/dmtcp-shared-memory.bJG3qS into memory;
_not_ file as it existed at time of checkpoint.
(Or this may be a file shared by multiple processes.)
Change mtcp_restart.c:1264 and re-compile, if you want different behavior.
In both cases, it seems like the two processes have a race condition on restart. Perhaps one process finishes mtcp_restart early and enters DMTCP's Util::runMtcpRestore, while the other process has hardly begun mtcp_restart. By the way, I find it much easier to generate the bug on 32-bit Linux, perhaps because the 32-bit Linux is slower.
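To make the warning quoted above concrete: a shared mapping of a file shows the file's current contents, not a snapshot from some earlier time. So if one process has already run ahead and written to the backing file, a process that maps it later sees the modified data. A minimal sketch of that general behavior follows; this is not DMTCP code, and the file name and contents are hypothetical stand-ins for the dmtcp-shared-memory.* file.

```python
import mmap
import os
import tempfile


def demo_shared_mapping():
    """Return (contents at 'checkpoint', contents seen by a later MAP_SHARED mapping)."""
    # Hypothetical stand-in for the shared-memory backing file.
    fd, path = tempfile.mkstemp(prefix="shared-memory-demo.")
    try:
        os.write(fd, b"checkpoint-time")
        # What the file held at the (simulated) time of checkpoint.
        snapshot = os.pread(fd, 15, 0)
        # Simulate another process modifying the file after the checkpoint.
        os.pwrite(fd, b"modified-later!", 0)
        # A shared mapping created now reflects the current file contents.
        with mmap.mmap(fd, 15, mmap.MAP_SHARED, mmap.PROT_READ) as m:
            current = m[:15]
        return snapshot, current
    finally:
        os.close(fd)
        os.unlink(path)
```

Here the mapping returns the later contents rather than the checkpoint-time contents, which is the situation the mtcp_restart message describes.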
@karya0: After some repeated testing on x86 (32-bit) Red Hat 6 at batlab, I think I've pinned down the bug to:
commit f98431d70eb6adf30f6808ec3f88fef21df33d59
Author: Gene Cooperman <gene@ccs.neu.edu>
Date: Thu Mar 12 12:12:11 2015 -0400
In migration, MAP_SHARED areas may fail. Fixed.
Because the bug hits randomly and even disappears at times, I can't be sure of it. But running autotest.py --stress with a screenful of tests seems to indicate that this commit is the cause. That's plausible, since both the commit and the failing test involve shared memory.
It sometimes fails with the following error (observed with ./test/autotest.py -v shared-memory1):