dmtcp / dmtcp

DMTCP: Distributed MultiThreaded CheckPointing
http://dmtcp.sourceforge.net/

Nameservice socket to coordinator invalid after restart. #27

Closed karya0 closed 9 years ago

karya0 commented 9 years ago

Description as received from @jiajuncao

In coordinatorAPI.cpp, we have two sockets: _coordinatorSocket and _nsSock. When DMTCP is not in the running state, _coordinatorSocket is used for the publish/subscribe service (_nsSock and _coordinatorSocket are the same). To support publish/subscribe during the running state, an extra socket is created (_nsSock = createNewSocketToCoordinator(COORD_ANY)). The IB plugin makes use of this feature after a restart: it needs to subscribe to information about the remote memory region key while running. The first restart is OK. But when we checkpoint after the first restart and then restart a second time, _nsSock is a bad file descriptor: it still holds the same fd number as the one created after the previous restart, but that fd is no longer valid. I think it's either closed or not successfully restored.
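To illustrate the failure mode described above, here is a minimal sketch (not DMTCP code) of how a cached descriptor such as _nsSock could be checked for staleness before use and re-created if it is closed. The helper names fd_is_open and ensure_open_fd and the reconnect_fn callback type are hypothetical; the callback merely stands in for a call like createNewSocketToCoordinator(COORD_ANY).

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if fd is an open descriptor, 0 otherwise. */
static int fd_is_open(int fd)
{
  if (fd < 0) {
    return 0;
  }
  /* fcntl(F_GETFD) fails with EBADF when fd is not an open descriptor. */
  return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

/* Hypothetical callback type: creates a fresh socket and returns its fd. */
typedef int (*reconnect_fn)(void);

/*
 * Hypothetical helper: reuse *cached_fd if it is still open; otherwise
 * replace it with a descriptor obtained from reconnect().
 */
static int ensure_open_fd(int *cached_fd, reconnect_fn reconnect)
{
  if (!fd_is_open(*cached_fd)) {
    *cached_fd = reconnect();
  }
  return *cached_fd;
}
```

Note that this check only detects a closed descriptor. It cannot tell whether the same fd number has since been reused for an unrelated file, so it is a diagnostic aid rather than a fix.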

Here is a simple plugin that does nothing but publish/subscribe. It's a stand-alone version based on example-db. You can test it by running dmtcp1. As described above, the test procedure is checkpoint, restart, checkpoint, restart. It fails after the second restart.

#include <stdint.h>   /* uint32_t */
#include <stdio.h>    /* printf */
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include "dmtcp.h"

struct keyPid {
  int key;
  pid_t pid;
} mystruct, mystruct_other;

static int is_restart = 0;
uint32_t sizeofPid;

void dmtcp_event_hook(DmtcpEvent_t event, DmtcpEventData_t *data)
{

  /* NOTE:  See warning in plugin/README about calls to printf here. */
  switch (event) {
  case DMTCP_EVENT_RESTART:
    is_restart = 1;
    break;
  case DMTCP_EVENT_REGISTER_NAME_SERVICE_DATA:
    if (is_restart) {
      mystruct.key = 1;
      mystruct.pid = getpid();
      dmtcp_send_key_val_pair_to_coordinator("ex-db",
                                             &(mystruct.key),
                                             sizeof(mystruct.key),
                                             &(mystruct.pid),
                                             sizeof(mystruct.pid));
    }
    break;
  default:
    break;
  }
  DMTCP_NEXT_EVENT_HOOK(event, data);
}

unsigned int sleep(unsigned int seconds)
{
  unsigned int result = NEXT_FNC(sleep)(seconds);
  if (is_restart) {
    sizeofPid = sizeof(mystruct_other.pid);
    mystruct_other.key = 1;
    dmtcp_send_query_to_coordinator("ex-db",
                                    &(mystruct_other.key),
                                    sizeof(mystruct_other.key),
                                    &(mystruct_other.pid),
                                    &sizeofPid);
    printf("Pid returned from coordinator is %ld.\n", (long)mystruct_other.pid);
  }

  return result;
}

karya0 commented 9 years ago

This has been fixed in 856a4c2aa483a25847107ca0fb1a60df2721ddc8.