dotmesh-io / dotmesh

dotmesh (dm) is like git for your data volumes (databases, files etc) in Docker and Kubernetes
https://dotmesh.com
Apache License 2.0
538 stars 29 forks source link

[1d INTERRUPTION] TestRecoverFromUnmountedDotOnMaster/FilesystemRemountedOnRestart is failing #620

Closed alaric-dotmesh closed 5 years ago

alaric-dotmesh commented 5 years ago

This happens:

Jan 25 15:55:06 Dotmesh containers running on cluster-15396805477938785-0-node-0: RUNNING on cluster-15396805477938785-0-node-0: docker start dotmesh-server
Jan 25 15:55:08 Dotmesh containers running on cluster-15396805477938785-0-node-0: Dotmesh containers running on cluster-15396805477938785-0-node-0: Dotmesh containers running on cluster-15396805477938785-0-node-0: Error checking for bb555c5f-b0ba-4d57-59ce-c83cddb04966 to be mounted: bb555c5f-b0ba-4d57-59ce-c83cddb04966 not mounted, pausing and trying again...
Jan 25 15:55:09 Error checking for bb555c5f-b0ba-4d57-59ce-c83cddb04966 to be mounted: bb555c5f-b0ba-4d57-59ce-c83cddb04966 not mounted, pausing and trying again...
Jan 25 15:55:10 Error checking for bb555c5f-b0ba-4d57-59ce-c83cddb04966 to be mounted: bb555c5f-b0ba-4d57-59ce-c83cddb04966 not mounted, pausing and trying again...
Jan 25 15:55:12 Error checking for bb555c5f-b0ba-4d57-59ce-c83cddb04966 to be mounted: bb555c5f-b0ba-4d57-59ce-c83cddb04966 not mounted, pausing and trying again...

This change has happened since commit 74cd440529d1bc95e39b7a48ef91e0051675e665 or 320d54281a0b47b041342f047c6c88083a8ffdcf ; somehow, those changes to how the dm cluster command works have affected this case!

alaric-dotmesh commented 5 years ago

This is a timing issue, which has made it tricky to analyse as all the state is "fixed" by the time the test has failed and I get in there to look at it :-)

FsMachine.discover() calls FsMachine.snapshotsChanged which calls InMemoryState.UpdateSnapshotsFromKnownState which calls Registry.CurrentMasterNode which fails No known filesystem with id which means that the filesystem doesn't get mounted (it gets stuck in the missing state), but the masters cache learns about that filesystem fairly soon afterwards. It just doesn't know about it when initial discovery is happening.

The filesystem gets stuck in this state:

  "filesystems.a37460ff-3dfe-4bc0-5c58-738896100690.DONE": "yes",
  "filesystems.a37460ff-3dfe-4bc0-5c58-738896100690.STARTED": "yes",
  "filesystems.a37460ff-3dfe-4bc0-5c58-738896100690.currentState": "missing",
  "filesystems.a37460ff-3dfe-4bc0-5c58-738896100690.dirtyDelta": "0",
  "filesystems.a37460ff-3dfe-4bc0-5c58-738896100690.filesystem.exists": "true",
  "filesystems.a37460ff-3dfe-4bc0-5c58-738896100690.filesystem.id": "a37460ff-3dfe-4bc0-5c58-738896100690",
  "filesystems.a37460ff-3dfe-4bc0-5c58-738896100690.filesystem.mounted": "false",
  "filesystems.a37460ff-3dfe-4bc0-5c58-738896100690.filesystem.origin": "@",
  "filesystems.a37460ff-3dfe-4bc0-5c58-738896100690.id": "a37460ff-3dfe-4bc0-5c58-738896100690",
  "filesystems.a37460ff-3dfe-4bc0-5c58-738896100690.lastTransferRequest": "{\"Peer\":\"\",\"User\":\"\",\"Port\":0,\"ApiKey\":\"\",\"Direction\":\"\",\"LocalNamespace\":\"\",\"LocalName\":\"\",\"LocalBranchName\":\"\",\"RemoteNamespace\":\"\",\"RemoteName\":\"\",\"RemoteBranchName\":\"\",\"TargetCommit\":\"\",\"StashDivergence\":false}",
  "filesystems.a37460ff-3dfe-4bc0-5c58-738896100690.lastTransferRequestId": "",
  "filesystems.a37460ff-3dfe-4bc0-5c58-738896100690.lastTransitionTimestamp": "1548772167390285390",
  "filesystems.a37460ff-3dfe-4bc0-5c58-738896100690.sizeBytes": "0",
  "filesystems.a37460ff-3dfe-4bc0-5c58-738896100690.status": "waiting for snapshots or requests",

Putting a 5s sleep in discover makes it all work fine:

func (f *FsMachine) discover() error {
    // discover system state synchronously
    filesystem, err := f.zfs.DiscoverSystem(f.filesystemId)
    if err != nil {
        return err
    }

    f.snapshotsLock.Lock()
    f.filesystem = filesystem
    f.snapshotsLock.Unlock()

    // ABS FIXME
    time.Sleep(5 * time.Second)
    // ABS FIXME

    err = f.snapshotsChanged()
    if err != nil {
        return fmt.Errorf("Error updating snapshots from discover(): %s", err)
    } else {
        return nil
    }
}

So, I'm going to work out the code path by which the masters cache is primed, and try to make it happen before discovery does.

alaric-dotmesh commented 5 years ago

That seems to have fixed it - and not made anything else worse!

rusenask commented 5 years ago

LGTM