dotmesh-io / dotmesh

dotmesh (dm) is like git for your data volumes (databases, files etc) in Docker and Kubernetes
https://dotmesh.com
Apache License 2.0
539 stars 29 forks source link

#788: list commits immediately on fs init #789

Closed itamarst closed 4 years ago

itamarst commented 4 years ago

Fixes #788

Or not? I can't reproduce the problem, so I wrote a test and potential fix but I can't be sure it did anything. @lukemarsden can you see if this fixes it for you?

itamarst commented 4 years ago

Had a bug in previous version, so testing should await me getting tests passing.

itamarst commented 4 years ago

OK, fixed that I think.

lukemarsden commented 4 years ago

Thanks for the PR, and for digging into dotmesh!

Dots (filesystems) with no commits (snapshots) on them are certainly valid system state, so I'm a bit concerned about the 10 second delay on Commits calls for that case. Do we create initial commits on new dots in dotmesh now? (I remember we did that from inside the gateway for a while, as for the dotscience use case we need an initial commit to be able to synchronize even an empty filesystem to a runner etc). That might make this change less worrisome, but I think there might also be a better approach.

Overall I think this proposed change feels like a band-aid on the rpc side, when there's a possible "real" fix in the guts of the system we could investigate first.

In response to your comment here:

Digging through the code, I eventually found InitFilesystemMachine in dotmesh, which does the following:

      go s.filesystems[filesystemId].Run() // concurrently run state machine

      return s.filesystems[filesystemId], false

This suggests filesystem creation happens in background, if I'm interpreting code correctly...

InitFilesystemMachine is about initializing a state machine to check (and modify) the state of the system, not (necessarily) creating a filesystem. Filesystems can show up in a number of ways, they can be created explicitly on the host via the CreateFilesystem method here:

https://github.com/dotmesh-io/dotmesh/blob/75f9c909d6da9ec67aa30d575024b0a6903cec73/cmd/dotmesh-server/controller.go#L613

They can also be received in a new cross-cluster replication stream (push/clone), or they can be received in a cluster-local replication (replication, moving dots between nodes).

The case in https://github.com/dotmesh-io/dotmesh/issues/788 is that the runner is the initiator of a dm clone style operation. I.e. cross cluster replication pulling a dot from the hub to the runner, where the runner is the initiator of the clone. I'm pretty sure the bug only happens when the dot doesn't exist on the runner prior to the replication.

The codepath I had in mind to try to fix this in was not the rpc side, where this feels like a band-aid (your diff takes a hint that filesystem state might not be up to date, and adds latency), but rather in the codepath that receives a cross-cluster replication stream. Can we ensure that we set up the snapshot list in memory before we return success from the (effective) dm clone?

lukemarsden commented 4 years ago

So in particular (from the original bug report): image When does some dotmesh API response cause Pull finished to be logged by the dotscience-agent, and at that point has dotmesh already synchronously listed the snaps on the filesystem? And if not can we make it so that it does?

itamarst commented 4 years ago

OK, going to try that approach in a different branch, starting with seeing if I can reproduce the bug with a test based off dm clone.