dotmesh-io / dotmesh

dotmesh (dm) is like git for your data volumes (databases, files etc) in Docker and Kubernetes
https://dotmesh.com
Apache License 2.0

Gather all state pertaining to a fs in one place #625

Open alaric-dotmesh opened 5 years ago

alaric-dotmesh commented 5 years ago

We had a stab at this before (https://github.com/dotmesh-io/dotmesh/issues/186) but that got diverted into refactoring the fsmachines, and then we called it a day and moved on.

However, the original point remains: in the InMemoryState we have multiple maps from fsid to various mutable things, and those things get mutated through various paths (largely through etcd, through the actions of fsmachines, or perhaps directly from RPCs?). This means they can get out of sync, as deletion shows. A deletion RPC may come in, and the fsmachines representing the branches of the deleted dot all get told to delete themselves. Each node starts to process that deletion as soon as its fsmachines have stopped doing what they were previously doing, which could be a long-running operation like a transfer. So any operation performed in the fsmachine needs to handle the other mutable state in the InMemoryState changing to a deleted state while that operation is in progress.

That scenario, and possibly many others involving operations such as container start/stop, migration, deletion, creation, push, and pull, has a potentially vast number of overlapping interactions between the concurrent code paths that use that mutable state. It's very hard to be confident that we have catered for all possible combinations, and the repeated issues with deletion are a shining example of that.

As such, we should:

  1. Only have a single map from filesystem ID to mutable state in the InMemoryState. Either put the masters cache, snapshot cache, and so on into the fsmachine itself, or have a top-level struct with an instance for each fsid that contains all of its mutable state and a reference to the fsmachine.
  2. All reads of this information should be done via functions or methods of the InMemoryState, rather than directly, so we can control it properly (a lot of this has been done already).
  3. Put suitable locking around access to fields of that struct, to ensure that the fsmachine has synchronous control over it and things can't change while the fsmachine is using them. Perhaps all updates to this information originating from the KV store or RPC calls or anything else should be turned into messages sent to the fsmachine that cause it to update its state when it's finished the current operation? We would probably want to make exceptions for some simple cases that we CAN reason about, though. The death of a container using a DM volume should probably just be handled as soon as the lock is available, as IIRC nothing in the fsmachine depends on that not changing while operations are in progress.
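
To make the three points above concrete, here is a rough Go sketch of one possible shape. It is not the existing dotmesh code: `fsData`, `fsEntry`, `Submit`, `Read`, `RunOperation` and `ContainerStopped` are all hypothetical names, the fields are illustrative, and it assumes only one operation runs per fsid at a time (as with a single fsmachine goroutine). Updates from etcd or RPCs are queued while an operation is in progress and applied when it finishes; container death is the exception and is applied as soon as the lock is free.

```go
package main

import (
	"fmt"
	"sync"
)

// fsData is all mutable state for one filesystem ID (illustrative fields only).
type fsData struct {
	Master     string
	Snapshots  []string
	Deleted    bool
	Containers map[string]bool
}

// fsEntry guards one fsData and holds updates deferred during an operation.
type fsEntry struct {
	mu      sync.Mutex
	data    fsData
	busy    bool
	pending []func(*fsData)
}

// InMemoryState holds the single map from filesystem ID to its state.
type InMemoryState struct {
	mu      sync.Mutex
	entries map[string]*fsEntry
}

func NewInMemoryState() *InMemoryState {
	return &InMemoryState{entries: make(map[string]*fsEntry)}
}

// entry returns the entry for fsid, creating it on first use.
func (s *InMemoryState) entry(fsid string) *fsEntry {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.entries[fsid]
	if !ok {
		e = &fsEntry{data: fsData{Containers: make(map[string]bool)}}
		s.entries[fsid] = e
	}
	return e
}

// Submit applies an update (from etcd, an RPC, ...) immediately if the
// fsmachine is idle, or queues it until the current operation finishes.
func (s *InMemoryState) Submit(fsid string, u func(*fsData)) {
	e := s.entry(fsid)
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.busy {
		e.pending = append(e.pending, u)
		return
	}
	u(&e.data)
}

// ContainerStopped is the "simple case" exception: it never waits for the
// current operation, only for the lock.
func (s *InMemoryState) ContainerStopped(fsid, container string) {
	e := s.entry(fsid)
	e.mu.Lock()
	defer e.mu.Unlock()
	delete(e.data.Containers, container)
}

// Read returns a copy of the state, so callers never touch the fields directly.
func (s *InMemoryState) Read(fsid string) fsData {
	e := s.entry(fsid)
	e.mu.Lock()
	defer e.mu.Unlock()
	d := e.data
	d.Snapshots = append([]string(nil), e.data.Snapshots...)
	d.Containers = make(map[string]bool, len(e.data.Containers))
	for k, v := range e.data.Containers {
		d.Containers[k] = v
	}
	return d
}

// RunOperation models one fsmachine operation. Queued updates are applied,
// in arrival order, only after op returns.
func (s *InMemoryState) RunOperation(fsid string, op func()) {
	e := s.entry(fsid)
	e.mu.Lock()
	e.busy = true
	e.mu.Unlock()

	op() // the lock is not held here, so Submit and Read do not block

	e.mu.Lock()
	defer e.mu.Unlock()
	for _, u := range e.pending {
		u(&e.data)
	}
	e.pending = nil
	e.busy = false
}

func main() {
	s := NewInMemoryState()
	s.Submit("fs1", func(d *fsData) { d.Master = "node-a" })
	s.RunOperation("fs1", func() {
		// A deletion arrives mid-operation; it is queued, not applied.
		s.Submit("fs1", func(d *fsData) { d.Deleted = true })
		fmt.Println("deleted during operation:", s.Read("fs1").Deleted)
	})
	fmt.Println("deleted after operation:", s.Read("fs1").Deleted)
}
```

The trade-off in this shape is that a long transfer delays every queued update for that fsid, which is why the container-death path bypasses the queue.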