dotmesh-io / dotmesh

dotmesh (dm) is like git for your data volumes (databases, files etc) in Docker and Kubernetes
https://dotmesh.com
Apache License 2.0

Automatic deletion of filesystems found in ZFS and not in the registry #679

Open alaric-dotmesh opened 5 years ago

alaric-dotmesh commented 5 years ago

As a dotmesh runner user, I'd like the ability to have any filesystem found on ZFS and not in the registry to just be deleted, so that I can automatically recover from failure cases that lead to this situation.

Currently, filesystems in this state go into "failed" as per https://github.com/dotmesh-io/dotmesh/blob/master/pkg/fsm/fsm_discovering.go#L13. On runners, that means the fsid is claimed by the failed FSM, so subsequent attempts to re-clone the filesystem just fail with an error. On a runner, deleting a dodgy filesystem is fine: the storage there exists only as a cache, and the data will be re-pulled from the hub. So there should be a configurable option, set by the agent when it starts Dotmesh on a runner, to enable this "kill orphaned filesystems" mode.
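For illustration, the check described above could look roughly like the sketch below. It is not dotmesh code: `OrphanPolicy`, `findOrphans` and `decide` are hypothetical names, and the registry is modelled as a plain set of fsids rather than the real KV-backed registry.

```go
package main

import "sort"

// OrphanPolicy is a hypothetical config option; the issue only asks for
// "a configurable option" and does not name it.
type OrphanPolicy int

const (
	OrphanFail   OrphanPolicy = iota // today's behaviour: the FSM goes to "failed"
	OrphanDelete                     // proposed runner-only "kill orphaned filesystems" mode
)

// findOrphans returns the fsids that exist in ZFS but have no registry
// entry, sorted for stable output. The registry is assumed to be a set.
func findOrphans(zfsIDs []string, registry map[string]bool) []string {
	var orphans []string
	for _, id := range zfsIDs {
		if !registry[id] {
			orphans = append(orphans, id)
		}
	}
	sort.Strings(orphans)
	return orphans
}

// decide reports what would happen to one filesystem during discovery.
func decide(policy OrphanPolicy, inRegistry bool) string {
	if inRegistry {
		return "keep"
	}
	if policy == OrphanDelete {
		return "delete"
	}
	return "fail"
}
```

With the default policy an orphan still ends up "failed", so only a runner that opts in would ever delete data.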

Key acceptance criteria

lukemarsden commented 5 years ago

can we rename instead of destroying please? i'd rather losing the kv store didn't mean complete, catastrophic (and gradual!) data loss

rusenask commented 5 years ago

@lukemarsden we had a quick chat about this with @alaric-dotmesh, so the idea now would be to have two modes:

lukemarsden commented 5 years ago

got it, we might still care about data on runners though? a data scientist does 7 weeks of critical work to cure cancer offline, reconnects and loses their kv store

rusenask commented 5 years ago

good point :D

alaric-dotmesh commented 5 years ago

Rename at the ZFS level, I presume you mean, as this filesystem doesn't have a registry entry and so has no "name" at the dotmesh level! But if we do this on a runner (and it'll only happen on runners, or other similar "I'm just a cache" setups), who will ever go and look at the renamed ZFS filesystems? You'd need to do a zfs list. Alternatively, we add a special ListOrphanedFilesystems API call, plus a second API call to reconnect an orphaned filesystem by giving it a new fsid, a registry entry and related metadata. But then who is responsible for checking for orphaned filesystems?
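To make the rename idea concrete, here is a rough sketch under stated assumptions. The "orphaned-" prefix, the parent dataset argument and both function names are made up for illustration; only `zfs rename` and `zfs list -H -o name` are real commands. Nothing here shells out, it only builds the command and parses text in the shape that `zfs list -H -o name` prints.

```go
package main

import "strings"

// orphanPrefix is a hypothetical naming convention for renamed orphans.
const orphanPrefix = "orphaned-"

// renameArgs builds the argv for `zfs rename <old> <new>`, moving
// parent/fsid to parent/orphaned-fsid. The dataset layout is assumed.
func renameArgs(parent, fsid string) []string {
	return []string{
		"zfs", "rename",
		parent + "/" + fsid,
		parent + "/" + orphanPrefix + fsid,
	}
}

// listOrphaned extracts the original fsids of renamed orphans from the
// output of `zfs list -H -o name` (one dataset name per line).
func listOrphaned(parent, zfsListOutput string) []string {
	var out []string
	for _, line := range strings.Split(strings.TrimSpace(zfsListOutput), "\n") {
		name := strings.TrimPrefix(line, parent+"/")
		if name == line {
			continue // not a child of the parent dataset
		}
		if strings.HasPrefix(name, orphanPrefix) {
			out = append(out, strings.TrimPrefix(name, orphanPrefix))
		}
	}
	return out
}
```

Something like `listOrphaned` is what a ListOrphanedFilesystems call would need underneath; it does not answer the question of who calls it.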

alaric-dotmesh commented 5 years ago

The "nice alternative" is to store all the KV metadata somewhere in ZFS as well so it can be recovered (then we don't even need KV store persistence in single-node systems), but that's a bit more work...

alaric-dotmesh commented 5 years ago

In fact, on a single-node system, we could ditch the entire KV store and just use the caches in InMemoryState directly :-)

rusenask commented 5 years ago

or we could place boltdb data file inside a zfs filesystem :)

lukemarsden commented 5 years ago

putting boltdb in a specially named zfs filesystem is a great idea and should reduce the need for this auto-deletion?

rusenask commented 5 years ago

yep, I am not even sure why I made boltdb location configurable, it should always have been in zfs as it would ensure that as long as zfs is there we would have boltdb too. And if zfs is gone then having a surviving kv store doesn't make any difference.