dotmesh-io / dotmesh

dotmesh (dm) is like git for your data volumes (databases, files etc) in Docker and Kubernetes
https://dotmesh.com
Apache License 2.0

[2d INTERRUPTION] Dotmesh pool id somehow changed twice #621

Closed Godley closed 5 years ago

Godley commented 5 years ago

Dotmesh got into a bad state on dotscience production: it stopped responding to requests, or answered them with connection refused/timeout. This appears to be because it was trying to reconcile filesystems that were associated with a different pool...

root@dotscience-beta-production-singlenode:~# dotmesh-etcd ls /dotmesh.io/servers/addresses
/dotmesh.io/servers/addresses/8342000fca164092
root@dotscience-beta-production-singlenode:~# dotmesh-etcd ls /dotmesh.io/servers/snapshots
/dotmesh.io/servers/snapshots/24c4d9d9df2389c8
/dotmesh.io/servers/snapshots/79e582b945f41900
/dotmesh.io/servers/snapshots/8342000fca164092

Not sure how this happened, but @rusenask's suggestion is to write the ID to a file, so that we're not relying on zfs to give us exactly the same one every time, and to refuse to start up if the ID has changed.

Godley commented 5 years ago

Aight, so I've put in the suggestion; I'm hoping this will at least let us investigate how this happened if it recurs. I then went back and started a node from the last backup (January 23rd). Right at the beginning of docker logs dotmesh-server-inner, that gives: time="2019-02-01T10:32:06Z" level=info msg="Detected my node ID as abd8afff16db76f9 ([172.18.0.7])". I looked in etcd and there are two server snapshot entries:

/ # etcdctl ls /dotmesh.io/servers/snapshots
/dotmesh.io/servers/snapshots/79e582b945f41900
/dotmesh.io/servers/snapshots/abd8afff16db76f9
/ # etcdctl ls /dotmesh.io/servers/addresses
/dotmesh.io/servers/addresses/abd8afff16db76f9

zfs says:

zfs get -H guid pool
pool    guid    14787429460124929250    -

I then went back to the backup before that (early December):

/ # etcdctl ls /dotmesh.io/servers/addresses
/dotmesh.io/servers/addresses/c81987c52d13ede8
/ # etcdctl ls /dotmesh.io/servers/snapshots
/dotmesh.io/servers/snapshots/79e582b945f41900
/dotmesh.io/servers/snapshots/c81987c52d13ede8

zfs:

zfs get -H guid pool
pool    guid    12872887580815842038    -

So it seems to change a lot? I'm going to have a look at docker logs dotmesh-server to check whether it's recreating the pool for some reason.

Godley commented 5 years ago

I put in a check which writes the ID to a file if the file doesn't already exist and, if it does, confirms the contents match what we read from zfs. If they don't match, it should crash out. Commit: https://github.com/dotmesh-io/dotmesh/commit/24cb41e39acd5a19d576d16fc6c20ae6ebe6fcde

For now I think this will do. I followed the code paths and can't see anything obviously wrong at either end that would explain why this is happening. If it happens again, let's dig in.