dotmesh-io / dotmesh

dotmesh (dm) is like git for your data volumes (databases, files etc) in Docker and Kubernetes
https://dotmesh.com
Apache License 2.0
538 stars 29 forks source link

Deletion flakiness #635

Closed lukemarsden closed 5 years ago

lukemarsden commented 5 years ago
project workspace deletion failed: Response '{"jsonrpc":"2.0","error":{"code":-32000,"message":"Key already exists","data":null},"id":1366189801693462530}
' yields error Key already exists

when trying to delete https://cloud.dotscience.net/projects/e23fe3b4-eb4a-41d3-814b-1a774288cb54/settings/

alaric-dotmesh commented 5 years ago

Sadly, this cannot be reproduced as-is; the project deleted fine when I tried it. However, somewhat expecting this, I did some spelunking first. The project workspace dot already didn't exist in dotmesh.

So: I conclude that the deletion had failed and left the dot in one of the many possible half-deleted states, and that either a dotmesh restart on production or the periodic mechanism that finishes half-deleted deletions after a timeout came along and cleared up.

So, going back to the original error of "Key already exists", I've seen this before with partial deletions: I think it comes from a case where a deletion hanged, leaving the dot still in dm list, but attempting to delete it again fails with "Key already exists" because the marker stating that the filesystem(s) in the dot are going already exist. I'm now digging into the code to find those paths, and I'm going to see if I can use a strategy similar to wound/wait in concurrent algorithms: if a deletion finds those markers already present, it's either picking up after a failed/hung deletion, or the same dot is just being deleted "twice at once" (there's no easy way of telling the difference), so rather than give up I should proceed with the deletion but not be surprised if some deletion steps are already done (where it's safe to do so).

alaric-dotmesh commented 5 years ago

Ok, looking at the deletion code, there's two changes I need to make for this.

  1. Make the code that sets the deletion flags in etcd not error on the key already existing.
  2. Make sure that waitForFilesystemDeath won't block forever if the filesystem is already gone.

So, having fixed both, I think that we should now be safe(r) in retrying deletions that failed before, and the fix to (2) might also fix a possible race condition that would have made deletions hang in the first place if two of them happened at once or something funny like that.

So, I can't be sure if I've found whatever case failed Luke in the first try at this, but I think I've probably made things a bit better!