dotmesh-io / dotmesh

dotmesh (dm) is like git for your data volumes (databases, files etc) in Docker and Kubernetes
https://dotmesh.com
Apache License 2.0

dotmesh pull retry logic gets confused by partial errors when creating a filesystem #753

Open alaric-dotmesh opened 5 years ago

alaric-dotmesh commented 5 years ago

A clone operation failed with a strange error:

time="2019-09-16T09:17:46Z" level=info msg="Downloading workspace & data from hub - downloaded 477.45/5171.85 MiB at 78.62 MiB/s (1/1)"
time="2019-09-16T09:17:46Z" level=info msg="Still pulling..." dots_pulling=1
time="2019-09-16T09:17:47Z" level=info msg="Transfer status polled" elapsed_ns=6776212170 index=1 message="Attempting to pull d671c4be-fb95-4835-a501-33c707fb66c2 got <Event zfs-recv-failed: err: \"exit status 1\", filesystemId: \"d671c4be-fb95-4835-a501-33c707fb66c2\", stderr: \"cannot receive incremental stream: checksum mismatch or incomplete stream\\n\">" sent_bytes=551458107 size_bytes=5423081024 status="retry 1" total=1 transfer_id=6edae9a4-7620-4c7a-acd2-a15566221b69
time="2019-09-16T09:17:47Z" level=info msg="Downloading workspace & data from hub - downloaded 525.91/5171.85 MiB at 77.61 MiB/s (1/1)"

The retry loop then tried again. However, the original failed receive had already created some snapshots, and each retry attempted to create the filesystem from scratch, so every subsequent attempt failed:

time="2019-09-16T09:17:47Z" level=info msg="Still pulling..." dots_pulling=1
time="2019-09-16T09:17:48Z" level=info msg="Transfer status polled" elapsed_ns=33781652 index=1 message="Attempting to pull d671c4be-fb95-4835-a501-33c707fb66c2 got <Event zfs-recv-failed: err: \"exit status 1\", filesystemId: \"d671c4be-fb95-4835-a501-33c707fb66c2\", stderr: \"cannot receive new filesystem stream: destination 'pool/dmfs/d671c4be-fb95-4835-a501-33c707fb66c2' exists\\nmust specify -F to overwrite it\\n\">" sent_bytes=51 size_bytes=5423081024 status="retry 2" total=1 transfer_id=6edae9a4-7620-4c7a-acd2-a15566221b69

I've not dug into the code, but I suspect the "calculation of what we need to pull" step isn't being redone in the retry loop, so a failure that pulls in some snapshots causes all subsequent retries to fail as they try to pull snapshots we already have.
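
If that diagnosis is right, the fix might look roughly like the following. This is a minimal Go sketch under that assumption, not dotmesh's actual code; `discoverLocalSnapshots` and `pullMissing` are hypothetical helpers:

```go
// Hypothetical sketch, not dotmesh's actual code: recompute the pull plan
// on every retry so snapshots landed by a failed receive are skipped
// instead of re-requested from scratch.
package pull

import (
	"fmt"
	"log"
)

// Assumed helper: list the snapshots already present locally for fsID.
func discoverLocalSnapshots(fsID string) ([]string, error) { return nil, nil }

// Assumed helper: fetch the missing snapshots, as an incremental stream from
// the newest common snapshot when local is non-empty (a full "new filesystem"
// stream would fail with "destination ... exists").
func pullMissing(fsID string, local, missing []string) error { return nil }

func pullWithRetry(fsID string, remote []string, maxRetries int) error {
	for attempt := 1; attempt <= maxRetries; attempt++ {
		// Key point: re-discover local state inside the loop, not just once
		// before it, because each failed attempt may still land snapshots.
		local, err := discoverLocalSnapshots(fsID)
		if err != nil {
			return err
		}
		missing := subtract(remote, local)
		if len(missing) == 0 {
			return nil // earlier partial receives already completed the pull
		}
		if err := pullMissing(fsID, local, missing); err != nil {
			log.Printf("pull attempt %d failed: %v; retrying", attempt, err)
			continue
		}
		return nil
	}
	return fmt.Errorf("pull of %s failed after %d attempts", fsID, maxRetries)
}

// subtract returns the entries of want that are not in have.
func subtract(want, have []string) []string {
	seen := make(map[string]bool, len(have))
	for _, s := range have {
		seen[s] = true
	}
	var out []string
	for _, s := range want {
		if !seen[s] {
			out = append(out, s)
		}
	}
	return out
}
```

The important change is that the "what's still missing?" computation lives inside the loop, so a partially successful receive shrinks the work for the next attempt instead of breaking it.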

lukemarsden commented 5 years ago

if we re-discover the on-disk state when a fetch fails, we can probably cope better with this scenario
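
A minimal, hypothetical sketch of what that discovery step could look like (one way to implement the `discoverLocalSnapshots` helper assumed in the sketch above), shelling out to `zfs list` against the `pool/dmfs/<filesystem-id>` dataset path visible in the log:

```go
// Hypothetical sketch of the discovery step: enumerate snapshots that
// already exist under the filesystem's ZFS dataset (pool/dmfs/<id>, as in
// the log output above) by shelling out to `zfs list`.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func discoverLocalSnapshots(fsID string) ([]string, error) {
	dataset := "pool/dmfs/" + fsID
	// -H: no header, -t snapshot: snapshots only, -o name: names only,
	// -r: recurse into the dataset.
	out, err := exec.Command("zfs", "list", "-H", "-t", "snapshot",
		"-o", "name", "-r", dataset).Output()
	if err != nil {
		// `zfs list` also exits non-zero when the dataset does not exist;
		// a real implementation would distinguish "nothing received yet"
		// from genuine failures. Treat both as "no snapshots" here.
		return nil, nil
	}
	var snaps []string
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if line != "" {
			snaps = append(snaps, line)
		}
	}
	return snaps, nil
}

func main() {
	snaps, _ := discoverLocalSnapshots("d671c4be-fb95-4835-a501-33c707fb66c2")
	fmt.Printf("already have %d snapshot(s): %v\n", len(snaps), snaps)
}
```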