[Open] DaveCTurner opened this issue 1 year ago
Pinging @elastic/es-distributed (Team:Distributed)
There's another option here, I believe: we could simply make uploads for Azure and GCS work the same way they do for S3 and drop the collision check in the first place. If we believe the S3 repository implementation is sufficiently safe, there isn't much point in doing things differently in GCS and Azure. That would also allow for quite a bit of code simplification and would resolve this issue without having to add any new functionality.
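For concreteness, a minimal sketch of the two upload styles being compared, written against the google-cloud-storage Java client with hypothetical bucket/blob names rather than the actual blob store code:

```java
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

class UploadStyles {
    private static final Storage STORAGE = StorageOptions.getDefaultInstance().getService();

    // S3-style semantics: unconditional PUT, last writer wins, no collision check.
    static void writeS3Style(String bucket, String name, byte[] bytes) {
        STORAGE.create(BlobInfo.newBuilder(BlobId.of(bucket, name)).build(), bytes);
    }

    // Current GCS-style semantics: an ifGenerationMatch(0) precondition, so the
    // create fails with HTTP 412 if the blob already exists.
    static void writeWithCollisionCheck(String bucket, String name, byte[] bytes) {
        STORAGE.create(BlobInfo.newBuilder(BlobId.of(bucket, name)).build(), bytes,
                Storage.BlobTargetOption.doesNotExist());
    }
}
```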
Definitely makes sense for the shard-level metadata, since those blobs are named with UUIDs. I'm not sure about the top-level one, though: I think the S3 version is the best we can do there, but the extra protection on GCS and Azure has value.
We upload metadata blobs to GCS and Azure with a precondition check that prevents overwriting existing blobs, to try to avoid concurrent repository modifications causing unrecoverable corruption. However, we also retry failed uploads up to 3 times (on GCS). It's possible for an upload to succeed without us receiving the success response, triggering a retry which then fails the precondition check because the previous upload already succeeded. Empirically, this seems to happen roughly every few hours across all the clusters to which I have access.
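To illustrate the failure mode, here's a hedged sketch in terms of the GCS Java client; the retry loop and names are illustrative stand-ins, not the real SDK/blob-store retry machinery. If the first attempt succeeds server-side but the response is lost, the retry fails the precondition even though our own write put the blob there:

```java
import java.io.IOException;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageException;

class PreconditionRetryExample {
    // Simplified stand-in for the retry handling (up to 3 attempts on GCS).
    static void writeMetadataBlob(Storage storage, BlobId blobId, byte[] bytes) throws IOException {
        BlobInfo blobInfo = BlobInfo.newBuilder(blobId).build();
        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                // Precondition: only create the blob if it does not exist yet.
                storage.create(blobInfo, bytes, Storage.BlobTargetOption.doesNotExist());
                return;
            } catch (StorageException e) {
                if (e.getCode() == 412) {
                    // If attempt 1 actually succeeded but we never saw the 200 (e.g. the
                    // connection dropped), attempt 2 lands here and the whole write is
                    // reported as a failure even though the blob holds exactly our bytes.
                    throw new IOException("blob already exists", e);
                }
                if (attempt == 3 || !e.isRetryable()) {
                    throw new IOException("upload failed", e);
                }
                // otherwise: inconclusive/retryable failure, try again
            }
        }
    }
}
```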
I think we should account for this case better. If we get an inconclusive response to a write operation (e.g. a network timeout or disconnection, or probably most 5xx errors) and the retry fails its precondition check, then I believe it would be safe to check whether the contents of the blob are what we intended them to be. We could also perhaps retry the whole finalisation process (with new generations as needed), although this would be a bigger change.
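A hedged sketch of the first option, again against the GCS Java client with illustrative names rather than the real blob container code: only when an earlier attempt was inconclusive do we downgrade a 412 to a success, and only if the stored bytes match what we tried to write:

```java
import java.io.IOException;
import java.util.Arrays;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageException;

class CompareOnPreconditionFailure {
    static void writeMetadataBlobChecked(Storage storage, BlobId blobId, byte[] bytes) throws IOException {
        boolean sawInconclusiveFailure = false;
        BlobInfo blobInfo = BlobInfo.newBuilder(blobId).build();
        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                storage.create(blobInfo, bytes, Storage.BlobTargetOption.doesNotExist());
                return;
            } catch (StorageException e) {
                if (e.getCode() == 412) {
                    // Treat the precondition failure as a success only if an earlier
                    // attempt may have succeeded without us seeing the response and
                    // the blob now holds exactly the bytes we tried to write.
                    if (sawInconclusiveFailure && Arrays.equals(bytes, storage.readAllBytes(blobId))) {
                        return;
                    }
                    throw new IOException("blob already exists with different contents", e);
                }
                if (attempt == 3 || !e.isRetryable()) {
                    throw new IOException("upload failed", e);
                }
                sawInconclusiveFailure = true; // timeout/5xx etc.: the write may have landed
            }
        }
    }
}
```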