Fix concurrent Multipart Upload

AliSoftware commented 4 months ago

Ref to private thead discussing the issue: p1717455440382939/1714552456.261469-slack-CC7L49W13

Context

The way our git mirror caching mechanism works is that, before cloning a repo:

In case of git mirror cache miss, we check if there's a git-{slugified-repo-url}-{year}-{month}.aar file for the current month and if so we download it, uncompress it, and use it as --reference during subsequent git clone.
But if there isn't such .aar cache for that month, the CI job will compress the current git repo into an .aar and upload it to S3 for subsequent CI jobs to use it as a reference for their subsequent git clones.

This means that a new git cache is only uploaded on the start of each month (which is plenty enough for our needs).

The issue

On the start of this month (2024.06), when the 2024-06 cache was detected as not existing yet by multiple jobs, they all started to compress their local git clone into .aar and trying to upload the archive to S3 likely in parallel. This resulted in the following errors in our CI logs:

Publishing the Git Mirror for git@github.com:tumblr/orangina.git
"not found"
Compressing /opt/ci/git-mirrors/git-github-com-tumblr-orangina-git/ to file:///opt/ci/var/tmp/git-github-com-tumblr-orangina-git-2024-06.aar
Uploading mirror to git-github-com-tumblr-orangina-git-2024-06.aar
99.65% [121 MB/s, About 0s remaining]
Error: One or more of the specified parts could not be found.  The part may not have been uploaded, or the specified entity tag may not match the part's entity tag.

The theory

This didn't happen in the past months, because up until https://github.com/Automattic/hostmgr/pull/89, all our multipart uploads (MPUs) started from scratch, so each CI job triggered its own MPU for the {slugified-repo-url}-{year}-{month}.aar object key in S3, each initiating a CreateMultipartUpload request with its own separate MPU ID. So those didn't conflict.

But after https://github.com/Automattic/hostmgr/pull/89, in order to try and resume a pending MPU that might have been interrupted or might have failed mid-flight (which was common when we use MPU to upload large Xcode VM images to S3 from our individual machines, where it's much less likely that we'd have a race condition of two humans deciding to upload the same new Xcode VM to S3 at the same time from different machines), the new implementation calls the ListMultipartUploads API to check if there's a pending MPU for that object key… and if there is, checks the checksums of already-uploaded parts then resume the upload only for the missing parts.

This means that CI job 1 might start an MPU (and get a specific MPU ID for it), start uploading some parts, and then CI job 2 might also want to start uploading the git cache .aar for the same object key, and will first ask ListMultipartUploads if there's a pending MPU for it, find it, and thus use that same MPU ID to continue the upload that CI job 1 started… while CI job 1 is still also uploading the same parts to S3 for that same MPU ID in parallel.

I'm guessing that's the reason why we end up with one of the jobs ending in error complaining about incorrect ETags being used in the CompleteMultipartUpload request at the end.

Ideas for the fix

The MPU resume feature was implemented only having the case of uploading large Xcode VMs manually from one's local machine, where there would only be one human trying to update a given object key (e.g. xcode-15.4.vmtemplate.aar) at once.

But in practice we need to ensure that we also handle all the use cases for git cache too. Here are a couple of alternative ideas for approaching that:

1: Mutex mechanism

Implement a locking/mutex mechanism (write-barrier), where if one CI job starts writing/uploading an updated .aar git cache for a given object key in S3, the other jobs would wait for it to finish before re-checking their cache (and typically detecting that the .aar cache for that month is now available and read it, instead of trying to upload their own .aar in turn).

This would be an ideal solution in terms of performance, because it would avoid wasting bandwidth in uploading the exact same file to S3 multiple times (by multiple jobs).
But implementing such a solution might be tricky (because it would require the mutex to be shared by multiple jobs, so we might need some dynamodb to implement it—like we do for the terraform state lock—or the like?

2: Only enable resume for Xcode VMs, not git cache

We could decide to only use the resume mechanism of MPUs for hostmgr vm publish (intended to be run only by one human at a time on their local machine) but not for hostmgr cache publish-git-mirror (which is run by CI and can thus be run by multiple CI jobs concurrently).

This means we'll go back to how things were behaving for git cache before #89 landed—which means the same git cache and .aar file was uploaded to S3 multiple times, by multiple CI jobs that might be running in parallel; but at least they'd use distinct MPU IDs to upload the same object key concurrently, which is valid and supported by S3, so it will be an easy-to-implement solution that we know will work… even if a bit wasteful in terms of bandwidth.

3: Use a local JSON file to track pending MPUs initiated by that machine

Probably the best of both worlds (?), use a local JSON file (/opt/ci/var/pending-uploads.json?) to keep track of the MPUs that have been initiated by the current machine locally.

We could then use this to:

Make the implementation of MPU resume feature to only allow resuming MPUs that are returned by ListMultipartUploads API if the MPU ID is also listed in that local file (i.e. only if that MPU was initiated by the local machine and not by another, different Mac or CI job). That would avoid the ETag conflict issue (and sharing use of same MPU IDs between machines), though would still have the drawback of solution 2 (each concurrent CI job uploading the same file to S3, wasting bandwidth)
Or use that info kind of as the mutex from solution 1
- i.e. if ListMultipartUploads returns a pending upload for that object key but we don't have the corresponding MPU ID in our local JOB file, it'd be a hint that another job is already in the process of uploading/updating that file/object key.
- In such case, we could make the job wait a bit, and re-check the cache status after some time T (in the hope that the other CI job which were taking care of that pending MPU and cache update has finished uploading it).
- If after N re-checks (with exponential back-off?) the object key is still not present, assume that the job 1 that was taking care of that MPU might ended up being cancelled or aborted mid-flight, and thus allow job 2 to upload its own .aar file after all (but initiating a brand new MPU with a separate MPU ID to still avoid risk of ETag clashes).

That way this would be a middle ground between "try to only let once CI job upload the git cache and have only one upload of the same git cache overall if possible (to avoid wasting bandwidth and S3 upload costs)" and "still allow multiple uploads of same object keys as a fallback in case the first job that started the upload was interrupted mid-flight and would otherwise block all other jobs (similar to if we took the mutex but never released it)"

crazytonyli commented 3 months ago

🙈 I'm not sure if this counts as cheating, but... what do you think adding an option to use awscli? Like hostmgr cache publish-git-mirror --use-awscli [/path/to/bin/aws]? And internally hostmgr can call aws s3 cp /path/to/.aar s3://..../

AliSoftware commented 3 months ago

Interesting idea… but I'm not sure it would bring us much compared to just disable resume feature when calling tinys3 from publish-git-mirror command?

When I implemented the resume operation, I already exposed an allowResume: Bool = true parameter to the S3Client method and to the S3Server method that calls it, so that the MultipartUploadOperation logic can disable that feature on demand if we want to

So all there's to do to disable resume for publish-git-mirror (aka to implement proposition 2 from my comment above) would be to add allowResume: false to this call site and that should do it.

iangmaia commented 2 months ago

Given we implemented 2: Only enable resume for Xcode VMs, not git cache on https://github.com/Automattic/hostmgr/pull/103, I will close this issue for now.

Automattic / hostmgr