Closed · AliSoftware closed this 2 months ago
🙈 I'm not sure if this counts as cheating, but… what do you think about adding an option to use awscli? Like `hostmgr cache publish-git-mirror --use-awscli [/path/to/bin/aws]`? Internally, hostmgr could then call `aws s3 cp /path/to/.aar s3://..../`.
Interesting idea… but I'm not sure it would bring us much compared to just disabling the resume feature when calling tinys3 from the `publish-git-mirror` command?
When I implemented the resume operation, I already exposed an `allowResume: Bool = true` parameter on the `S3Client` method and on the `S3Server` method that calls it, so that the `MultipartUploadOperation` logic can disable that feature on demand if we want to.

So all there is to do to disable resume for `publish-git-mirror` (aka to implement proposition 2 from my comment above) would be to add `allowResume: false` to this call site, and that should do it.
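For illustration, here's the shape of that flag as a Python stand-in for the Swift `allowResume: Bool = true` parameter described above; every name in this sketch is hypothetical, not hostmgr's actual API:

```python
# Python stand-in for the Swift `allowResume: Bool = true` parameter
# described above -- all names here are hypothetical, for illustration only.

def upload_multipart(object_key, pending_uploads, allow_resume=True):
    """Start or resume a multipart upload (MPU) for `object_key`.

    `pending_uploads` maps object keys to already-started MPU IDs, standing
    in for a ListMultipartUploads call. With `allow_resume=False`, a fresh
    MPU ID is always created -- which is what disabling resume for
    `publish-git-mirror` amounts to.
    """
    if allow_resume and object_key in pending_uploads:
        return ("resumed", pending_uploads[object_key])
    new_id = f"mpu-{len(pending_uploads) + 1}"
    pending_uploads[object_key] = new_id
    return ("started", new_id)

pending = {"git-repo-2024-06.aar": "mpu-1"}
# Xcode VM path: resume allowed, the pending MPU ID is reused.
print(upload_multipart("git-repo-2024-06.aar", pending))                      # ('resumed', 'mpu-1')
# publish-git-mirror path: resume disabled, always a fresh MPU ID.
print(upload_multipart("git-repo-2024-06.aar", pending, allow_resume=False))  # ('started', 'mpu-2')
```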
Given we implemented 2 ("Only enable resume for Xcode VMs, not git cache") in https://github.com/Automattic/hostmgr/pull/103, I will close this issue for now.
Ref to private thread discussing the issue: p1717455440382939/1714552456.261469-slack-CC7L49W13
### Context

The way our git mirror caching mechanism works is that, before cloning a repo, we check whether a `git-{slugified-repo-url}-{year}-{month}.aar` file exists for the current month. If so, we download it, uncompress it, and use it as `--reference` during subsequent `git clone`s. If there's no `.aar` cache for that month yet, the CI job will compress the current git repo into an `.aar` and upload it to S3, for subsequent CI jobs to use as a reference for their own git clones.

This means that a new git cache is only uploaded at the start of each month (which is plenty enough for our needs).
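To make the naming scheme concrete, here's a small Python sketch of how that monthly cache key could be derived (hostmgr itself is Swift; the exact slugification rule below is an assumption for illustration):

```python
from datetime import date

def git_cache_key(repo_url: str, today: date) -> str:
    """Monthly cache key following the `git-{slug}-{year}-{month}.aar`
    scheme above. The slugification rule here is an assumption, not
    hostmgr's actual implementation."""
    slug = "".join(c if c.isalnum() else "-" for c in repo_url).strip("-")
    return f"git-{slug}-{today.year}-{today.month:02d}.aar"

# Every job in the same month computes the same key, which is why the
# first job of the month is the one that ends up uploading the cache.
print(git_cache_key("github.com/Automattic/hostmgr", date(2024, 6, 3)))
# git-github-com-Automattic-hostmgr-2024-06.aar
```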
### The issue

At the start of this month (2024-06), when the `2024-06` cache was detected as not existing yet by multiple jobs, they all started to compress their local git clone into an `.aar` and to upload the archive to S3, likely in parallel. This resulted in the following errors in our CI logs:

### The theory
This didn't happen in past months because, up until https://github.com/Automattic/hostmgr/pull/89, all our multipart uploads (MPUs) started from scratch: each CI job triggered its own MPU for the `{slugified-repo-url}-{year}-{month}.aar` object key in S3, each initiating a `CreateMultipartUpload` request with its own separate MPU ID. So those didn't conflict.

But after https://github.com/Automattic/hostmgr/pull/89, in order to try and resume a pending MPU that might have been interrupted or failed mid-flight (which was common when we used MPUs to upload large Xcode VM images to S3 from our individual machines, where it's much less likely that two humans would decide to upload the same new Xcode VM to S3 at the same time from different machines), the new implementation calls the `ListMultipartUploads` API to check if there's a pending MPU for that object key… and if there is, it checks the checksums of the already-uploaded parts, then resumes the upload only for the missing parts.

This means that CI job 1 might start an MPU (and get a specific MPU ID for it) and start uploading some parts, and then CI job 2 might also want to start uploading the git cache `.aar` for the same object key: it will first ask `ListMultipartUploads` if there's a pending MPU for it, find the one CI job 1 started, and thus use that same MPU ID to continue the upload… while CI job 1 is still uploading parts to S3 for that same MPU ID in parallel. I'm guessing that's why one of the jobs ends up erroring out about incorrect ETags in the final `CompleteMultipartUpload` request.

### Ideas for the fix
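Before going through the options, the failing interaction may be easier to see as a toy simulation. This is plain Python with an in-memory stand-in for S3 (hostmgr is Swift; nothing below is real S3 or hostmgr code):

```python
import hashlib

# Toy in-memory simulation of the race described above. The class below
# only mimics the relevant MPU bookkeeping; it does not talk to real S3.

class FakeS3:
    def __init__(self):
        self.pending = {}   # object_key -> pending MPU ID
        self.parts = {}     # (mpu_id, part_number) -> ETag
        self.counter = 0

    def create_multipart_upload(self, key):
        self.counter += 1
        self.pending[key] = f"mpu-{self.counter}"
        return self.pending[key]

    def list_multipart_uploads(self, key):
        return self.pending.get(key)

    def upload_part(self, mpu_id, part_number, body):
        etag = hashlib.md5(body).hexdigest()
        self.parts[(mpu_id, part_number)] = etag
        return etag

    def complete_multipart_upload(self, mpu_id, etags):
        # Like S3, reject the completion if the listed ETags don't match
        # the parts actually stored for this MPU ID.
        stored = [self.parts[k] for k in sorted(self.parts) if k[0] == mpu_id]
        if etags != stored:
            raise RuntimeError("InvalidPart: ETag mismatch")

s3 = FakeS3()
key = "git-repo-2024-06.aar"

# CI job 1 starts an MPU and uploads part 1.
mpu_id = s3.create_multipart_upload(key)
job1_etags = [s3.upload_part(mpu_id, 1, b"job1 bytes")]

# CI job 2 finds the pending MPU via ListMultipartUploads and "resumes" it,
# re-uploading part 1 with its own bytes (and thus its own ETag).
s3.upload_part(s3.list_multipart_uploads(key), 1, b"job2 bytes")

# CI job 1 now completes with the ETags *it* recorded -- they no longer match.
try:
    s3.complete_multipart_upload(mpu_id, job1_etags)
except RuntimeError as err:
    print(err)  # InvalidPart: ETag mismatch
```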
The MPU resume feature was implemented with only one use case in mind: uploading large Xcode VMs manually from one's local machine, where there would only be one human trying to update a given object key (e.g. `xcode-15.4.vmtemplate.aar`) at a time. But in practice we need to make sure we also handle the git cache use cases. Here are a couple of alternative ideas for approaching that:
#### 1: Mutex mechanism

Implement a locking/mutex mechanism (write barrier): if one CI job starts writing/uploading an updated `.aar` git cache for a given object key in S3, the other jobs would wait for it to finish before re-checking their cache (and typically detecting that the `.aar` cache for that month is now available and reading it, instead of trying to upload their own `.aar` in turn).

#### 2: Only enable resume for Xcode VMs, not git cache
We could decide to only use the resume mechanism of MPUs for `hostmgr vm publish` (intended to be run only by one human at a time on their local machine) but not for `hostmgr cache publish-git-mirror` (which is run by CI and can thus be run by multiple CI jobs concurrently).

This means we'd go back to how things behaved for the git cache before #89 landed: the same git cache `.aar` file would be uploaded to S3 multiple times, by multiple CI jobs that might be running in parallel; but at least they'd use distinct MPU IDs to upload the same object key concurrently, which is valid and supported by S3. So it would be an easy-to-implement solution that we know will work… even if a bit wasteful in terms of bandwidth.

#### 3: Use a local JSON file to track pending MPUs initiated by that machine
Probably the best of both worlds(?): use a local JSON file (`/opt/ci/var/pending-uploads.json`?) to keep track of the MPUs that have been initiated locally by the current machine. We could then use this to:

- Only resume a pending MPU found via the `ListMultipartUploads` API if its MPU ID is also listed in that local file (i.e. only if that MPU was initiated by the local machine and not by another, different Mac or CI job). That would avoid the ETag conflict issue (and the sharing of the same MPU ID between machines), though it would still have the drawback of solution 2 (each concurrent CI job uploading the same file to S3, wasting bandwidth).
- If `ListMultipartUploads` returns a pending upload for that object key but we don't have the corresponding MPU ID in our local JSON file, take it as a hint that another job is already in the process of uploading/updating that file/object key, and decide whether to wait for it or to upload our own `.aar` file after all (but initiating a brand new MPU with a separate MPU ID, to still avoid any risk of ETag clashes).

That way, this would be a middle ground between "try to only let one CI job upload the git cache, and have only one upload of the same git cache overall if possible (to avoid wasting bandwidth and S3 upload costs)" and "still allow multiple uploads of the same object key as a fallback, in case the first job that started the upload was interrupted mid-flight and would otherwise block all the other jobs (similar to if we took the mutex but never released it)".
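The decision logic of idea 3 could be sketched like this. It's a Python stand-in (hostmgr is Swift), the file path comes from the suggestion above, and every function name is hypothetical:

```python
import json
import os

# Sketch of idea 3: only resume MPUs that *this* machine initiated, using a
# local JSON file as the source of truth. Path and helper names are
# illustrative assumptions, not hostmgr's actual implementation.

TRACK_FILE = "/opt/ci/var/pending-uploads.json"

def load_local_mpus(path=TRACK_FILE):
    """Read the object_key -> MPU ID mapping of locally started MPUs."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

def record_local_mpu(key, mpu_id, path=TRACK_FILE):
    """Remember that this machine initiated `mpu_id` for `key`."""
    mpus = load_local_mpus(path)
    mpus[key] = mpu_id
    with open(path, "w") as f:
        json.dump(mpus, f)

def plan_upload(key, remote_pending_id, local_mpus):
    """Decide what to do for `key`, given the pending MPU ID reported by
    ListMultipartUploads (or None) and the local JSON file's contents."""
    if remote_pending_id is None:
        return "start-new-mpu"
    if local_mpus.get(key) == remote_pending_id:
        return "resume-own-mpu"  # we started it, so resuming is safe
    # Someone else's MPU: don't touch it. Either wait for that job, or
    # start a brand new MPU with a separate ID to avoid ETag clashes.
    return "start-new-mpu-or-wait"

local = {"git-repo-2024-06.aar": "mpu-42"}
print(plan_upload("git-repo-2024-06.aar", "mpu-42", local))  # resume-own-mpu
print(plan_upload("git-repo-2024-06.aar", "mpu-99", local))  # start-new-mpu-or-wait
```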