fastlane / docs

All the fastlane docs
https://docs.fastlane.tools

Bad concurrency for prod deploys #1245

Open janbrasna opened 4 months ago

janbrasna commented 4 months ago

With many PRs merged in quick succession, and given how long CI takes between the initial checkout and the push to gh-pages after building, several jobs running at the same time inevitably run into this issue:

[gh-pages d8c30f64] Deployed with mkdocs, version 1.1.2 from /home/circleci/.local/share/virtualenvs/code-6yRgnUSz/lib/python3.8/site-packages/mkdocs (Python 3.8)
 552 files changed, 859 insertions(+), 859 deletions(-)
To github.com:fastlane/docs.git
 ! [rejected]          gh-pages -> gh-pages (fetch first)
error: failed to push some refs to 'git@github.com:fastlane/docs.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

Exited with code exit status 1

So you effectively end up not having the merged changeset published at that point. You can only hope the next push to master won't take too long to happen, to incorporate all the previous (failed) deploys to prod with it… 🤷

I'm not a CircleCI expert so take this with a pinch of salt, but… it seems the earliest commit "wins" here trying to deploy to prod, whereas normally you'd have the most recent cancelling the previous ones and eventually "winning" in the priority to deploy, not being blocked by the previous ones running concurrently to cause conflicts at the end.

rogerluan commented 3 months ago

Interesting issue! Despite being less ideal, in this case I think we could fix it by having newest commits cancelling previous ongoing builds 👀 that'd effectively solve the problem and I don't see significant drawbacks.

Not sure how to achieve this with CircleCI though, and I won't have time to investigate this any time soon 😥 happy to review PRs or other changes in the meantime though!

janbrasna commented 3 months ago

I'm used to the behaviour needed in GHA but it seems it's not exactly that straightforward in CircleCI:

So my take would simply be adding --force at: https://github.com/fastlane/docs/blob/54969f497ed79d396434ffd2e4a77bb21dcce8a6/scripts/ci/deploy.sh#L48-L49

but only in the master context / publish CI, not when the script is run otherwise (manually, locally etc.), as there might be more users of the script, so I'm not confident enough to just propose -f there and call it a day. Leaving that to others to come up with something maybe more sophisticated;]

(This would be still far from perfect, as that doesn't prefer the build that starts last, but one that finishes last, and that's a huge difference;)… throw in some timeout, connection/performance or cache woes like lately, and you can have an older commit overwriting the output of a newer one just by getting stuck for a bit longer in there…) 🤷‍♂️
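A minimal sketch of the "force only in the master publish context" idea, not the actual deploy.sh change. It assumes CircleCI's built-in `CIRCLECI` and `CIRCLE_BRANCH` environment variables; the helper name `push_flags` and the `origin` remote are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the extra flags to pass to `git push`.
# Prints "--force" only when running on CircleCI for the master branch,
# and nothing for manual/local runs or other branches.
push_flags() {
  if [ "${CIRCLECI:-}" = "true" ] && [ "${CIRCLE_BRANCH:-}" = "master" ]; then
    echo "--force"
  fi
}

# Example use (assumption: the remote is "origin", the target is gh-pages):
#   git push $(push_flags) origin gh-pages
```

Anyone running the script by hand would keep the regular, non-forced push. The caveat above still applies: whichever job finishes last overwrites the others, regardless of which commit is newer.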

rogerluan commented 3 months ago

Thanks for digging up that info for CircleCI. It seems like they don't offer "auto cancel builds", which is kinda underwhelming 🤕 I wouldn't expect that.

Some alternative solutions:

Thoughts?

janbrasna commented 3 months ago

Yea we've had race conditions e.g. where a workflow would need a docker built from the same sha that might not have already been published to the registry, so the cron fallback for failed pipelines sounds uncomfortably familiar;]

The build is simple enough to be pushed straight to a deployment environment via GHA, getting rid of the gh-pages branch and its underlying git tree completely, and I'd welcome that. But I don't think you can make GHA run only if the previous checks, i.e. the CircleCI build & test, pass. The containerised fastlanetools/ci test image is just docker anyway, so it shouldn't be too prohibitive to move that to GHA as well, keeping the whole CI just here… but it would mean disjoining pipelines from fastlane/fastlane, which is kinda 💩…

janbrasna commented 3 months ago

But the problem is pretty trivial in this case. The bundler woes slowed down the CI, and it took ~10 minutes or more from the initial checkout to the actual switch & commit step. So before resorting to bigger changes or force pushing, I'd just try #1250, adding an extra fetch to check out the fresh gh-pages tip instead of the head that's been lying around for minutes already… (at the same time, the current bundler version resolves take only seconds, so that should help avoid conflicts too…)
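The "extra fetch" idea could look roughly like the sketch below. This is an illustration under assumptions, not the content of #1250: the helper name `refresh_gh_pages` and the `origin` remote are hypothetical, and it assumes the working tree is clean when it runs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: right before the switch & commit step, re-fetch
# gh-pages and point the local branch at the fresh remote tip, so the
# deploy commit lands on top of whatever other jobs pushed meanwhile.
refresh_gh_pages() {
  git fetch origin gh-pages &&
    git checkout -B gh-pages origin/gh-pages
}
```

This only narrows the race window to the time between the fetch and the push. Two jobs reaching this step at nearly the same moment can still conflict.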