Closed nicks closed 8 months ago
hmmm moving this out of vpn didn't fix the problem, not i'm even more confused. need to dig in more.
(i think there's some sort of caching issue at play that made it seem like doing it outside of vpn fixed it)
OK, i'm 90% sure the problem is this line -- https://github.com/docker/actions-toolkit/blob/9282d3e13b5c0ff919207d30e173b0504a8cd8df/src/buildx/install.ts#L334
if i comment out that line, the problem goes away. when it uploads the file, GHA does some sort of work in the background to keeps the step open. maybe the cache save is just slow?
(updated the description to make it clear this is not a vpn problem, but now i don't really know why i can consistently reproduce this in one repo but not others)
ahhh i think this might be a github actions bug
my new theory is that 1) cache.save() is triggering a GC 2) the GC is borked and just causes pauses
what if we move the cache.save() into the post part of the job? i think that's technically more correct anyway -- it's what other github actions stuff does
posted this change that at least isolates the weird behavior - https://github.com/docker/actions-toolkit/pull/241
one thing i don't really understand about setup-buildx-action's caching strategy is that it's creating a separate cache for each PR branch. (see screenshot)
wouldn't it make more sense for the cache to be global for the repo?
ok, we figured out the root cause. we don't think this is a setup-buildx-action issue. writing down the problem in case we see it again:
github actions uses a layered cache approach. each branch can write to exactly one cache (the cache "owned" by the branch). But each branch implicitly reads from multiple caches based on the branch structure. So if you're on a feature branch, that branch can read from main and feature.
for more info on this, see: https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows#restrictions-for-accessing-a-cache
this works really well if you run a github action on both pull and push (which most ci systems built on github actions do)
the repo where we were having this problem only ran setup-buildx-action on pull, not on push.
this leads to weird and pathological cache behavior, because every branch starts with an empty cache. Common values never get written back to the main branch cache. And I think it's leading to the github actions cache implementation behaving very strangely with weird pauses, etc because it has hundreds of disconnected caches, each with their own duplicate copy of buildx.
i still think it might make sense to do move the cache save call until the post-action step (like other github actions do), but let's close this issue.
Contributing guidelines
I've found a bug, and:
Description
setup-buildx-action consistently hangs for 2 minutes.
Expected behaviour
No pause
Actual behaviour
pause
Repository URL
private
Workflow run URL
private
YAML workflow
Relevant snippet:
Workflow logs
(Note the long pause between inspecting the builder and the node action completing)
BuildKit logs
No response
Additional info
@crazy-max spent some time poking at this and couldn't figure it out. but some additional data points: