docker / setup-buildx-action

GitHub Action to set up Docker Buildx
https://github.com/marketplace/actions/docker-setup-buildx
Apache License 2.0
952 stars 149 forks source link

hangs for 2 minutes while storing buildx in cache #293

Closed nicks closed 8 months ago

nicks commented 8 months ago

Contributing guidelines

I've found a bug, and:

Description

setup-buildx-action consistently hangs for 2 minutes.

Expected behaviour

No pause

Actual behaviour

pause

Repository URL

private

Workflow run URL

private

YAML workflow

Relevant snippet:


      - name: Docker Login
        uses: docker/login-action@v2
        with:
          username: my_username
          password: ${{ secrets.MY_PAT }}

      - name: Setup Hydrobuild
        uses: docker/setup-buildx-action@v3
        with:
          version: "lab:latest"
          driver: cloud
          endpoint: my-endpoint
          install: true

Workflow logs

::group::Inspect builder
Thu, 25 Jan 2024 15:18:40 GMT
Inspect builder
...
Thu, 25 Jan 2024 15:18:40 GMT
  ::endgroup::
Thu, 25 Jan 2024 15:20:48 GMT
##[debug]Node Action run completed with exit code 0

(Note the long pause between inspecting the builder and the node action completing)

BuildKit logs

No response

Additional info

@crazy-max spent some time poking at this and couldn't figure it out. but some additional data points:

nicks commented 8 months ago

hmmm moving this out of vpn didn't fix the problem, not i'm even more confused. need to dig in more.

nicks commented 8 months ago

(i think there's some sort of caching issue at play that made it seem like doing it outside of vpn fixed it)

nicks commented 8 months ago

OK, i'm 90% sure the problem is this line -- https://github.com/docker/actions-toolkit/blob/9282d3e13b5c0ff919207d30e173b0504a8cd8df/src/buildx/install.ts#L334

if i comment out that line, the problem goes away. when it uploads the file, GHA does some sort of work in the background to keeps the step open. maybe the cache save is just slow?

nicks commented 8 months ago

(updated the description to make it clear this is not a vpn problem, but now i don't really know why i can consistently reproduce this in one repo but not others)

nicks commented 8 months ago

ahhh i think this might be a github actions bug

Screenshot from 2024-01-29 17-29-33

my new theory is that 1) cache.save() is triggering a GC 2) the GC is borked and just causes pauses

nicks commented 8 months ago

what if we move the cache.save() into the post part of the job? i think that's technically more correct anyway -- it's what other github actions stuff does

nicks commented 8 months ago

posted this change that at least isolates the weird behavior - https://github.com/docker/actions-toolkit/pull/241

nicks commented 8 months ago

one thing i don't really understand about setup-buildx-action's caching strategy is that it's creating a separate cache for each PR branch. (see screenshot)

Screenshot from 2024-01-31 12-26-07

wouldn't it make more sense for the cache to be global for the repo?

nicks commented 8 months ago

ok, we figured out the root cause. we don't think this is a setup-buildx-action issue. writing down the problem in case we see it again:

github actions uses a layered cache approach. each branch can write to exactly one cache (the cache "owned" by the branch). But each branch implicitly reads from multiple caches based on the branch structure. So if you're on a feature branch, that branch can read from main and feature.

for more info on this, see: https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows#restrictions-for-accessing-a-cache

this works really well if you run a github action on both pull and push (which most ci systems built on github actions do)

the repo where we were having this problem only ran setup-buildx-action on pull, not on push.

this leads to weird and pathological cache behavior, because every branch starts with an empty cache. Common values never get written back to the main branch cache. And I think it's leading to the github actions cache implementation behaving very strangely with weird pauses, etc because it has hundreds of disconnected caches, each with their own duplicate copy of buildx.

i still think it might make sense to do move the cache save call until the post-action step (like other github actions do), but let's close this issue.