docker / setup-buildx-action

GitHub Action to set up Docker Buildx
https://github.com/marketplace/actions/docker-setup-buildx
Apache License 2.0

Action sporadically fails with exec /usr/bin/buildctl: exec format error #313

Closed: clarkohw closed this issue 5 months ago

clarkohw commented 5 months ago


Description

The docker/setup-buildx-action@v3 action sporadically fails at the "Booting builder" step. The sporadic nature of the issue seems similar to https://github.com/docker/setup-buildx-action/issues/283, but I am not using self-hosted runners, and I am getting different error messages.

Expected behaviour

The action should install buildx.

Actual behaviour

Occasionally (roughly 10% of the time), the "Booting builder" step of the action fails.
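Because the failure is intermittent, one possible stopgap (an assumption, not something proposed in this thread) is to let a first attempt fail softly and run the action a second time. The step id "buildx" below is a hypothetical name. steps.<id>.outcome reports the result before continue-on-error is applied, so the second step only runs when the first attempt actually failed. If the cause lies in the runner image itself, a retry on the same runner may well fail again.

```yaml
      # Hypothetical step id "buildx"; this first attempt is allowed to fail.
      - name: Set up Docker Buildx
        id: buildx
        uses: docker/setup-buildx-action@v3
        continue-on-error: true

      # Second attempt runs only if the first attempt failed.
      - name: Set up Docker Buildx (retry)
        if: steps.buildx.outcome == 'failure'
        uses: docker/setup-buildx-action@v3
```

Later steps that read steps.buildx.outputs would still see the outputs of the first attempt, not the retry.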

Repository URL

No response

Workflow run URL

No response

YAML workflow

integration-tests:
    needs: [ setup-matrix, compile-contracts ]
    timeout-minutes: 20
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix: ${{fromJson(needs.setup-matrix.outputs.matrix)}}
    steps:
      - name: Add hosts to /etc/hosts
        run: |
          sudo echo "127.0.0.1 **.local.**.com" | sudo tee -a /etc/hosts
          sudo echo "127.0.0.1 **.local.**.com" | sudo tee -a /etc/hosts

      - name: Checkout ** from ${{ github.event.pull_request.base.ref }}
        uses: actions/checkout@v2

      - name: Cache contract artifacts
        uses: actions/cache@v3
        with:
          fail-on-cache-miss: true
          path: |
            ./abis/
            ./artifacts/
            ./cache/
            ./typechain-types/
          key: ${{ runner.os }}-compiled-contracts-${{ hashFiles('./contracts/') }}

      - name: Set Branch Name
        run: echo "GH_BRANCH_NAME=${{ github.event_name == 'workflow_dispatch' && github.ref_name || github.event.pull_request.base.ref }}" >> $GITHUB_ENV

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11.5'
          cache: 'pip'
          token: ${{ secrets.GH_ADMIN_TOKEN }}
      - run: pip install -r requirements.txt

      - name: Set up Docker
        uses: docker/setup-buildx-action@v3

      - name: Set up AWS CLI
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{env.AWS_REGION}}

      - name: Use Node.js ${{ env.NODE_VERSION }}
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'yarn'
      - run: yarn install
      - run: yarn global add pm2

      - name: Install Foundry
        uses: foundry-rs/foundry-toolchain@v1

      - uses: nick-fields/retry@v2
        with:
          timeout_minutes: 10
          max_attempts: 2
          command: python3 -u run.py ${{ matrix.test }} --docker

Workflow logs

Run docker/setup-buildx-action@v3
Docker info
Buildx version
Creating a new builder instance
  /usr/bin/docker buildx create --name builder-20e51f00-009d-499f-ba6b-ec39d5720f3f --driver docker-container --buildkitd-flags --allow-insecure-entitlement security.insecure --allow-insecure-entitlement network.host --use
  builder-20e51f00-009d-499f-ba6b-ec39d5720f3f
Booting builder
  /usr/bin/docker buildx inspect --bootstrap --builder builder-20e51f00-009d-499f-ba6b-ec39d5720f3f
  #1 [internal] booting buildkit
  #1 pulling image moby/buildkit:buildx-stable-1
  #1 pulling image moby/buildkit:buildx-stable-1 0.2s done
  #1 creating container buildx_buildkit_builder-20e51f00-009d-499f-ba6b-ec39d5720f3f0
  #1 17.79 time="2024-04-08T14:25:18Z" level=warning msg="using host network as the defaul#1 creating container buildx_buildkit_builder-20e51f00-009d-499f-ba6b-ec39d5720f3f0 17.6s done
  time="2024-04-08T14:25:18Z" level=warning msg="using host network as the default"
  #1 17.79 time="2024-04-08T14:25:18Z" level=warning msg="skipping containerd worker, as \"/run/containerd/containerd.sock\" does not exist"
  #1 17.79 dtime="2024-04-08T14:25:18Z" level=info msg="found 1 workers, default=\"lh4fhblojyqn1krqg9m33sxft\""
  #1 17.79 `time="2024-04-08T14:25:18Z" level=warning msg="currently, only the default worker can be used."
  #1 17.79 \time="2024-04-08T14:25:18Z" level=info msg="running server on /run/buildkit/buildkitd.sock"
  #1 17.79 time="2024-04-08T14:25:18Z" level=warning msg="skipping containerd worker, as \"/run/containerd/containerd.sock\" does not exist"
  #1 17.79 time="2024-04-08T14:25:18Z" level=warning msg="currently, only the default worker can be used."
  #1 17.79 time="2024-04-08T14:25:18Z" level=warning msg="currently, only the default worker can be used."
  #1 17.79 exec /usr/bin/buildctl: exec format error
  #1 ERROR: exit code 1
  ------
   > [internal] booting buildkit:
  time="2024-04-08T14:25:18Z" level=warning msg="using host network as the default"
  17.79 time="2024-04-08T14:25:18Z" level=warning msg="skipping containerd worker, as \"/run/containerd/containerd.sock\" does not exist"
  17.79 dtime="2024-04-08T14:25:18Z" level=info msg="found 1 workers, default=\"lh4fhblojyqn1krqg9m33sxft\""
  17.79 `time="2024-04-08T14:25:18Z" level=warning msg="currently, only the default worker can be used."
  17.79 \time="2024-04-08T14:25:18Z" level=info msg="running server on /run/buildkit/buildkitd.sock"
  17.79 time="2024-04-08T14:25:18Z" level=warning msg="skipping containerd worker, as \"/run/containerd/containerd.sock\" does not exist"
  17.79 time="2024-04-08T14:25:18Z" level=warning msg="currently, only the default worker can be used."
  17.79 time="2024-04-08T14:25:18Z" level=warning msg="currently, only the default worker can be used."
  17.79 exec /usr/bin/buildctl: exec format error
  ------
  ERROR: exit code 1
Error: The process '/usr/bin/docker' failed with exit code 1
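An "exec format error" means the kernel refused to execute /usr/bin/buildctl, which typically happens when a binary was built for a different CPU architecture or is not a valid executable at all. A minimal diagnostic sketch, assuming it is placed after the setup-buildx-action step in the same job (the step name is hypothetical), prints the runner's architecture next to the architecture of the pulled BuildKit image so a mismatch can be ruled in or out:

```yaml
      # Runs only when an earlier step (e.g. setup-buildx-action) has failed.
      - name: Diagnose exec format error
        if: failure()
        run: |
          # CPU architecture of the runner host (x86_64 expected on ubuntu-latest)
          uname -m
          # Architecture as reported by the Docker daemon
          docker info --format '{{.Architecture}}'
          # OS/architecture recorded in the BuildKit image the builder pulled
          docker image inspect --format '{{.Os}}/{{.Architecture}}' moby/buildkit:buildx-stable-1
```

The daemon reports x86_64 while image metadata uses amd64 for the same architecture. If the two agree, the binary is more likely damaged than built for the wrong platform.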

But I also recently got this error message:

Run docker/setup-buildx-action@v3
Docker info
Buildx version
Creating a new builder instance
Booting builder
  /usr/bin/docker buildx inspect --bootstrap --builder builder-4f1b9e61-ada5-4527-ae74-af370ab097db
  #1 [internal] booting buildkit
  #1 pulling image moby/buildkit:buildx-stable-1
  #1 pulling image moby/buildkit:buildx-stable-1 0.4s done
  #1 creating container buildx_buildkit_builder-4f1b9e61-ada5-4527-ae74-af370ab097db0
  #1 18.04 /usr/bin/buildkitd: line 0: syntax error: unexpected word (expecting ")")
  #1 creating container buildx_buildkit_builder-4f1b9e61-ada5-4527-ae74-af370ab097db0 17.6s done
  #1 18.04 /usr/bin/buildkitd: line 0: syntax error: unexpected word (expecting ")")
  #1 18.04 /usr/bin/buildkitd: line 0: syntax error: unexpected word (expecting ")")
  #1 18.04 /usr/bin/buildkitd: line 0: syntax error: unexpected word (expecting ")")
  #1 18.04 /usr/bin/buildkitd: line 0: syntax error: unexpected word (expecting ")")
  #1 18.04 /usr/bin/buildkitd: line 0: syntax error: unexpected word (expecting ")")
  #1 18.04 
  #1 ERROR: Error response from daemon: Container 38688efaa8949e35e2e2ff6d861513afbc4cfbdd24e4e40366dd65ef2d6b05dc is restarting, wait until the container is running
  ------
   > [internal] booting buildkit:
  18.04 
  18.04 
  18.04 /usr/bin/buildkitd: line 0: syntax error: unexpected word (expecting ")")
  ------
  ERROR: Error response from daemon: Container 38688efaa8949e35e2e2ff6d861513afbc4cfbdd24e4e40366dd65ef2d6b05dc is restarting, wait until the container is running

BuildKit logs

No response
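One way to collect BuildKit logs is sketched below, assuming the steps run in the same job (step names are hypothetical). It dumps the logs of every buildx_buildkit_* container, matching the container names shown in the logs above (buildx_buildkit_<builder name>0). Adding --debug to the daemon flags makes these logs more verbose; here it is appended to the default flags visible in the "Creating a new builder instance" output, since setting buildkitd-flags replaces them.

```yaml
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          # Default flags seen in the log, plus --debug
          buildkitd-flags: --debug --allow-insecure-entitlement security.insecure --allow-insecure-entitlement network.host

      # always() so the logs are printed even if booting the builder failed
      - name: Dump BuildKit container logs
        if: always()
        run: |
          for c in $(docker ps -a --filter name=buildx_buildkit_ --format '{{.Names}}'); do
            echo "::group::$c"
            docker logs "$c"
            echo "::endgroup::"
          done
```

The dump step runs before the action's post-job cleanup removes the builder, so the container should still exist at that point, even if it is stuck restarting.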

Additional info

One potentially relevant factor: at times we run many workflows concurrently (more than 20), so I wonder whether the failures are related to that.

gete76 commented 5 months ago

+1. We also run both GitHub-hosted and self-hosted runners, with many parallel workflow runs that use this action. Could it be a cache issue? Pinning to an older version that has been stable for us worked around the issue:

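  - name: Set up Docker Buildx
    uses: docker/setup-buildx-action@v3
    with:
      platforms: linux/amd64
      version: v0.11.2                          # pin the Buildx CLI release
      buildkitd-flags: --debug                  # verbose BuildKit daemon logs
      driver-opts: image=moby/buildkit:v0.11.2  # pin BuildKit instead of moby/buildkit:buildx-stable-1
      cache-binary: false                       # do not cache the Buildx binary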

tonistiigi commented 5 months ago

> We also run both GH and self hosted runners with many parallel workflow runs that use this action. A cache issue? Pinning to an older version that's been stable for us has patched the issue for us:

Let us know if this shows up with the older version as well. At the moment nothing points to an issue with our release, and parallel workflow runs are out of our control as well.

I have 100 clean runs in a row in https://github.com/tonistiigi/gh-exec-format-error-debug/actions/runs/8606687477, set up based on another report. If you can point me to any differences that should be tried instead to reproduce this, let me know.

osarobo commented 5 months ago

For those experiencing this issue: I think @tonistiigi may have used the updated runner image released yesterday; see https://github.com/actions/runner-images/releases.

Try again with the latest docker/setup-buildx-action@v3 version and see if you are still having the unexpected behaviour.

gete76 commented 5 months ago

> > We also run both GH and self hosted runners with many parallel workflow runs that use this action. A cache issue? Pinning to an older version that's been stable for us has patched the issue for us:
>
> Let us know if this shows up in older version as well. There is nothing atm pointing to issue with our release and parallel workflow runs are out of our control as well.
>
> I have 100 clean runs in a row in https://github.com/tonistiigi/gh-exec-format-error-debug/actions/runs/8606687477 based on another report. If you can point me any differences what should be tried instead to reproduce this then lmk.

Last Friday, the error started showing up very frequently in our CI merge queue; by the end of the day it was causing almost all merge queue runs to be kicked out of the queue. Reading up on the error message "Error: The process '/usr/bin/docker'" and the other messages about the default network seemed to point to a version mismatch between BuildKit and Buildx. I tried running the action with only cache-binary set to false, to test it with the latest packages, hoping it was a cache issue, but the error still showed up.

tonistiigi commented 5 months ago

@gete76 And you still see the issue?

gete76 commented 5 months ago

@tonistiigi, I haven't tested that default setting today because I don't want to disrupt our CI (SLOs and whatnot). I'll have to find a way to test this without disruption.

gete76 commented 5 months ago

@tonistiigi, I can tell you it did show up yesterday morning around 11 AM EST, when I tested the latest (default) packages with no caching.

gete76 commented 5 months ago

> For those experiencing this issue, I think @tonistiigi may have used the new updated runner build released yesterday, see https://github.com/actions/runner-images/releases.
>
> Try again with the latest docker/setup-buildx-action@v3 version and see if you are still having the unexpected behaviour.

Thanks, I'll give this a try. It does appear that this only happens on our GitHub-hosted runners; our internal runners are built from the summerwind action-runner image.

tonistiigi commented 5 months ago

At the moment this looks like a GitHub-side issue related to the 20240403.1.0 runner image release, which now appears to have been deleted: https://github.com/actions/runner-images/blob/ubuntu20/20240403.1/images/ubuntu/Ubuntu2004-Readme.md (404).

Related issue: https://github.com/actions/runner-images/issues/9632. Comment about the release being broken: https://github.com/actions/runner-images/issues/9654#issuecomment-2042746391
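Since the failures were tied to a specific runner image release, it helps to record which image each job ran on. The "Set up job" step of a GitHub-hosted run already lists the image version; the sketch below additionally assumes the ImageOS and ImageVersion environment variables that GitHub-hosted images define (they may be absent on self-hosted runners) and emits a warning on the 20240403.1.0 release discussed here. The step name is hypothetical.

```yaml
      - name: Show runner image
        run: |
          # ImageOS / ImageVersion are assumed to be set on GitHub-hosted runner images
          echo "Runner image: ${ImageOS:-unknown} ${ImageVersion:-unknown}"
          if [ "${ImageVersion:-}" = "20240403.1.0" ]; then
            echo "::warning::Runner image 20240403.1.0 was implicated in exec format errors in docker/setup-buildx-action#313"
          fi
```

Logging the image version on every run makes it easier to correlate sporadic failures with a runner image rollout instead of with the action's own release.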

gete76 commented 5 months ago

Confirmed, this new runner image has resolved the issue.