docker / setup-buildx-action

GitHub Action to set up Docker Buildx
https://github.com/marketplace/actions/docker-setup-buildx
Apache License 2.0

Self-Hosted Runners on GHA Workflows with Kubernetes Driver Canceled Context #290

Open pjohnsonrxb opened 9 months ago

pjohnsonrxb commented 9 months ago

Description

Issue: Self-Hosted Runners on GHA Workflows with Kubernetes Driver

Background

We have configured our GitHub Actions (GHA) workflows to use self-hosted runners. Our typical workflow involves:

Problem

We are encountering an issue when using the Kubernetes (k8s) driver for our builds. Our self-hosted runners are deployed on our k8s cluster. We're experiencing a specific error as shown in the screenshot below:

[Error screenshot; image not included]

Kubernetes Container Logs:

time="2023-11-30T22:28:11Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled"

Hypothesis

We suspect that the issue might be related to our runners being behind a VPN. It seems buildx may not be adequately handling network latency associated with a VPN connection.

Observations

References

For additional context, see this related issue.


Seeking insights or suggestions to resolve this intermittent failure with our self-hosted runners in GHA workflows.

Expected Behavior

When using self-hosted runners in GitHub Actions workflows with the Kubernetes (k8s) driver for buildx, we expect the following:

  1. Stable Connection to Build Services: The runners should maintain a stable connection to Docker's build services, regardless of being behind a VPN. Network latency typically associated with VPN connections should not disrupt the build process.

  2. Consistent Build Process: Each action initiated by the workflow should complete successfully without intermittent failures. The build, push, and cache processes via buildx should be executed reliably.

  3. Error-Free Operation: The buildx command, especially when interacting with Kubernetes, should execute without returning errors like /moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled.

  4. Consistency with GitHub Hosted Runners: The performance and reliability of builds using self-hosted runners should be comparable to those observed with GitHub's hosted runners.

The expectation is that the self-hosted runners on our Kubernetes cluster should work as efficiently and reliably as GitHub's hosted runners, ensuring a smooth CI/CD pipeline.

Actual Behavior

When using self-hosted runners in GitHub Actions workflows with the Kubernetes (k8s) driver for buildx, we are encountering the following issues:

  1. Unstable Connection to Build Services: The runners, especially when operating behind a VPN, are experiencing unstable connections to Docker's build services. This is evident from frequent connection cancellations and errors during the build process.

  2. Inconsistent Build Process: The actions initiated by the workflow are not completing consistently. Approximately 20% of the actions (1 in 5) fail intermittently, showcasing a lack of reliability in the build, push, and cache processes via buildx.

  3. Frequent Errors: We are frequently encountering errors such as /moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled. These errors suggest issues with the interaction between buildx and Kubernetes.

  4. Disparity with GitHub Hosted Runners: Unlike the smooth operation observed with GitHub's hosted runners, our self-hosted runners exhibit inconsistent and error-prone behavior, leading to a disrupted CI/CD pipeline.

In summary, our self-hosted runners on the Kubernetes cluster are not performing as efficiently or reliably as expected, particularly in comparison to GitHub's hosted runners.

Repository URL

No response

Workflow run URL

No response

YAML workflow

name: Build and Push Docker Image

on:
  workflow_call:

jobs:
  build-and-push-image:
    runs-on: [gha-runner-scale-set]
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Set ECR repository name
        id: set_repo_name
        run: |
            REPO_NAME="${{ github.event.repository.name }}"
            ECR_REPO_NAME="${REPO_NAME//./-}"
            echo "ECR_REPO_NAME=$ECR_REPO_NAME" >> $GITHUB_ENV

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: <ecr>/${{ env.ECR_REPO_NAME }}:${{ github.sha }}
          context: .
          build-args: |
              GITHUB_UN=${{ secrets.GITHUBUSERMAME }}
              GITHUB_PW=${{ secrets.GITHUBPASSWORD }}
          cache-from: type=registry,ref=<ecr>/${{ env.ECR_REPO_NAME }}/cache:dockercache
          cache-to: type=registry,ref=<ecr>/${{ env.ECR_REPO_NAME }}/cache:dockercache,mode=max,image-manifest=true

Workflow logs

No response

BuildKit logs

No response

Additional info

Note also that this job is only ever canceled during the build-and-push step. We use Actions for other things, and those jobs are never canceled without an apparent reason.
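
The workflow in this report calls docker/setup-buildx-action without any inputs, so it does not show how the Kubernetes driver is selected. A minimal sketch of an explicit configuration follows, assuming the driver is chosen through the action's `driver` and `driver-opts` inputs. The namespace and resource values are placeholders and are not taken from the report. The `--debug` flag is one way to get more detailed buildkitd logs when chasing a canceled Solve call.

```yaml
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    # Select the Kubernetes driver explicitly instead of relying on defaults.
    driver: kubernetes
    # Placeholder values (assumptions); adjust to the cluster in use.
    driver-opts: |
      namespace=buildkit
      replicas=1
      requests.cpu=1
      requests.memory=2Gi
    # Request debug-level buildkitd logs. Note that this replaces the
    # action's default buildkitd flags.
    buildkitd-flags: --debug
```

With debug logging on, the BuildKit pod logs around the time of the failure may show which side closed the connection first.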

acrogenesis commented 4 months ago

We get the same problem in our ARM64 self-hosted workflows, although our k8s cluster is not behind a VPN.

elocke commented 2 months ago

We're seeing the same issues, both with and without buildx. We can't pinpoint an exact cause. We're on AWS behind a VPC, transit gateway, etc., but no VPN. Platform: amd64.

0xLE commented 2 months ago

Try specifying the builder explicitly:

- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  id: builder

- name: Build and push
  uses: docker/build-push-action@v6
  with:
    # ...
    builder: ${{ steps.builder.outputs.name }}

andresrsanchez commented 2 months ago

Same issue here, solved with a retry step :(
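
The comment does not say how the retry was implemented. One possible shape, sketched here as an assumption rather than the commenter's actual setup, is to let the first attempt fail without failing the job and to run a second attempt only when the first one failed. The step names and the `build` id are hypothetical. The inputs mirror the workflow from the original report.

```yaml
- name: Build and push (first attempt)
  id: build
  # Do not fail the job yet if this attempt is canceled or errors out.
  continue-on-error: true
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: <ecr>/${{ env.ECR_REPO_NAME }}:${{ github.sha }}

- name: Build and push (retry)
  # outcome reflects the real result before continue-on-error is applied.
  if: steps.build.outcome == 'failure'
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: <ecr>/${{ env.ECR_REPO_NAME }}:${{ github.sha }}
```

This hides the intermittent failure rather than fixing it, and a second failure still fails the job because the retry step does not set `continue-on-error`.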

davhdavh commented 1 month ago

Same problem with a self-hosted Windows build runner sending context to a Linux buildkitd on the same LAN.
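
For readers unfamiliar with this kind of setup, a runner can be pointed at an already-running buildkitd through the action's `remote` driver. The sketch below is an assumption about what such a configuration might look like, not the commenter's actual workflow. The endpoint host and port are hypothetical.

```yaml
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    # Connect to an existing buildkitd instead of starting one locally.
    driver: remote
    # Hypothetical address of the Linux buildkitd on the LAN.
    endpoint: tcp://buildkitd.example.internal:1234
```

In this arrangement the build context is streamed from the runner to buildkitd over the network, which is the transfer the comment describes as failing.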