actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0
4.57k stars 1.08k forks

ReadWriteMany volumes do not work on kubernetes mode #3673

Open flbb opened 1 month ago

flbb commented 1 month ago

Checks

Controller Version

0.9.3

Deployment Method

Helm

To Reproduce

1. Run any action with a ReadWriteMany work volume on an Azure Kubernetes Service (AKS) cluster
2. The action fails during container initialization

Describe the bug

When using a ReadWriteMany volume as the work volume in kubernetes mode on an Azure Kubernetes Service cluster, the action fails with:

Run '/home/runner/k8s/index.js'
  shell: /home/runner/externals/node16/bin/node {0}
Error: Error: EPERM: operation not permitted, chmod '/home/runner/_work/externals/node16/bin'
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.

When the volume's StorageClass is configured with uid=1001 and gid=1001 in mountOptions, the action never starts and idles endlessly.
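For reference, the uid/gid variant mentioned above would look roughly like this (a sketch; the option names follow the Azure Files CSI SMB mount options, and 1001 matches the runner image's default user). Note that with SMB/CIFS mounts, ownership and modes are fixed at mount time, so chmod calls from the hook still cannot change them; this variant only changes the apparent owner:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: aks-runner-sc-uid   # hypothetical name
provisioner: file.csi.azure.com
parameters:
  skuName: Standard_LRS
reclaimPolicy: Delete
mountOptions:
  - mfsymlinks
  - actimeo=30
  - dir_mode=0777
  - file_mode=0777
  - uid=1001   # pin ownership to the runner user (assumption: runner runs as 1001)
  - gid=1001
allowVolumeExpansion: true
volumeBindingMode: Immediate
```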

Describe the expected behavior

Expected behavior would be for the job to print hello with the following workflow:

name: Testing
run-name: Testing
on:
  pull_request:
    branches:
      - master

permissions:
  contents: read
  id-token: write

jobs:
  testing:
    name: 'testing'
    runs-on: aks-runners-testing
    container:
      image: ghcr.io/actions/actions-runner:latest
    steps:
      - name: Test
        run: |
          echo 'hello'
        shell: bash

Additional Context

Values for the runner-scale-set:

gha-runner-scale-set:
  githubConfigUrl: "---"
  githubConfigSecret: secret
  maxRunners: 30
  minRunners: 0
  containerMode:
    type: kubernetes
  runnerGroup: "aks-runners-group"
  runnerScaleSetName: "aks-runners-testing"
  listenerTemplate:
    spec:
      containers:
      - name: listener
        securityContext:
          runAsUser: 1000
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"
    spec:
      containers:
        - name: runner
          image: ghcr.io/actions/actions-runner:latest
          imagePullPolicy: Always
          command: ["/home/runner/run.sh"]
          env:
            - name: ACTIONS_RUNNER_CONTAINER_HOOKS
              value: /home/runner/k8s/index.js
            - name: ACTIONS_RUNNER_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
              value: "false"
            - name: ACTIONS_RUNNER_USE_KUBE_SCHEDULER
              value: "true"
          volumeMounts:
            - name: work
              mountPath: /home/runner/_work
      volumes:
        - name: work
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: [ "ReadWriteMany" ]
                storageClassName: aks-runner-sc
                resources:
                  requests:
                    storage: 1Gi

Storage Class used:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: aks-runner-sc
provisioner: file.csi.azure.com
parameters:
  skuName: Standard_LRS
reclaimPolicy: Delete
mountOptions:
  - mfsymlinks
  - actimeo=30
  - dir_mode=0777
  - file_mode=0777
allowVolumeExpansion: true
volumeBindingMode: Immediate
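Since SMB/CIFS shares do not honor chmod (which is where the EPERM above originates), one avenue worth testing is an NFS-backed Azure Files share, which supports POSIX permissions. A sketch, assuming the Azure Files CSI driver's protocol: nfs parameter; NFS shares require a premium storage account, so the skuName changes, and this is untested against this setup:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: aks-runner-sc-nfs   # hypothetical name
provisioner: file.csi.azure.com
parameters:
  protocol: nfs        # NFS protocol needs a premium file share
  skuName: Premium_LRS
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate
```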

Controller Logs

-

Runner Pod Logs

-
github-actions[bot] commented 1 month ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

gabrik commented 1 month ago

I'm experiencing a similar issue on a self-hosted cluster with NFS persistent volumes for container storage.

The build actually proceeds fine but then stops on the last step with the same error, so I'm following this issue to see if anything here can also help my use case.