Open cgundy opened 4 months ago
Hey @cgundy,
I am failing to reproduce the issue. I forced the runner internal node version like you specified, and this is the output:
Did you try using the latest runner image? Please let me know :relaxed:
Hi @nikola-jokic, thank you very much for testing it out. Yes, I am using the latest runner image `v2.314.1`. I noticed that the checkout succeeds on a simpler runner setup that does not use a container hook template, but fails on my more complex setup. These are the full container specs I am using:
```yaml
template:
  spec:
    securityContext:
      fsGroup: 1001
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:2.314.1
        imagePullPolicy: IfNotPresent
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "false"
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-templates/custom-config.yaml
          - name: ACTIONS_RUNNER_USE_KUBE_SCHEDULER
            value: "true"
          - name: ACTIONS_RUNNER_PREPARE_JOB_TIMEOUT_SECONDS
            value: "300"
          - name: ACTIONS_RUNNER_FORCED_INTERNAL_NODE_VERSION
            value: node20
        resources:
          requests:
            memory: 1Gi
        volumeMounts:
          - name: pod-templates
            mountPath: /home/runner/pod-templates
            readOnly: true
```
Did you also test on a runner that uses `ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE`? Or do you see anything obviously wrong with my config? Thanks a lot! 🙏
I have not, but it would depend on what the template is, right? The hook template modifies the job pod that you specify, so if the spec for the new pod is invalid, the action would fail. But I'm more worried about `ACTIONS_RUNNER_USE_KUBE_SCHEDULER`. If you are using a `ReadWriteMany` volume, could you check that the hook has permission to read from it? Since the job pod comes up, I assume your hook template is okay, so the problem may be with the `ReadWriteMany` volume, but I'm not sure. If you can't determine the problem, could you please send the extension and the volume spec you are using, so I can try to reproduce it? Thanks!
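To make that permission check concrete, here is a minimal sketch that could be run inside the job container (e.g. via `kubectl exec`). The `/__w` path and the UID are assumptions based on the `fsGroup: 1001` in your template and where the hook typically mounts the work volume; adjust both to your setup:

```shell
# Sketch: verify a directory (e.g. the hook's shared work volume) is
# readable by the user the job container actually runs as.
check_readable() {
  dir="$1"
  ls -ldn "$dir"                      # show numeric owner/group/mode
  if [ -r "$dir" ] && [ -x "$dir" ]; then
    echo "readable: $dir"
  else
    echo "NOT readable: $dir"
  fi
}

# Inside the job pod, something like: check_readable /__w
```

Comparing the numeric owner from `ls -ldn` against `id -u` in the job container should show whether the CephFS mount is applying the expected ownership.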
Hi, thanks for the quick response. I think you're onto something. I tested checkout v4 without `ACTIONS_RUNNER_USE_KUBE_SCHEDULER`, using only `ReadWriteOnce`, and it succeeded. So it seems this is where the issue lies. However, the same setup worked for checkout v3, so I don't understand where the permissions issues would come from, nor do I see anything related to this in the logs.
For completeness, here is my pod template:
```yaml
apiVersion: v1
kind: PodTemplate
metadata:
  labels:
    app: runner-pod-template
spec:
  securityContext:
    runAsUser: 1001
    fsGroup: 1001
  containers:
    - name: $job
      securityContext:
        privileged: true
      volumeMounts:
        - name: var-sysimage
          mountPath: /var/sysimage
        - name: var-tmp
          mountPath: /var/tmp
      resources:
        requests:
          memory: 20Gi
  volumes:
    - name: var-sysimage
      emptyDir:
        medium: Memory
      readOnly: false
    - name: var-tmp
      emptyDir: {}
      readOnly: false
```
And I am using CephFS as the storage class for `ReadWriteMany`:
```yaml
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: cephfs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
  dataPools:
    - name: replicated
      replicated:
        size: 3
  preserveFilesystemOnDelete: false
  metadataServer:
    activeCount: 1
    activeStandby: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: cephfs
  pool: cephfs-replicated
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete
```
I'd rather not change our StorageClass, since it has otherwise been working well with this setup, but I am open to any suggestions or debugging steps I can take.
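One way to test the CephFS `ReadWriteMany` path completely outside the runner would be a throwaway probe like the sketch below. All names (`rwx-probe`, the `busybox` image) are made up for illustration; the `runAsUser`/`fsGroup` values mirror the pod template above. If the write fails here too, the problem is in the storage layer rather than in the hook or checkout:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-probe
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: rook-cephfs
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: rwx-probe
spec:
  securityContext:
    runAsUser: 1001
    fsGroup: 1001
  containers:
    - name: probe
      image: busybox
      # Write and list as UID 1001, the same user the job pod runs as.
      command: ["sh", "-c", "touch /mnt/ok && ls -ln /mnt && sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /mnt
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: rwx-probe
```

Applying this with `kubectl apply -f` and checking the pod logs shows whether UID 1001 can actually write to a fresh `rook-cephfs` RWX volume.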
@nikola-jokic this is still an ongoing issue for us. We've tried using checkout@v3, but now we're in a situation where we need to upgrade. I've checked that the permissions are all correct. If you have any suggestions for debugging steps, please let me know, as the only options we may have left are to stop using the kube scheduler or to move to dind.
Could you share your workflow file? Did you manage to create a reproducible issue? I'm wondering if the node binary we mount is the issue, but I'm not sure. It works for the ubuntu image, so maybe the check for which node build to mount is wrong (we compile node for alpine and mount it into alpine-based containers).
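To narrow down whether the wrong node build is being mounted, a quick sketch like this could be run inside the failing job container. The `/__e` externals path is an assumption about where the hook places the mounted node; adjust it to what you actually see in the pod:

```shell
# Sketch: detect whether the job image is musl-based (alpine-style)
# or glibc-based, since the hook picks a node build accordingly.
check_libc() {
  if ldd /bin/sh 2>/dev/null | grep -q musl; then
    echo "musl (alpine-style image)"
  else
    echo "glibc"
  fi
}

check_libc
# Then try the mounted binary directly, e.g.:
#   /__e/node20/bin/node --version
# If the libc and the node build don't match, this is where it fails.
```

If the mounted node exits with a loader error (e.g. a missing `ld-musl`/`ld-linux` interpreter), that would confirm the build-selection theory.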
When trying to upgrade the GitHub `checkout` action from v3 to v4 using self-hosted runners with Kubernetes mode, I consistently get the following error:

I've tried upgrading the internal runner node version from 16 to 20 using:

But I still see the same error. I believe this is a somewhat urgent issue, as GitHub Actions won't support node16 after Spring 2024 anymore (post) and we will need to upgrade the `checkout` action from v3 to v4.

Thank you!