clepsag opened this issue 1 year ago
We had a similar problem. The cluster was set up zone-redundant, but the storage class used LRS disks, which can only attach to nodes in the same zone. When a new pod came up on a node in another zone, the available PV was not attached and a new one was provisioned; with 3 zones the PVs quickly piled up.
We now schedule the runners in one zone only.
Check the attach-detach controller events for additional information.
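If pinning runners to a single zone isn't desirable, another option on AKS (assuming the Azure Disk CSI driver is in use) is a zone-redundant disk SKU, so that a freed PV can attach to nodes in any zone. A minimal sketch, with the class name and SKU chosen here as assumptions:

```yaml
# Hypothetical StorageClass using a zone-redundant disk SKU so the PV is not
# pinned to one zone; name and parameters are illustrative only.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: var-lib-docker-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_ZRS   # Premium_ZRS is another zone-redundant option
volumeBindingMode: WaitForFirstConsumer
```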
Similar for us. In our case there were already 115 "Available" PVs with the name var-lib-docker. I noticed that new PVs keep getting created even though at most 30 pods request a PV via the claim below. The volume claim template on our RunnerSet resource looks like this:
```yaml
volumeClaimTemplates:
- metadata:
    name: var-lib-docker
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 30Gi
    storageClassName: var-lib-docker
```
PVs keep accumulating in this way. Any idea why this would happen?
I assume this is related to this issue. The problem is that the StatefulSet isn't being scaled; instead, additional StatefulSets are added to the RunnerSet. So when the old StatefulSets are deleted (again, not scaled down), their PVs persist.
For us this seems to happen on Azure AKS when a newly scaled-up pod cannot be scheduled immediately and has to wait for Azure node autoscaling. After AKS adds a new node, the pod is scheduled, but the PVC events then show the following:
```
Normal WaitForFirstConsumer 4m7s persistentvolume-controller waiting for first consumer to be created before binding
Normal WaitForPodScheduled 2m51s (x7 over 4m7s) persistentvolume-controller waiting for pod poc-runnerset-sj9gd-0 to be scheduled
Normal ExternalProvisioning 2m45s persistentvolume-controller waiting for a volume to be created, either by external provisioner "disk.csi.azure.com" or manually created by system administrator
Warning ProvisioningFailed 2m44s (x2 over 2m45s) disk.csi.azure.com_csi-azuredisk-controller-665c6f77c7-wwwqx_2d20d2a4-432d-4c9b-9359-c1fb1961164a failed to provision volume with StorageClass "arc-var-lib-docker": error generating accessibility requirements: no topology key found on CSINode aks-defaultnp-56690459-vmss000008
Normal Provisioning 2m42s (x3 over 2m45s) disk.csi.azure.com_csi-azuredisk-controller-665c6f77c7-wwwqx_2d20d2a4-432d-4c9b-9359-c1fb1961164a External provisioner is provisioning volume for claim "actions-runner-system/var-lib-docker-a-poc-runnerset-sj9gd-0"
Normal ProvisioningSucceeded 2m39s disk.csi.azure.com_csi-azuredisk-controller-665c6f77c7-wwwqx_2d20d2a4-432d-4c9b-9359-c1fb1961164a Successfully provisioned volume pvc-aa32de10-bbcc-41f9-9dda-ae16ce6075d1
```
Even though there are plenty of free PVs, every time right after a node scale-up the first PVC fails to bind to an existing free PV and a new one is created. When the next runner pod is scheduled on this new node, it does manage to bind one of the existing PVs, so it looks like a race condition between the ARC controller pod and the CSI provisioner.
Every time after a node scale-up we end up with one extra PV.
@mhuijgen Did you ever figure out any sort of solution for this? We're running into the exact same issue using EKS with autoscaling and the EBS CSI driver for dynamic PV provisioning.
@benmccown No, unfortunately not. The same thing also occurs occasionally even without node scaling in our tests, making this feature unusable for us at the moment; node scale-up just makes the issue appear more often. It seems to be a race condition between the runner controller trying to link the new PVC to an existing volume and the cluster's dynamic provisioner creating a new PV.
@mhuijgen Thanks for the response. For you (and anyone else who runs into this issue), I think I've come up with the best workaround available at the moment, which is basically to abandon dynamic storage provisioning entirely and use static PVs. I'll provide details on our workaround, but first a brief summary of our setup and use case, in case the ARC maintainers read this.
We are using cluster-autoscaler to manage our EKS autoscaling groups. We have a dedicated node group for our actions runners (I'll use the term CI runners). We use node labels, node taints, and resource requests so that only GitHub CI pods run on the CI runner node group, so each CI pod runs in a 1:1 relationship with nodes (one pod per node). We have 0 set as the minimum autoscaling size for this node group. We're using a RunnerSet in combination with a HorizontalRunnerAutoscaler to deploy the CI pods with ARC. The final piece of the puzzle is that our CI image is rather heavy at 5GB+.
We regularly scale down to zero nodes in periods of inactivity, but we might have a burst of activity where several workflows are kicked off and thus several CI pods are created and scheduled. Cluster autoscaler then scales out our ASG and joins new nodes to the cluster to run the CI workloads. Without any sort of image cache we waste ~2m30s on every single CI job pulling our container image into the dind (docker-in-docker) container within the CI pod. We could set ephemeral: false in our RunnerSet/RunnerDeployment, but that still doesn't solve the autoscaling problem if we scale down and back up. So we really need image caching to work in order to use autoscaling effectively. We accomplish this by mounting ReadWriteOnce PVs (EBS volumes) at /var/lib/docker so that each PV can only be mounted once (since sharing /var/lib/docker is bad practice); a rough sketch of this wiring is shown below.
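For illustration, a RunnerSet wired up this way might look roughly like the following (a minimal sketch assuming the legacy actions.summerwind.dev RunnerSet API; the names, sizes, and the exact container the cache mounts into are assumptions and may differ from the poster's actual manifest):

```yaml
# Hypothetical RunnerSet that keeps the docker layer cache on a per-pod
# ReadWriteOnce PV; field names follow the actions.summerwind.dev/v1alpha1 API.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: ci-runners
spec:
  organization: example-org            # placeholder
  ephemeral: true
  replicas: 3
  dockerdWithinRunnerContainer: true   # dockerd (and /var/lib/docker) lives in the runner container
  selector:
    matchLabels:
      app: ci-runners
  serviceName: ci-runners
  template:
    metadata:
      labels:
        app: ci-runners
    spec:
      containers:
      - name: runner
        volumeMounts:
        - name: var-lib-docker
          mountPath: /var/lib/docker
  volumeClaimTemplates:
  - metadata:
      name: var-lib-docker
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: var-lib-docker
      resources:
        requests:
          storage: 30Gi
```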
The issue we're seeing has already been detailed well by @mhuijgen and is definitely some sort of race condition, as they said. The result (in my testing) is that in a few short days we had 20+ persistent volumes provisioned, and the number kept growing. Aside from the orphaned PVs (and resulting EBS volumes) costing us money, the major downside is that almost 100% of the time when a new CI job is scheduled, a pod is created, and a worker node is created for it (by cluster autoscaler during scale-out), a new PV is created instead of an existing one being reused, which completely defeats the point of an image cache and any of its performance benefits.
The workaround for us is to provision static PVs using Terraform instead of letting the EBS CSI controller manage dynamic volume provisioning.
We're using Terraform to deploy/manage our EKS cluster, EKS node groups, and associated resources (Helm charts and raw k8s resources too). I set up a basic for loop in Terraform that provisions N EBS volumes and static PVs, where N is the maximum autoscaling group size for my CI runners node group. Right now this value is set to 10 (the minimum autoscaling size is 0), so 10 EBS volumes are provisioned and 10 persistent volumes are provisioned and tied to the respective EBS volumes, with a storage class of "manual-github-arc-runners" or something to that effect. There will always be an EBS volume and associated PV for every node, since they're tied to the same max_autoscaling_size var in our infrastructure as code.
This way the CSI controller isn't trying to create dynamic PVs at all and the volumes are always reused. So the race condition is eliminated by removing 1 of the 2 parties participating in the "race".
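For reference, one of these statically provisioned PVs might look roughly like this (a sketch assuming the EBS CSI driver's static-provisioning format; the volume ID, zone, and names are placeholders):

```yaml
# Hypothetical pre-created PV pointing at an existing EBS volume.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: github-arc-runner-cache-0
spec:
  capacity:
    storage: 30Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual-github-arc-runners
  csi:
    driver: ebs.csi.aws.com
    fsType: ext4
    volumeHandle: vol-0123456789abcdef0   # ID of the Terraform-created EBS volume (placeholder)
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.ebs.csi.aws.com/zone
          operator: In
          values:
          - us-east-1a                    # the single AZ the node group runs in (placeholder)
```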
The downsides here are that EBS volumes are availability-zone specific, so I have to put the node group in a single subnet and availability zone. You're also paying for the maximum number of EBS volumes, which is a downside I guess, except that the bug we're running into means you'd eventually end up with WAY more volumes than your max autoscaling size anyway.
I'll probably set up a GH workflow that runs 10 parallel jobs daily to ensure the container image is pulled down and up to date on the PVs.
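A warm-up workflow along those lines could look roughly like this (hypothetical; the schedule, runner labels, matrix size, and image are all placeholders):

```yaml
# Hypothetical daily cache warm-up: run one job per static PV slot so every
# cache volume pulls the current CI image.
name: warm-runner-image-cache
on:
  schedule:
  - cron: "0 5 * * *"
jobs:
  warm:
    runs-on: [self-hosted, ci-runners]   # placeholder runner labels
    strategy:
      matrix:
        slot: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    steps:
    - run: docker pull ghcr.io/example-org/ci-image:latest   # placeholder image
```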
Hope this helps someone in the future.
I think I am hitting the same bug. As far as I can tell, it began after my transition from the built-in EBS provisioner to the EBS CSI provisioner.
For example, using dynamically allocated PV/PVC with a StorageClass that looks like this works correctly (PVs don't build up forever):
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
parameters:
  fsType: ext4
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: false
```
However, a dynamically allocated PV/PVC with a StorageClass that looks like this builds up PVs:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
parameters:
  csi.storage.k8s.io/fstype: xfs
  encrypted: "true"
  type: gp3
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: false
```
I think this issue is related or a duplicate: https://github.com/actions/actions-runner-controller/issues/2266
Here's some info I found.
Like @mhuijgen, I noticed peculiar warnings in our events:
error generating accessibility requirements: no topology key found on CSINode ip-10-1-38-149.eu-west-1.compute.internal
Each of these warnings coincided with the creation of a new volume. Checking the CSINode resource in Kubernetes revealed that the topology keys were set, though:
```yaml
apiVersion: storage.k8s.io/v1
kind: CSINode
# [...]
spec:
  drivers:
  # [...]
  - # [...]
    name: ebs.csi.aws.com
    topologyKeys:
    - topology.ebs.csi.aws.com/zone
```
So I came to the conclusion there is indeed a race condition: somehow, if the CSI node doesn't have the topology keys set at the moment a volume is requested, then a new volume is created, even though there could be plenty available. This explains why this issue only happens with pods scheduled on fresh nodes.
So I've put in place a workaround. In short, it consists of:

- a label and a taint applied to every new runner node when it joins the cluster;
- a DaemonSet targeting this label, whose purpose is to wait for the topology keys on the CSINode to be set, at which point it removes the label and taint, thus allowing runners to schedule and preventing itself from scheduling on the node again.

So far, it's been working great. Our EBS volumes are consistently being reused.
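A minimal sketch of what such a DaemonSet could look like (the label and taint keys, image, and ServiceAccount names are hypothetical, and the ServiceAccount needs RBAC permissions to get csinodes and to label/taint nodes):

```yaml
# Hypothetical DaemonSet that holds runner scheduling until the node's CSINode
# object reports its topology keys, then removes the startup label and taint.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: wait-for-csinode-topology
spec:
  selector:
    matchLabels:
      app: wait-for-csinode-topology
  template:
    metadata:
      labels:
        app: wait-for-csinode-topology
    spec:
      serviceAccountName: wait-for-csinode-topology   # needs get csinodes, patch nodes
      nodeSelector:
        example.com/csi-topology-pending: "true"      # hypothetical startup label
      tolerations:
      - key: example.com/csi-topology-pending         # hypothetical startup taint
        operator: Exists
        effect: NoSchedule
      containers:
      - name: wait
        image: bitnami/kubectl:latest
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        command:
        - /bin/sh
        - -c
        - |
          # Wait until this node's CSINode reports a zone topology key,
          # then drop the startup label and taint so runner pods can schedule.
          until kubectl get csinode "$NODE_NAME" -o jsonpath='{.spec.drivers[*].topologyKeys}' | grep -q zone; do
            sleep 2
          done
          kubectl label node "$NODE_NAME" example.com/csi-topology-pending-
          kubectl taint node "$NODE_NAME" example.com/csi-topology-pending-
          # Removing the label makes the DaemonSet controller evict this pod.
          sleep infinity
```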
I don't know the exact root cause here, but I'm pretty sure it's not ARC's fault. As a matter of fact, it seems someone is able to reproduce the issue here with simple StatefulSets.
Controller Version
0.22.0
Helm Chart Version
v0.7.2
CertManager Version
v1.6.1
Deployment Method
Helm
Describe the bug
I configured volumeClaimTemplates on the RunnerSet with 5 replicas. The volumeClaimTemplates contain 2 persistent volume claims: one for docker and another for gradle. The runners are configured with ephemeral: true.
At the start of the RunnerSet deployment, 10 PVs (5 × 2: one for docker and one for gradle per runner) are created and bound to the runner pods. When a workflow assigned to a runner completes, the runner pod is deleted and a brand-new pod is created and listens for jobs. Unfortunately, the newly created pod does not attach the recently freed, available PVs (from the deleted runner pod); instead it creates a new set of PVs and attaches those.
Over time these redundant PVs accumulate and the cluster runs out of disk space.
Describe the expected behavior
PVs created by the RunnerSet deployment should be reused efficiently.
When a workflow is run and completed on a runner, the newly created replacement pod should attach the recently freed, available PVs from the deleted runner pod.
Additional Context
NA