actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

"Reconciler error" when upgrading 0.9.0->0.9.1 #3562

Closed alexgaganashvili closed 5 months ago

alexgaganashvili commented 5 months ago

Checks

Controller Version

0.9.2

Deployment Method

Helm

To Reproduce

2024-05-31T22:07:05Z ERROR AutoscalingRunnerSet Failed to update autoscaling runner set with finalizer added {"version": "0.9.2", "autoscalingrunnerset": {"name":"my-runner-scaleset","namespace":"my-namespace"}, "error": "autoscalingrunnersets.actions.github.com \"my-runner-scaleset\" not found"}
github.com/actions/actions-runner-controller/controllers/actions%2egithub%2ecom.(AutoscalingRunnerSetReconciler).Reconcile
    github.com/actions/actions-runner-controller/controllers/actions.github.com/autoscalingrunnerset_controller.go:182
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Reconcile
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func2.2
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
2024-05-31T22:07:05Z ERROR Reconciler error {"controller": "autoscalingrunnerset", "controllerGroup": "actions.github.com", "controllerKind": "AutoscalingRunnerSet", "AutoscalingRunnerSet": {"name":"my-runner-scaleset","namespace":"my-namespace"}, "namespace": "my-namespace", "name": "my-runner-scaleset", "reconcileID": "ac974287-9a8c-4477-bf87-78713c03104d", "error": "autoscalingrunnersets.actions.github.com \"my-runner-scaleset\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func2.2
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227

Describe the bug

See the "To Reproduce" field. In addition, no listener or runner pods are created (I have minRunners set to 1).

Describe the expected behavior

I should be able to transparently upgrade the controller and the scaleset.

Additional Context

Runner image: ghcr.io/actions/actions-runner:2.316.1
Controller image: ghcr.io/actions/gha-runner-scale-set-controller:0.9.2

Controller Logs

2024-05-31T22:07:05Z ERROR AutoscalingRunnerSet Failed to update autoscaling runner set with finalizer added {"version": "0.9.2", "autoscalingrunnerset": {"name":"my-runner-scaleset","namespace":"my-namespace"}, "error": "autoscalingrunnersets.actions.github.com \"my-runner-scaleset\" not found"}
github.com/actions/actions-runner-controller/controllers/actions%2egithub%2ecom.(AutoscalingRunnerSetReconciler).Reconcile
    github.com/actions/actions-runner-controller/controllers/actions.github.com/autoscalingrunnerset_controller.go:182
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Reconcile
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func2.2
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
2024-05-31T22:07:05Z ERROR Reconciler error {"controller": "autoscalingrunnerset", "controllerGroup": "actions.github.com", "controllerKind": "AutoscalingRunnerSet", "AutoscalingRunnerSet": {"name":"my-runner-scaleset","namespace":"my-namespace"}, "namespace": "my-namespace", "name": "my-runner-scaleset", "reconcileID": "ac974287-9a8c-4477-bf87-78713c03104d", "error": "autoscalingrunnersets.actions.github.com \"my-runner-scaleset\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func2.2
    sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227

Runner Pod Logs

No runner pod gets scheduled, so there are no logs.

github-actions[bot] commented 5 months ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

alexgaganashvili commented 5 months ago

cc: @nikola-jokic

nikola-jokic commented 5 months ago

Hey @alexgaganashvili,

The steps you took to upgrade ARC were wrong. Please follow the guide we documented here.

Basically, you need to uninstall every scale set, wait for the resources to be cleaned up, uninstall the controller, and then install the target version. I don't know how you deployed your application with Argo CD, but it is worth mentioning that the controller and the scale sets should be managed separately.
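
As a rough illustration of the order described above, the following Python sketch only builds the command lines in sequence; it does not run anything. The controller release name `arc`, the namespace `arc-systems`, and the function name `upgrade_plan` are hypothetical. The chart locations are the OCI charts published for ARC, and the `--values`/`--set` arguments a scale set normally needs are omitted, so treat the flags as assumptions to check against the upgrade guide.

```python
from typing import List, Tuple

# Hypothetical release and namespace names; substitute your own.
CONTROLLER_RELEASE = "arc"
CONTROLLER_NAMESPACE = "arc-systems"
CHART_BASE = "oci://ghcr.io/actions/actions-runner-controller-charts"


def upgrade_plan(scale_sets: List[Tuple[str, str]], target_version: str) -> List[List[str]]:
    """Return the ordered commands for an uninstall-then-reinstall upgrade.

    scale_sets is a list of (release_name, namespace) pairs.
    """
    plan: List[List[str]] = []

    # 1. Uninstall every scale set first.
    for release, namespace in scale_sets:
        plan.append(["helm", "uninstall", release, "--namespace", namespace])

    # 2. Check for leftovers; repeat this until nothing is listed.
    plan.append([
        "kubectl", "get", "autoscalingrunnersets.actions.github.com",
        "--all-namespaces",
    ])

    # 3. Uninstall the controller only after the scale sets are gone.
    plan.append([
        "helm", "uninstall", CONTROLLER_RELEASE,
        "--namespace", CONTROLLER_NAMESPACE,
    ])

    # 4. Install the controller, then the scale sets, at the target version.
    plan.append([
        "helm", "install", CONTROLLER_RELEASE,
        "--namespace", CONTROLLER_NAMESPACE, "--create-namespace",
        "--version", target_version,
        f"{CHART_BASE}/gha-runner-scale-set-controller",
    ])
    for release, namespace in scale_sets:
        plan.append([
            "helm", "install", release,
            "--namespace", namespace, "--create-namespace",
            "--version", target_version,
            f"{CHART_BASE}/gha-runner-scale-set",
        ])
    return plan


if __name__ == "__main__":
    for command in upgrade_plan([("my-runner-scaleset", "my-namespace")], "0.9.3"):
        print(" ".join(command))
```

Keeping the controller and the scale sets as separate releases, as the sketch does, matches the advice that the two should be managed separately.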

Closing this issue now, but feel free to comment on it if you need more information :relaxed:

alexgaganashvili commented 5 months ago

Thanks, @nikola-jokic. If I deploy ARC in HA mode in two different clusters and later upgrade it in one of the K8s clusters, will jobs that are executing on runners of the scale set being uninstalled simply be aborted?

nikola-jokic commented 5 months ago

That is a good question. When you uninstall the scale set, we start by removing the listener. The second cluster keeps acquiring jobs and continues to work normally. The cluster from which you are removing the scale set keeps the running ephemeral runners up until they finish, and it kills the runners that are not busy. The controller should never abort runners that are busy.
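
The behavior described above can be modeled in a few lines. This is not the controller's actual code, only a minimal Python sketch of the stated rule; the `EphemeralRunner` class and the `plan_scale_set_removal` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EphemeralRunner:
    name: str
    busy: bool  # True while the runner is executing a job


def plan_scale_set_removal(runners: List[EphemeralRunner]) -> Tuple[List[str], List[str]]:
    """Split runners into (kept until their job finishes, removed right away)."""
    keep = [r.name for r in runners if r.busy]
    remove = [r.name for r in runners if not r.busy]
    return keep, remove


if __name__ == "__main__":
    runners = [
        EphemeralRunner("runner-a", busy=True),
        EphemeralRunner("runner-b", busy=False),
    ]
    keep, remove = plan_scale_set_removal(runners)
    print("kept until finished:", keep)
    print("removed immediately:", remove)
```

Under this model, a job running in the cluster being upgraded finishes normally, while new jobs are picked up by the listener in the other cluster.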

alexgaganashvili commented 5 months ago

Thanks for clarifying that, @nikola-jokic. By the way, when I was upgrading ARC and hit that issue the first time, I did uninstall both the runner controller and the scale set. However, when I installed the new version, I still observed the same error. In light of this, step 2 of the upgrade instructions ("Wait for resources cleanup") makes me wonder whether I should wait longer for all relevant resources to be cleaned up. One such custom resource would be the autoscalingrunnerset, which seems to sit there for a while until I end up setting its finalizers to an empty array. I then decided to uninstall the CRDs and install them again, even though I had diffed the CRDs from the two versions and did not see any changes. That allowed me to upgrade.
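
One way to see whether cleanup has actually finished is to look for resources that are marked for deletion but still held by finalizers. The Python sketch below assumes its input is the JSON printed by `kubectl get autoscalingrunnersets.actions.github.com --all-namespaces -o json`; the function name `stuck_in_deletion` is hypothetical. It only reports what is stuck and does not modify anything.

```python
import json
from typing import List


def stuck_in_deletion(kubectl_json: str) -> List[str]:
    """List objects that have a deletionTimestamp but still carry finalizers."""
    items = json.loads(kubectl_json).get("items", [])
    stuck = []
    for item in items:
        meta = item.get("metadata", {})
        finalizers = meta.get("finalizers") or []
        # An object being deleted stays in the API until its finalizers are cleared.
        if meta.get("deletionTimestamp") and finalizers:
            stuck.append(
                f'{meta.get("namespace")}/{meta.get("name")}: {", ".join(finalizers)}'
            )
    return stuck


if __name__ == "__main__":
    import sys

    # Example: kubectl get ... -o json | python this_script.py
    for line in stuck_in_deletion(sys.stdin.read()):
        print(line)
```

An empty result suggests it is safe to move on to uninstalling the controller. Clearing finalizers by hand skips whatever cleanup the controller would have done, so it is probably best kept as a last resort.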

nikola-jokic commented 5 months ago

Hey @alexgaganashvili,

Yes, that is a known issue that has been fixed with this PR and will be part of the next release. The controller was slow to react to resource deletion events. This can cause some confusing behavior, especially when running in containerMode=kubernetes. Hopefully, after the next release, this issue will be resolved completely :relaxed:

alexgaganashvili commented 5 months ago

@nikola-jokic, I have reinstalled the CRDs and tried installing version 0.9.3 of ARC. After I installed a scale set, the runner reported the same error. I had even deleted the namespaces where I deploy the controller and the scale set beforehand.

alexgaganashvili commented 3 months ago

@nikola-jokic, could this issue be reopened and addressed? I'm running into the same problem with version 0.9.3. Thx.