actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Issue running 0.9.2. Listener crash restart loop when running inside ArgoCD #3533

Closed. Scalahansolo closed this issue 1 week ago.

Scalahansolo commented 1 month ago

Controller Version

0.9.2

Deployment Method

ArgoCD

To Reproduce

I installed the ARC controller and runner set via Helm through ArgoCD. Once everything came up, the controller would start a listener that got stuck in a crash loop, repeatedly failing GET requests to GitHub.

Describe the bug

When attempting to use the newest 0.9.2 version of the ARC controller / runner sets, the listener being spun up kept failing during startup. The screenshot below shows the error I was encountering. When I downgraded to 0.9.1, the issue went away.

Describe the expected behavior

[screenshot: listener pod logs showing the startup failure]

Additional Context

N/A

Controller Logs

The controller logs had nothing of use in them. The controller would spin up the listener, see that it died, and spin up a new one, over and over, forever.

Runner Pod Logs

N/A

github-actions[bot] commented 1 month ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

nikola-jokic commented 1 month ago

Hey @Scalahansolo,

It seems like something is canceling the context. Is it possible that something is sending a SIGINT to the listener? I tried reproducing the issue, but the listener comes up and stays running. Can you please try again? Maybe it was a temporary error on the service side :confused:

carlos-jt commented 1 month ago

The same issue was happening for me yesterday too, so I rolled back to 0.9.0.

marekaf commented 1 month ago

I had the same issue. After rolling back, one of my three listeners no longer comes back up because of some spec merge patch issue.

Scalahansolo commented 1 month ago

@nikola-jokic I'm not convinced there was anything wrong with my setup. I can try bumping the version again, but the only change I made to get the listener to stay up was reverting to 0.9.1.

clementnero commented 1 month ago

I think the key element here is ArgoCD. The AutoscalingListener and its associated pod should not be managed by ArgoCD but by the controller directly. AutoscalingListener resources are never generated by Helm, so ArgoCD detects them as OutOfSync and tries to remove them, which explains the SIGINT above.

The problem comes from https://github.com/actions/actions-runner-controller/pull/3157 in version 0.9.2, which arbitrarily propagates labels to all created resources, including the tracking label used by ArgoCD. ArgoCD therefore sees resources created by the controller as having been created by itself, and it consequently tries to delete them because there is no associated definition.
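
To illustrate (resource and namespace names here are hypothetical): ArgoCD's default tracking label is app.kubernetes.io/instance, so a controller-created listener ends up carrying something like the sketch below, and ArgoCD prunes it because no such object exists in the Application's rendered manifests.

  # Sketch of an AutoscalingListener as created by the 0.9.2 controller
  apiVersion: actions.github.com/v1alpha1
  kind: AutoscalingListener
  metadata:
    name: arc-runner-set-listener    # hypothetical
    namespace: arc-systems           # hypothetical
    labels:
      # ArgoCD's default tracking label, copied from the parent resource
      # by the propagation from #3157. ArgoCD now believes it created
      # this object and prunes it, which the listener sees as a SIGINT.
      app.kubernetes.io/instance: arc-runner-set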

@nikola-jokic a quick solution would be to revert https://github.com/actions/actions-runner-controller/pull/3157 or add a new setting to disable it.

Scalahansolo commented 1 month ago

That would make sense.

Given that though... the one nice thing about 0.9.2 is that the listener is visible in Argo. For those of us running ArgoCD setups, having K8s assets spun up that aren't tracked in Argo can be pretty confusing, since Argo is where people go to see the state of the world.

clementnero commented 1 month ago

I agree, but ownerReferences should be added for this purpose. Propagating the ArgoCD tracking label is definitely a blocking issue here.
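
For illustration, a minimal sketch (names and UID are hypothetical) of the ownerReferences entry that would express the relationship instead. ArgoCD walks ownerReferences to display child resources in the Application tree without treating them as part of the desired state, so nothing would get pruned.

  apiVersion: actions.github.com/v1alpha1
  kind: AutoscalingListener
  metadata:
    name: arc-runner-set-listener                # hypothetical
    ownerReferences:
    - apiVersion: actions.github.com/v1alpha1
      kind: AutoscalingRunnerSet
      name: arc-runner-set                       # hypothetical parent
      uid: 0e4c2a1d-0000-0000-0000-000000000000  # set by the API server
      controller: true
      blockOwnerDeletion: true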

xaemiphor commented 1 month ago

I picked the wrong day to migrate from the summerwind operator to the gha version; I've been battling this since yesterday, thinking "how did anyone else make this work?" and looking for a toggle in the Helm charts.

I can second the issue: #3157 replicating all labels causes ArgoCD to repeatedly try to reap the unexpected AutoscalingListener, Role, and RoleBinding created by the operator. A temporary workaround is disabling auto-prune on the ArgoCD application (sketched below), but that's not ideal long term.
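
For anyone who needs that stop-gap, it's a single field on the Application; this is stock ArgoCD syncPolicy, not an ARC setting, and the Application name is hypothetical.

  apiVersion: argoproj.io/v1alpha1
  kind: Application
  metadata:
    name: arc-runner-set   # hypothetical
    namespace: argocd
  spec:
    syncPolicy:
      automated:
        prune: false   # stop ArgoCD from reaping the controller-created
                       # AutoscalingListener, Role and RoleBinding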

manveer-ce commented 1 month ago

I also encountered this on a cluster that only runs ARC. I had to roll back to 0.9.1 for it to work.

nikola-jokic commented 1 month ago

Thank you all for doing the investigation! Although we don't currently officially support ArgoCD, we should definitely fix this for the next release.

int128 commented 1 month ago

I tried the ignoreDifferences setting in the Argo CD configuration, but it could not ignore these resources. I'm looking forward to the next release!

xaemiphor commented 1 month ago

@int128 Yeah, unfortunately it's the whole object that is affected, while ignoreDifferences is for ignoring fields of a blessed object.

@nikola-jokic I would argue this isn't an issue of unsupported use of ArgoCD, but that #3157 was a large functional change forced onto users, with strong potential to cause issues with other operators, including operators imposed by k8s vendors. For example, I have some proprietary operators that grant certain internal permissions, auto-inject various sidecars, or even swap some images based on labels that should not be replicated. If ArgoCD hadn't fallen on its face from this change, we would likely have breached some security enforcement unknowingly, and potentially not detected it for quite a while. (Yes, test and evaluate outside of prod; that's not the point of the example.)

Couldn't the issue described in #3157 also have been mostly resolved by simply setting the desired labels on the listenerTemplate podSpec from the Helm chart, as sketched below? It doesn't solve the Role and RoleBinding labels, but that feels like a safer feature to add. I'm all for adding another chart value to toggle this logic on/off, but the current approach feels pretty sledgehammer-y.
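
Roughly what I mean, assuming the chart merges listenerTemplate metadata into the listener pod (I haven't verified the exact merge behavior, and the label key/value are made up):

  # values.yaml for the gha-runner-scale-set chart (sketch)
  listenerTemplate:
    metadata:
      labels:
        my-company/team: platform   # hypothetical label we actually want
    spec:
      containers:
      - name: listener              # stub entry for the template merge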

Edit: I realized I'm not confident about how my message above reads. Other than losing a day and costing my employer a day's salary, I mean no anger or disrespect. I wanted to highlight the non-ArgoCD perspective that ArgoCD acted as a canary for, so I wouldn't really blame ArgoCD for this.

Scalahansolo commented 1 month ago

I agree with @xaemiphor here. This isn't about "supporting ArgoCD". The values files for both the controller and the runner sets should allow adding arbitrary labels to the resources they spin up. Naively taking all the labels from one resource and applying them to child/spawned resources isn't the right answer here.

nikola-jokic commented 1 month ago

Hey @xaemiphor and @Scalahansolo,

I completely understand your point of view, and it does make sense. I will try to explain my position, but this doesn't mean the issue shouldn't be fixed. Labels are usually used so you can search by them. The same reason your proprietary operator grants certain permissions by targeting a label can be the reason someone needs the label to be present on all resources; it is just a different use-case. By allowing label propagation, your operators or internal tooling can target internal resources more precisely. For example, if you have scale sets associated with multiple orgs in your enterprise and you register your scale sets per repository, you may want to apply a unique label to each scale set associated with a single org and have multiple Prometheus scrapers target them by matchLabels (see the sketch below).
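
To make that concrete, here is a hypothetical Prometheus Operator PodMonitor relying on such a propagated per-scale-set label; the label key/value and port name are invented for illustration.

  apiVersion: monitoring.coreos.com/v1
  kind: PodMonitor
  metadata:
    name: org-a-runner-metrics    # hypothetical
  spec:
    selector:
      matchLabels:
        # hypothetical label stamped on one org's scale set; targeting
        # only works if the controller propagates it to child resources
        my-enterprise/org: org-a
    podMetricsEndpoints:
    - port: metrics               # assumes a named metrics port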

I'm not blaming ArgoCD at all; I personally think it is an amazing project, so please don't get me wrong. I'm just saying that since we don't support ArgoCD, I haven't tested on it.

wojciechka commented 1 month ago

I've done some more investigation regarding ArgoCD and found that this only happens when ArgoCD is using labels for tracking resources. It's not an issue when ArgoCD is configured to use annotations for tracking resources.

I think the operator should support ArgoCD with label-based tracking if possible, but I'm sharing this as a viable alternative. Switching from labels to annotations is a pretty smooth experience and does not cause resources to be re-created or anything.

I've also seen Crossplane running into similar issues; crossplane/crossplane#2121 also mentions it, with annotations as the solution to the problem.

More details can be found here: https://argo-cd.readthedocs.io/en/stable/user-guide/resource_tracking/
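
Per that resource-tracking documentation, the switch is a single key in the argocd-cm ConfigMap:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: argocd-cm
    namespace: argocd
    labels:
      app.kubernetes.io/part-of: argocd
  data:
    # track resources by annotation instead of the
    # app.kubernetes.io/instance label, so labels copied by the ARC
    # controller no longer look like ArgoCD ownership
    application.resourceTrackingMethod: annotation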

ironashram commented 1 month ago

Breaking usage with ArgoCD's default configuration seems quite bad, especially since switching to annotations as the tracking method does not entirely solve the issue: the controller remains constantly in an OutOfSync state.

Edit: it turns out that was unrelated to the tracking method; still, it's another thing that started with 0.9.2.

I managed to solve it with the following, if anyone is interested:

  ignoreDifferences:
  - group: "apps"
    kind: "Deployment"
    jsonPointers:
      - /spec/template/spec/containers/0/resources

booleanbetrayal commented 1 month ago

Another ArgoCD user here who had to revert this upgrade. ArgoCD is pretty ubiquitous these days.

vyas-n commented 1 month ago

For what it's worth, I ran into the listener crash issue on ArgoCD with the 0.9.1 version of these Helm charts. I needed to delete my autoscaling runner set altogether and then deploy 0.9.0 (so far, no issues with the older version).

I'm not sure what would be relevant, but if you need logs from a set of pods, I can provide them.

albertollamaso commented 1 month ago

Hi, just wanted to weigh in on this issue. We are using ArgoCD and had to roll back due to the same issue described in this thread.

thomassandslyst commented 1 month ago

This is affecting me too with ArgoCD (though we don't use autosync, so it's more just an annoyance).

As a proper solution, could we at least be given a way to set arbitrary labels and annotations on these created resources? Then I could add an annotation to make ArgoCD ignore them.
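
For reference, these are the stock ArgoCD per-resource annotations that would cover it, if ARC allowed setting them on the created resources (metadata fragment only):

  metadata:
    annotations:
      # don't report the resource as OutOfSync just because it has no
      # definition in Git
      argocd.argoproj.io/compare-options: IgnoreExtraneous
      # never prune it during automated sync
      argocd.argoproj.io/sync-options: Prune=false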

greenkiwi commented 3 weeks ago

We are having the same issue with our ArgoCD instance while trying to set up GH ARC. Hopefully this can be sorted out so we can move forward.

> I agree, but ownerReferences should be added for this purpose. Propagating the ArgoCD tracking label is definitely a blocking issue here.

I agree with @clementnero

greenkiwi commented 3 weeks ago

It looks like someone has a PR up to at least remove the addition of the label:

https://github.com/actions/actions-runner-controller/pull/3575

agologan commented 2 weeks ago

While excluding certain labels would resolve this, we opted to switch to annotation tracking. That's also what the ArgoCD maintainers plan to do in the next major version, since many operators use the same label propagation strategy: https://github.com/argoproj/argo-cd/issues/13981

@wojciechka thank you for pointing us in this direction.