argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

ApplicationSet w/ SCM Provider generator + GitHub Enterprise maintenance = generated Applications deleted #11464

Open frantjc opened 2 years ago

frantjc commented 2 years ago

Describe the bug

When an ApplicationSet uses the SCM Provider generator pointed at an Organization in GitHub Enterprise and has generated Applications, putting that GitHub Enterprise instance into maintenance mode causes the ApplicationSet to delete all of the Applications it had previously generated.

Presumably this behavior is not specific to GitHub Enterprise: any unexpected error between the SCM Provider generator and its configured backend could result in all of its generated Applications being deleted.

To Reproduce

Expected behavior

When GitHub Enterprise is put into maintenance mode (or enters any other unexpected state, e.g. network issues), I'd expect an ApplicationSet whose SCM Provider generator points at that instance to notice that something is wrong, perhaps mark the ApplicationSet as unhealthy and, most importantly, not delete the Applications it had generated.

Version

I do not have access to the argocd binary in question, but the version shown in the UI is v2.3.3.

crenshaw-dev commented 2 years ago

@frantjc the ApplicationSet controller uses the GHE API to populate the output of the SCM Provider generator. Do you have samples of what the GHE API returns for those API calls when GHE is in maintenance mode?

My hope is that it would return a non-200 response code, and the ApplicationSet controller would refuse to proceed with reconciliation. But it sounds like either a 200 is returned, or the ApplicationSet controller doesn't check the response code.

frantjc commented 2 years ago

Hi @crenshaw-dev! GitHub Enterprise appears to be properly reporting its "error" state when in maintenance mode. I'm posting responses from the "list repositories for organization" endpoint (as GitHub's official npm module @octokit/rest refers to it), as I believe that is the important one in this case. I've modified them slightly to omit info about company, auth, etc.

Normal:

{
    "data": ["I removed the repository objects here for brevity but there are up to 100 repositories here depending on the number of repositories in the organization"],
    "status": 200,
    "url": "https://github.mycorp.com/api/v3/orgs/myorg/repos",
    "headers": {
        "access-control-allow-origin": "*",
        "access-control-expose-headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset",
        "cache-control": "private, max-age=60, s-maxage=60",
        "content-encoding": "gzip",
        "content-security-policy": "default-src 'none'",
        "content-type": "application/json; charset=utf-8",
        "date": "Tue, 29 Nov 2022 17:01:02 GMT",
        "etag": "W/\"2b6804769b3d2d550ccfb9664bb8f2a50049932853940a40c4aa18ca097d12c3\"",
        "link": "<https://github.mycorp.com/api/v3/organizations/39/repos?page=2>; rel=\"next\", <https://github.mycorp.com/api/v3/organizations/39/repos?page=29>; rel=\"last\"",
        "referrer-policy": "origin-when-cross-origin, strict-origin-when-cross-origin",
        "server": "GitHub.com",
        "strict-transport-security": "max-age=31536000; includeSubdomains",
        "transfer-encoding": "chunked",
        "vary": "Accept, Authorization, Cookie, X-GitHub-OTP",
        "x-accepted-oauth-scopes": "",
        "x-content-type-options": "nosniff",
        "x-frame-options": "deny",
        "x-github-enterprise-version": "3.7.0",
        "x-github-media-type": "github.v3; format=json",
        "x-github-request-id": "a1c1ce0c-6e87-4aa3-8e00-36c077555ef1",
        "x-runtime-rack": "0.477439",
        "x-xss-protection": "0"
    }
}

Maintenance:

{
    "data": "I removed HTML from here, can paste screenshot of it rendered if necessary but GitHub currently isn't letting me upload it",
    "url": "https://github.mycorp.com/api/v3/orgs/myorg/repos",
    "status": 503,
    "headers": {
        "content-length": "702301",
        "content-type": "text/html",
        "date": "Tue, 29 Nov 2022 17:17:07 GMT",
        "etag": "6372c72f-ab75d",
        "server": "GitHub.com"
    }
}

crenshaw-dev commented 2 years ago

Thanks! I bet this needs to check the response status. I'm guessing the GitHub client doesn't return an error for a non-200 response code. https://github.com/argoproj/argo-cd/blob/362abff610d81a4878e53cecb78dcb2902776f5b/applicationset/services/scm_provider/github.go#L48

frantjc commented 2 years ago

I suspected the same, though I was looking here: https://github.com/argoproj/argo-cd/blob/362abff610d81a4878e53cecb78dcb2902776f5b/applicationset/services/scm_provider/github.go#L73

I see that this function call appears to return information about the request beyond the parsed body (the resp variable). Perhaps that contains the HTTP status code, which could be checked?
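
To illustrate the kind of check being discussed, here is a rough sketch using only Go's standard `net/http` package rather than the GitHub client library used in the linked file. `listRepos`, `repo`, and `maintenanceServer` are hypothetical names, and the test server merely imitates the 503 `text/html` maintenance response posted earlier in this thread. Whether the real client already surfaces such responses as errors is not something this sketch settles.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// repo is a trimmed-down, hypothetical view of a repository object.
type repo struct {
	Name string `json:"name"`
}

// listRepos fetches a repository list and refuses to treat anything
// other than a 200 response as repository data.
func listRepos(client *http.Client, url string) ([]repo, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, fmt.Errorf("listing repositories: %w", err)
	}
	defer resp.Body.Close()

	// A maintenance page arrives as 503 with an HTML body, so stop here
	// instead of decoding it as an (empty) repository list.
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("listing repositories: unexpected status %d (content-type %q)",
			resp.StatusCode, resp.Header.Get("Content-Type"))
	}

	var repos []repo
	if err := json.NewDecoder(resp.Body).Decode(&repos); err != nil {
		return nil, fmt.Errorf("decoding repository list: %w", err)
	}
	return repos, nil
}

// maintenanceServer imitates a backend in maintenance mode:
// status 503 with an HTML body.
func maintenanceServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html")
		w.WriteHeader(http.StatusServiceUnavailable)
		fmt.Fprint(w, "<html>maintenance</html>")
	}))
}

func main() {
	srv := maintenanceServer()
	defer srv.Close()

	repos, err := listRepos(srv.Client(), srv.URL)
	fmt.Println("repos:", len(repos), "error:", err)
}
```

The caller would then need to treat the returned error as "unknown state" and skip deletion, rather than as "zero repositories".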

ciiay commented 1 year ago

Hi @crenshaw-dev, any updates on this issue? We have a customer who has also run into the same issue: an ApplicationSet with the SCM Provider generator is not working, while other types of ApplicationSet are working.