argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0
18.06k stars 5.52k forks

Dex error with google authentication, fixed by restart #9091

Open praveenperera opened 2 years ago

praveenperera commented 2 years ago

Checklist:

Describe the bug

At random, the Dex server stops responding. When you try to log in with Google, the web page shows this error:

Failed to query provider "https://argocd.staging.mysite.com/api/dex": Get "http://argocd-dex-server:5556/api/dex/.well-known/openid-configuration": dial tcp 10.103.241.181:5556: connect: connection refused

In the Dex pod logs:

failed to initialize server: server: Failed to open connector google: failed to open connector: failed to create connector google: failed to get provider: Get "https://accounts.google.com/.well-known/openid-configuration": dial tcp: i/o timeout
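Both errors come from the same OpenID Connect discovery step: a client takes an issuer URL, appends /.well-known/openid-configuration, and fetches that JSON document. Here Argo CD fetches Dex's document (at the internal argocd-dex-server address, per the first error), and Dex fetches Google's at startup (per the pod log). The Python sketch below illustrates that step using only the standard library; discovery_url, check_issuer and fetch_provider are hypothetical names, and this is not Dex's or Argo CD's actual code. The issuer check follows OpenID Connect Discovery 1.0, which requires the returned issuer to match the one queried.

```python
import json
import urllib.request

WELL_KNOWN_SUFFIX = "/.well-known/openid-configuration"


def discovery_url(issuer: str) -> str:
    """Build the OIDC discovery URL for an issuer (trailing slash removed first)."""
    return issuer.rstrip("/") + WELL_KNOWN_SUFFIX


def check_issuer(issuer: str, document: dict) -> None:
    """Raise if the discovery document advertises a different issuer."""
    advertised = document.get("issuer")
    if advertised != issuer:
        raise ValueError(f"issuer mismatch: expected {issuer!r}, got {advertised!r}")


def fetch_provider(issuer: str, timeout_seconds: float = 10.0) -> dict:
    """Fetch and validate an issuer's discovery document.

    Network failures (DNS errors, "connection refused", i/o timeouts) surface
    here as OSError, the same class of failure reported in the logs above.
    """
    with urllib.request.urlopen(discovery_url(issuer), timeout=timeout_seconds) as resp:
        document = json.load(resp)
    check_issuer(issuer, document)
    return document
```

If this fetch fails while the node's network or DNS is not yet ready, the caller has nothing to fall back on, which is consistent with the thread's reports that a later restart (once the network is up) clears the error.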

To Reproduce

It happens randomly, but to reproduce: set up Argo CD with Google authentication and wait.

Version

argocd: v2.3.3+07ac038.dirty
  BuildDate: 2022-03-30T05:14:36Z
  GitCommit: 07ac038a8f97a93b401e824550f0505400a8c84e
  GitTreeState: dirty
  GoVersion: go1.18
  Compiler: gc
  Platform: darwin/arm64
argocd-server: v2.3.1+b65c169
  BuildDate: 2022-03-10T22:51:09Z
  GitCommit: b65c1699fa2a2daa031483a3890e6911eac69068
  GitTreeState: clean
  GoVersion: go1.17.6
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: v4.4.1 2021-11-11T23:36:27Z
  Helm Version: v3.8.0+gd141386
  Kubectl Version: v0.23.1
  Jsonnet Version: v0.18.0

v2.3.1+b65c169, but I've also seen it in older versions.

[Screenshot: 2022-04-13, 10:30:35 AM]

Restarting the Argo CD Dex workload fixes it, but the error comes back after some time. My current fix has been to set up a liveness probe. I can open a PR.

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/dex/.well-known/openid-configuration
            port: 5556
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 2
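
For context on what this probe buys: with httpGet, the kubelet counts any HTTP status from 200 to 399 as success, treats a refused connection or a response slower than timeoutSeconds as failure, and restarts the container after failureThreshold consecutive failures (here 3 probes at periodSeconds: 5, so roughly 15 seconds). The Python sketch below imitates that logic against the same discovery path; it is an illustration under those assumptions, not kubelet code, and http_probe / would_restart are hypothetical names.

```python
import urllib.error
import urllib.request


def http_probe(url: str, timeout_seconds: float = 2.0) -> bool:
    """One probe attempt: success means an HTTP status in [200, 400)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
            return 200 <= resp.status < 400
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for non-2xx responses; judge by its code.
        return 200 <= err.code < 400
    except OSError:
        # Connection refused, DNS failure, or timeout: the probe fails.
        return False


def would_restart(results: list[bool], failure_threshold: int = 3) -> bool:
    """True once `failure_threshold` consecutive probe attempts have failed."""
    consecutive_failures = 0
    for ok in results:
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            return True
    return False
```

Note that the kubelet sends the request to the pod IP directly rather than through the argocd-dex-server Service, so when calling http_probe by hand, point it at the pod's address on port 5556. The probe only detects the "connection refused" state; it relies on a restart clearing it, as the thread reports.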
sarasensible commented 2 years ago

I also ran into this; the liveness probe didn't seem to help.

Failed to query provider "https://argocd.mysite.com/api/dex": Get "http://argo-cd-argocd-dex-server:5556/api/dex/.well-known/openid-configuration": dial tcp 172.20.233.60:5556: connect: connection refused
praveenperera commented 2 years ago

@sarasensible, it's surprising that the liveness probe didn't work. Could you double-check the configuration?

sarasensible commented 2 years ago

Yeah, for some reason my Argo CD chart isn't picking up any changes I make to the Dex configuration, so the probe may well work and just isn't being deployed. Very confusing.

Update: confirmed this was an issue with how I was deploying; the dumb mistake is recorded for all time in https://github.com/helm/helm/issues/10880#issuecomment-1106722961

eformat commented 2 years ago

I got this as well on OpenShift, specifically when the cluster had been stopped and then restarted:

time="2022-07-22T09:08:02Z" level=info msg="config refresh tokens rotation enabled: true"
failed to initialize server: server: Failed to open connector openshift: failed to open connector: failed to create connector openshift: failed to query OpenShift endpoint Get "https://kubernetes.default.svc/.well-known/oauth-authorization-server": dial tcp: i/o timeout

Restarting the Dex pod fixes it... ah yes, there's no liveness check on the Deployment. Adding one will help!

alexmt commented 2 years ago

Also reproduced it :(

sarasensible commented 1 year ago

This happened to me again on upgrading to chart version 5.20.4 (app version 2.6.1), because somewhere along the line extraVolumes and extraVolumeMounts were renamed to volumes and volumeMounts in the Helm chart. I fixed it by renaming the fields in my Helm config.

lukma99 commented 1 year ago

Same problem: it happened twice in the last month, running the latest Argo CD version (currently 2.6.2) with the latest Helm chart. Restarting Dex fixed it both times. Apart from those two incidents, Google login works perfectly.

Last log message from dex:

failed to initialize server: server: Failed to open connector google: failed to open connector: failed to create connector google: failed to get provider: Get "https://accounts.google.com/.well-known/openid-configuration": dial tcp: lookup accounts.google.com on IP_ADDRESS: read udp IP_ADDRESS:PORT->IP_ADDRESS:PORT: read: connection refused
GottZ commented 1 year ago

Same issue here in 2023, but with GitHub auth. Still relevant.

HackerM0nk commented 1 year ago

Providing the Dex config as shown below, then updating the argocd-cm ConfigMap and restarting the argocd-dex-server-xyz123 pod, worked like a charm for me.

Doc link: https://argo-cd.readthedocs.io/en/stable/operator-manual/user-management/google/#openid-connect-plus-google-groups-using-dex

    dex.config: |
      connectors:
      - config:
          issuer: https://accounts.google.com
          redirectURI: https://argocd.example.com/api/dex/callback
          clientID: abc-xys.apps.googleusercontent.com
          clientSecret: abc-XYZ_123
          serviceAccountFilePath: /tmp/oidc/googleAuth.json
          adminEmail: name@example.com
        type: oidc
        id: google
        name: Google
de-mkolovic commented 1 year ago

I observed this behavior after upgrading from Kubernetes 1.26 to 1.27. The upgrade process restarted all pods, but afterwards the Dex deployment still had to be rolled again for Google SSO to work.

saikatharryc commented 1 year ago

I've been using this for a while and have noticed that it is reproducible every time a spot node comes up and this pod gets scheduled on it. Every time, I get:

Failed to query provider "https://argocd.<masked>.europe-west3-gcloud.internal.<masked>.io/api/dex": Get "https://argocd-dex-server:5556/api/dex/.well-known/openid-configuration": dial tcp <PRIVATE_IP>:5556: connect: connection refused

This persists until I restart the service; after that, everything works like a charm, as it did when I was using on-demand nodes instead of spot nodes.

mou commented 9 months ago

I can confirm this issue reproduces on spot nodes in GCP with Google SSO.

spoletum commented 8 months ago

Same here using the standard Helm chart. Killing the dex-server pod fixes the problem.

shameemshah commented 8 months ago

Same here using the standard Helm chart. Killing the dex-server pod fixes the problem.

Chickenmarkus commented 7 months ago

The issue still occurs with Helm chart version 6.7.2, which includes v6.3.0, which in turn should include "the fix".

armantur commented 6 months ago

We have the same issue on GCP preemptible (spot) instances with Helm chart version 6.10.2.

zelig81 commented 1 month ago

Same issue here. My solution was the same as proposed above, plus a startupProbe on the Dex /healthz endpoint (path /api/dex/healthz). Here is the Kustomize patch, if you do not use Helm:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-dex-server
spec:
  template:
    spec:
      containers:
        - name: dex
          startupProbe:
            failureThreshold: 3
            httpGet:
              path: /api/dex/healthz
              port: 5556
              scheme: HTTPS
            initialDelaySeconds: 30
            periodSeconds: 30
            successThreshold: 1
            timeoutSeconds: 5
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /api/dex/.well-known/openid-configuration
              port: 5556
              scheme: HTTPS
            initialDelaySeconds: 30
            periodSeconds: 30
            successThreshold: 1
            timeoutSeconds: 5
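
Rough timing for this patch, using the Kubernetes probe semantics: the startup probe gives the container about initialDelaySeconds + failureThreshold × periodSeconds to come up before it is killed, liveness checks only begin once the startup probe has succeeded, and a later outage triggers a restart after at most about failureThreshold × periodSeconds. The back-of-the-envelope Python sketch below works this out; Probe and the two helper functions are hypothetical names, and the figures are approximations that ignore timeoutSeconds and scheduling jitter.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int


def startup_budget_seconds(probe: Probe) -> int:
    """Approximate time a container gets to pass its startup probe."""
    return probe.initial_delay_seconds + probe.failure_threshold * probe.period_seconds


def liveness_detection_seconds(probe: Probe) -> int:
    """Approximate upper bound from an outage starting to the restart decision."""
    return probe.failure_threshold * probe.period_seconds


# Values from the patch above, and from the liveness probe in the original report.
startup = Probe(initial_delay_seconds=30, period_seconds=30, failure_threshold=3)
liveness = Probe(initial_delay_seconds=30, period_seconds=30, failure_threshold=3)
original = Probe(initial_delay_seconds=10, period_seconds=5, failure_threshold=3)

print(startup_budget_seconds(startup))       # 120
print(liveness_detection_seconds(liveness))  # 90
print(liveness_detection_seconds(original))  # 15
```

So with this patch Dex has about two minutes to start, and a later "connection refused" state is caught within roughly 90 seconds, versus about 15 seconds for the probe in the original report. Shorter periods detect the fault sooner at the cost of more frequent requests.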
marin-h commented 1 month ago

Same issue here with Argo CD 2.10.9 on EKS.

qugu commented 4 weeks ago

Looks like the same issue with chart 7.6.12.

aqeelat commented 1 week ago

We had the same issue with 7.6.12, but the fix was to comment out the Dex version in our values.yaml override, which let the chart upgrade Dex to a newer version.