argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

argo-server crashes with `Error: issuer empty` after upgrade from 3.4.7 to 3.4.8 #11204

Closed by yonirab 1 year ago

yonirab commented 1 year ago

What happened/what you expected to happen?

We have been running Argo Workflows on GKE for several years. After upgrading from 3.4.7 to 3.4.8, the argo-server crashes with `Error: issuer empty`.

We run our argo-server with the following args and env:

      - args:
        - server
        - --auth-mode
        - sso
        - --auth-mode
        - client
        - --secure=false
        env:
        - name: BASE_HREF
          value: /
        - name: POD_NAMES
          value: v1

Version

v3.4.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

N/A

Logs from the workflow controller

Here are logs from our crashing argo-server

ERROR 2023-06-11T07:57:28.228850757Z [resource.labels.containerName: argo-server] time="2023-06-11T07:57:28.228Z" level=info msg="not enabling pprof debug endpoints"
ERROR 2023-06-11T07:57:28.229677405Z [resource.labels.containerName: argo-server] time="2023-06-11T07:57:28.229Z" level=info authModes="[sso client]" baseHRef=/ managedNamespace= namespace=argo secure=false ssoNamespace=argo
ERROR 2023-06-11T07:57:28.229696887Z [resource.labels.containerName: argo-server] time="2023-06-11T07:57:28.229Z" level=warning msg="You are running in insecure mode. Learn how to enable transport layer security: https://argoproj.github.io/argo-workflows/tls/"
ERROR 2023-06-11T07:57:28.238877373Z [resource.labels.containerName: argo-server] Error: issuer empty
ERROR 2023-06-11T07:57:28.239439677Z [resource.labels.containerName: argo-server] Usage:
INFO 2023-06-11T07:57:28.239441793Z [resource.labels.containerName: argo-server] issuer empty
ERROR 2023-06-11T07:57:28.239458895Z [resource.labels.containerName: argo-server] argo server [flags]
ERROR 2023-06-11T07:57:28.239463608Z [resource.labels.containerName: argo-server] {}
ERROR 2023-06-11T07:57:28.239467127Z [resource.labels.containerName: argo-server] Examples:
ERROR 2023-06-11T07:57:28.239471931Z [resource.labels.containerName: argo-server] {}
ERROR 2023-06-11T07:57:28.239476081Z [resource.labels.containerName: argo-server] See https://argoproj.github.io/argo-workflows/argo-server/
ERROR 2023-06-11T07:57:28.239479428Z [resource.labels.containerName: argo-server] {}
ERROR 2023-06-11T07:57:28.239482656Z [resource.labels.containerName: argo-server] Flags:
ERROR 2023-06-11T07:57:28.239486665Z [resource.labels.containerName: argo-server] --access-control-allow-origin string Set Access-Control-Allow-Origin header in HTTP responses.
ERROR 2023-06-11T07:57:28.239516109Z [resource.labels.containerName: argo-server] --allowed-link-protocol stringArray Allowed link protocol in configMap. Used if the allowed configMap links protocol are different from http,https. Defaults to the environment variable ALLOWED_LINK_PROTOCOL (default [http,https])
ERROR 2023-06-11T07:57:28.239525077Z [resource.labels.containerName: argo-server] --api-rate-limit uint Set limit per IP for api ratelimiter (default 1000)
ERROR 2023-06-11T07:57:28.239531696Z [resource.labels.containerName: argo-server] --auth-mode stringArray API server authentication mode. Any 1 or more length permutation of: client,server,sso (default [client])
ERROR 2023-06-11T07:57:28.239537789Z [resource.labels.containerName: argo-server] --basehref string Value for base href in index.html. Used if the server is running behind reverse proxy under subpath different from /. Defaults to the environment variable BASE_HREF. (default "/")
ERROR 2023-06-11T07:57:28.239543830Z [resource.labels.containerName: argo-server] -b, --browser enable automatic launching of the browser [local mode]
ERROR 2023-06-11T07:57:28.239547536Z [resource.labels.containerName: argo-server] --configmap string Name of K8s configmap to retrieve workflow controller configuration (default "workflow-controller-configmap")
ERROR 2023-06-11T07:57:28.239551373Z [resource.labels.containerName: argo-server] --event-async-dispatch dispatch event async
ERROR 2023-06-11T07:57:28.239555098Z [resource.labels.containerName: argo-server] --event-operation-queue-size int how many events operations that can be queued at once (default 16)
ERROR 2023-06-11T07:57:28.239558526Z [resource.labels.containerName: argo-server] --event-worker-count int how many event workers to run (default 4)
ERROR 2023-06-11T07:57:28.239562192Z [resource.labels.containerName: argo-server] -h, --help help for server
ERROR 2023-06-11T07:57:28.239568595Z [resource.labels.containerName: argo-server] --hsts Whether or not we should add a HTTP Secure Transport Security header. This only has effect if secure is enabled. (default true)
ERROR 2023-06-11T07:57:28.239572125Z [resource.labels.containerName: argo-server] --kube-api-burst int Burst to use while talking with kube-apiserver. (default 30)
ERROR 2023-06-11T07:57:28.239575524Z [resource.labels.containerName: argo-server] --kube-api-qps float32 QPS to use while talking with kube-apiserver. (default 20)
ERROR 2023-06-11T07:57:28.239578740Z [resource.labels.containerName: argo-server] --log-format string The formatter to use for logs. One of: text|json (default "text")
ERROR 2023-06-11T07:57:28.239582429Z [resource.labels.containerName: argo-server] --managed-namespace string namespace that watches, default to the installation namespace
ERROR 2023-06-11T07:57:28.239586419Z [resource.labels.containerName: argo-server] --namespaced run as namespaced mode
ERROR 2023-06-11T07:57:28.239590201Z [resource.labels.containerName: argo-server] -p, --port int Port to listen on (default 2746)
ERROR 2023-06-11T07:57:28.239594437Z [resource.labels.containerName: argo-server] --tls-certificate-secret-name string The name of a Kubernetes secret that contains the server certificates
ERROR 2023-06-11T07:57:28.239598108Z [resource.labels.containerName: argo-server] --x-frame-options string Set X-Frame-Options header in HTTP responses. (default "DENY")
ERROR 2023-06-11T07:57:28.239618971Z [resource.labels.containerName: argo-server] {}
ERROR 2023-06-11T07:57:28.239622730Z [resource.labels.containerName: argo-server] Global Flags:
ERROR 2023-06-11T07:57:28.239626135Z [resource.labels.containerName: argo-server] --argo-base-href string An path to use with HTTP client (e.g. due to BASE_HREF). Defaults to the ARGO_BASE_HREF environment variable.
ERROR 2023-06-11T07:57:28.239629505Z [resource.labels.containerName: argo-server] --argo-http1 If true, use the HTTP client. Defaults to the ARGO_HTTP1 environment variable.
ERROR 2023-06-11T07:57:28.239633449Z [resource.labels.containerName: argo-server] -s, --argo-server host:port API server host:port. e.g. localhost:2746. Defaults to the ARGO_SERVER environment variable.
ERROR 2023-06-11T07:57:28.239636760Z [resource.labels.containerName: argo-server] --as string Username to impersonate for the operation
ERROR 2023-06-11T07:57:28.239640254Z [resource.labels.containerName: argo-server] --as-group stringArray Group to impersonate for the operation, this flag can be repeated to specify multiple groups.
ERROR 2023-06-11T07:57:28.239643741Z [resource.labels.containerName: argo-server] --as-uid string UID to impersonate for the operation
ERROR 2023-06-11T07:57:28.239647186Z [resource.labels.containerName: argo-server] --certificate-authority string Path to a cert file for the certificate authority
ERROR 2023-06-11T07:57:28.239650598Z [resource.labels.containerName: argo-server] --client-certificate string Path to a client certificate file for TLS
ERROR 2023-06-11T07:57:28.239653885Z [resource.labels.containerName: argo-server] --client-key string Path to a client key file for TLS
ERROR 2023-06-11T07:57:28.239657317Z [resource.labels.containerName: argo-server] --cluster string The name of the kubeconfig cluster to use
ERROR 2023-06-11T07:57:28.239660711Z [resource.labels.containerName: argo-server] --context string The name of the kubeconfig context to use
ERROR 2023-06-11T07:57:28.239663909Z [resource.labels.containerName: argo-server] --gloglevel int Set the glog logging level
ERROR 2023-06-11T07:57:28.239667887Z [resource.labels.containerName: argo-server] -H, --header strings Sets additional header to all requests made by Argo CLI. (Can be repeated multiple times to add multiple headers, also supports comma separated headers) Used only when either ARGO_HTTP1 or --argo-http1 is set to true.
ERROR 2023-06-11T07:57:28.239671292Z [resource.labels.containerName: argo-server] --insecure-skip-tls-verify If true, the server's certificate will not be checked for validity. This will make your HTTPS connections insecure
ERROR 2023-06-11T07:57:28.239697735Z [resource.labels.containerName: argo-server] -k, --insecure-skip-verify If true, the Argo Server's certificate will not be checked for validity. This will make your HTTPS connections insecure. Defaults to the ARGO_INSECURE_SKIP_VERIFY environment variable.
ERROR 2023-06-11T07:57:28.239704217Z [resource.labels.containerName: argo-server] --instanceid string submit with a specific controller's instance id label. Default to the ARGO_INSTANCEID environment variable.
ERROR 2023-06-11T07:57:28.239709360Z [resource.labels.containerName: argo-server] --kubeconfig string Path to a kube config. Only required if out-of-cluster
ERROR 2023-06-11T07:57:28.239714593Z [resource.labels.containerName: argo-server] --loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
ERROR 2023-06-11T07:57:28.239720042Z [resource.labels.containerName: argo-server] -n, --namespace string If present, the namespace scope for this CLI request
ERROR 2023-06-11T07:57:28.239724893Z [resource.labels.containerName: argo-server] --password string Password for basic authentication to the API server
ERROR 2023-06-11T07:57:28.239798535Z [resource.labels.containerName: argo-server] --proxy-url string If provided, this URL will be used to connect via proxy
ERROR 2023-06-11T07:57:28.239808469Z [resource.labels.containerName: argo-server] --request-timeout string The length of time to wait before giving up on a single server request. Non-zero values should contain a corresponding time unit (e.g. 1s, 2m, 3h). A value of zero means don't timeout requests. (default "0")
ERROR 2023-06-11T07:57:28.239812779Z [resource.labels.containerName: argo-server] -e, --secure Whether or not the server is using TLS with the Argo Server. Defaults to the ARGO_SECURE environment variable. (default true)
ERROR 2023-06-11T07:57:28.239816174Z [resource.labels.containerName: argo-server] --server string The address and port of the Kubernetes API server
ERROR 2023-06-11T07:57:28.239830127Z [resource.labels.containerName: argo-server] --tls-server-name string If provided, this name will be used to validate server certificate. If this is not provided, hostname used to contact the server is used.
ERROR 2023-06-11T07:57:28.239833483Z [resource.labels.containerName: argo-server] --token string Bearer token for authentication to the API server
ERROR 2023-06-11T07:57:28.239837145Z [resource.labels.containerName: argo-server] --user string The name of the kubeconfig user to use
ERROR 2023-06-11T07:57:28.239840600Z [resource.labels.containerName: argo-server] --username string Username for basic authentication to the API server
ERROR 2023-06-11T07:57:28.239844113Z [resource.labels.containerName: argo-server] -v, --verbose Enabled verbose logging, i.e. --loglevel debug
ERROR 2023-06-11T07:57:28.239847408Z [resource.labels.containerName: argo-server] {}

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

yonirab commented 1 year ago

It looks like the upgrade to v3.4.8 was somehow causing all the data in my workflow-controller-configmap to get wiped!!! And hence there was no issuer found for my SSO configuration.

I saw this happen on 2 attempts to upgrade from 3.4.7 to 3.4.8. Both times all the data in my workflow-controller-configmap got wiped!!!

After reconfiguring the workflow-controller-configmap, the 3.4.8 argo-server finally agreed to start up without errors.

Does this make any sense??? Why on earth would an upgrade from 3.4.7 to 3.4.8 cause the data in my workflow-controller-configmap to get wiped? Is there something I need to do to prevent that?
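
For reference, the argo-server reads its SSO settings from the `sso` key of this ConfigMap, so an emptied ConfigMap would plausibly leave the issuer unset and trigger `issuer empty`. A minimal sketch of that section follows. The issuer URL, redirect URL, and secret names are placeholders rather than values from this cluster, and the exact fields should be checked against the SSO documentation for your version.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap   # default name, per the --configmap flag
  namespace: argo                       # namespace shown in the server logs
data:
  sso: |
    # Placeholder OIDC issuer; if this is missing or empty,
    # the server is expected to fail with "issuer empty".
    issuer: https://issuer.example.com
    clientId:
      name: argo-server-sso             # hypothetical Secret name
      key: client-id
    clientSecret:
      name: argo-server-sso             # hypothetical Secret name
      key: client-secret
    redirectUrl: https://argo.example.com/oauth2/callback
```

Saving a copy of the ConfigMap before upgrading makes it easy to confirm whether the `sso` block survived the upgrade.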

tico24 commented 1 year ago

For context there's a slack discussion here: https://cloud-native.slack.com/archives/C01QW9QSSSK/p1686479990177909

erkerb4 commented 1 year ago

I had a similar experience, but the culprit (at least in my case) was this setting: https://github.com/argoproj/argo-helm/blob/main/charts/argo-workflows/values.yaml#L628 . I misinterpreted this comment: "## SSO is activated by adding --auth-mode=sso to the server command line."

For the longest time, I've had `server.sso.enabled: false` because I had the `extraArgs` configured. Since the latest release, the chart will not render the SSO settings into the configmap unless you have `server.sso.enabled: true`. After updating this value, everything worked OK.
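
As a rough illustration of the combination described above, a Helm values fragment might look like the following. This is a sketch, assuming the argo-workflows chart layout: only `server.sso.enabled` and the `--auth-mode` arguments come from this thread, while the remaining keys, secret names, and URLs are placeholders to be verified against the `values.yaml` of your chart version.

```yaml
server:
  extraArgs:
    # Enables SSO on the server command line; on its own this does not
    # cause the chart to render the SSO block into the configmap.
    - --auth-mode=sso
    - --auth-mode=client
  sso:
    # Must be true for the chart to render the SSO settings.
    enabled: true
    issuer: https://issuer.example.com   # placeholder issuer
    clientId:
      name: argo-server-sso              # hypothetical Secret name
      key: client-id
    clientSecret:
      name: argo-server-sso              # hypothetical Secret name
      key: client-secret
    redirectUrl: https://argo.example.com/oauth2/callback
```

With `enabled: false`, the server would still start in SSO mode because of the extra args, but the rendered configmap would contain no issuer.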

yonirab commented 1 year ago

I tried upgrading a different cluster to 3.4.8 and this time it worked fine. I'm not sure what the problem was in the first cluster, but I guess it must be something cluster-specific, so this issue can be closed as far as I'm concerned.