OpenUnison / openunison-k8s

Access portal for Kubernetes
Apache License 2.0

Orchestra pods keep restarting when their memory utilisation hits the limits #66

Closed vikaspasunuri9 closed 1 year ago

vikaspasunuri9 commented 1 year ago

Please find below the memory utilisation trend for the orchestra pods and their status on the command line.

[screenshots: pod memory utilisation trend and pod status output]
shnigam2 commented 1 year ago

@mlbiam, could you please take a look at this issue?

When we describe the orchestra pods, we see the errors below:

  Normal   Pulling    12m   kubelet            Pulling image "mckinsey-cngccp-docker-k8s.jfrog.io/openunison-k8s-login-oidc:6e2748ab663d4dd1a2f0039278e05decf8adea5135be16cd0dabedd1946076e4"
  Normal   Pulled     11m   kubelet            Successfully pulled image "mckinsey-cngccp-docker-k8s.jfrog.io/openunison-k8s-login-oidc:6e2748ab663d4dd1a2f0039278e05decf8adea5135be16cd0dabedd1946076e4" in 20.304641249s
  Normal   Created    11m   kubelet            Created container openunison-orchestra
  Normal   Started    11m   kubelet            Started container openunison-orchestra
  Warning  Unhealthy  11m   kubelet            Readiness probe failed: Traceback (most recent call last):
  File "/usr/local/openunison/bin/check_alive.py", line 18, in <module>
    res = urllib2.urlopen(url_to_test,context=ctx).read()
  File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1248, in https_open
    context=self._context)
  File "/usr/lib/python2.7/urllib2.py", line 1205, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>
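
The readiness script got "connection refused", i.e. nothing was listening on 127.0.0.1:8443 at that moment. For manual debugging, the same check the probe performs can be reproduced from inside the pod; here is a minimal sketch in Python 3 (the image's check_alive.py uses Python 2 urllib2, and the URL and marker string below are taken from the readiness_probe_command in the values.yaml shared later in this thread):

```python
# Sketch only; assumes it is run from inside the orchestra pod with Python 3.
# Reproduces what check_alive.py does: fetch the OIDC discovery document over
# the pod's local HTTPS port and confirm an expected marker string is present.
import ssl
import urllib.request

url_to_test = "https://127.0.0.1:8443/auth/idp/k8sIdp/.well-known/openid-configuration"
expected = "issuer"  # same marker string the probe is configured with

# The certificate is self-signed inside the pod, so skip verification here.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

body = urllib.request.urlopen(url_to_test, context=ctx).read().decode("utf-8")
print("OK" if expected in body else "marker not found")
```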

Please let us know where we can check for a memory leak, or how to debug this.
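
For reference, a quick way to confirm whether these restarts are memory-limit kills is to look at each container's last terminated state; if the kubelet OOM-killed it, the reason will be OOMKilled. A rough sketch using the official Python client (the namespace and name prefix below are assumptions based on this thread):

```python
# Sketch only; assumes kubeconfig access and that the orchestra pods live in
# the "openunison" namespace with names starting with "openunison-orchestra".
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("openunison").items:
    if not pod.metadata.name.startswith("openunison-orchestra"):
        continue
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated
        if term:
            # reason == "OOMKilled" means the container hit its memory limit
            print(pod.metadata.name, cs.name, term.reason, term.exit_code, term.finished_at)
```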

Regards Shobhit

mlbiam commented 1 year ago
  1. Is the problem continuing to happen? How often?
  2. openunison-k8s-login-oidc has been out of support since 12/31/2022. I'll do my best to help, but if there's a bug that needs to be fixed, it will not go into this container (We supported openunison-k8s-login-oidc for over a year after we moved to openunison-k8s)
  3. How long has openunison been running on this cluster prior to the issues?
  4. What changed prior to the spike in resource utilization?
  5. Are you storing logs? Are there any errors or Exceptions in the logs for orchestra?
  6. Are you using Prometheus (or something compatible) that can hit OpenUnison's metrics endpoint? That will give us much better stats on concurrent users, stack utilization, etc. - https://openunison.github.io/knowledgebase/prometheus/ (a rough polling sketch follows this list)
  7. Please provide your values.yaml
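
For item 6, a rough sketch of polling that metrics endpoint manually is below; the /metrics path, the bearer-token auth, and the in-cluster host are all assumptions here, and the endpoint is presumed restricted to the service account named in PROMETHEUS_SERVICE_ACCOUNT (see the values.yaml later in the thread) - the linked knowledgebase page has the actual setup.

```python
# Sketch only: poll OpenUnison's Prometheus metrics endpoint.
# Assumptions: the path is /metrics, access is granted to the Prometheus
# service account's bearer token, and this runs from inside a pod that has
# a service-account token mounted at the standard path.
import ssl
import urllib.request

host = "openunison-orchestra.openunison.svc:8443"   # assumed in-cluster host
token = open("/var/run/secrets/kubernetes.io/serviceaccount/token").read()

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

req = urllib.request.Request("https://" + host + "/metrics",
                             headers={"Authorization": "Bearer " + token})
print(urllib.request.urlopen(req, context=ctx).read().decode("utf-8")[:2000])
```
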
shnigam2 commented 1 year ago

Hi @mlbiam ,

It keeps restarting at short intervals with the reason reported as an error, and we observe memory spikes at the same time, when it hits its memory limits. We tried increasing the limits as well, but it keeps consuming whatever limit we set. Below is the error we observed in the logs:

[2023-02-16 10:58:45,991][main] WARN  UrlHolder - Could not process url : ''
java.net.MalformedURLException: no protocol: 
    at java.net.URL.<init>(URL.java:645) ~[?:?]
    at java.net.URL.<init>(URL.java:541) ~[?:?]
    at java.net.URL.<init>(URL.java:488) ~[?:?]
    at com.tremolosecurity.config.util.UrlHolder.<init>(UrlHolder.java:125) [unison-sdk-1.0.24.jar:?]
    at com.tremolosecurity.config.util.UnisonConfigManagerImpl.addAppInternal(UnisonConfigManagerImpl.java:733) [unison-server-core-1.0.24.jar:?]
    at com.tremolosecurity.config.util.UnisonConfigManagerImpl.loadApplicationObjects(UnisonConfigManagerImpl.java:634) [unison-server-core-1.0.24.jar:?]
    at com.tremolosecurity.config.util.UnisonConfigManagerImpl.initialize(UnisonConfigManagerImpl.java:427) [unison-server-core-1.0.24.jar:?]
    at com.tremolosecurity.filter.UnisonServletFilter.init(UnisonServletFilter.java:360) [unison-server-core-1.0.24.jar:?]
    at com.tremolosecurity.openunison.OpenUnisonServletFilter.init(OpenUnisonServletFilter.java:118) [open-unison-classes-1.0.24.jar:?]
    at io.undertow.servlet.core.LifecyleInterceptorInvocation.proceed(LifecyleInterceptorInvocation.java:111) [undertow-servlet-2.2.12.Final.jar:2.2.12.Final]
    at io.undertow.servlet.core.ManagedFilter.createFilter(ManagedFilter.java:80) [undertow-servlet-2.2.12.Final.jar:2.2.12.Final]
    at io.undertow.servlet.core.DeploymentManagerImpl$2.call(DeploymentManagerImpl.java:594) [undertow-servlet-2.2.12.Final.jar:2.2.12.Final]
    at io.undertow.servlet.core.DeploymentManagerImpl$2.call(DeploymentManagerImpl.java:559) [undertow-servlet-2.2.12.Final.jar:2.2.12.Final]
    at io.undertow.servlet.core.ServletRequestContextThreadSetupAction$1.call(ServletRequestContextThreadSetupAction.java:42) [undertow-servlet-2.2.12.Final.jar:2.2.12.Final]
    at io.undertow.servlet.core.ContextClassLoaderSetupAction$1.call(ContextClassLoaderSetupAction.java:43) [undertow-servlet-2.2.12.Final.jar:2.2.12.Final]
    at io.undertow.servlet.core.DeploymentManagerImpl.start(DeploymentManagerImpl.java:601) [undertow-servlet-2.2.12.Final.jar:2.2.12.Final]
    at com.tremolosecurity.openunison.undertow.OpenUnisonOnUndertow.main(OpenUnisonOnUndertow.java:353) [openunison-on-undertow-1.0.24.jar:?]
[2023-02-16 10:58:46,013][main] INFO  BrokerHolder - Starting KahaDB with path /tmp/amq/unison-mq-local
[2023-02-16 10:58:46,199][main] INFO  BrokerService - Loaded the Bouncy Castle security provider at position: -1
[2023-02-16 10:58:46,202][main] INFO  BrokerHolder - Waiting for broker to start...
[2023-02-16 10:58:46,487][Thread-0] INFO  BrokerService - Using Persistence Adapter: KahaDBPersistenceAdapter[/tmp/amq/unison-mq-local]
[2023-02-16 10:58:47,302][Thread-0] INFO  PListStoreImpl - PListStore:[/tmp/unison-tmp-mq-local] started

We are looking into setting up Prometheus.

Regards Shobhit

mlbiam commented 1 year ago

That's a startup warning and can be ignored. What is the interval? Hourly? Daily?

Also, please provide logs from when the spike happens.

Finally, how long has this cluster been running openunison? Since the version you are running is pretty old, if there is a bug it's likely being triggered by something specific to your environment.
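
Since the container is restarting, the log from the previous (crashed) instance is usually the most useful one to capture around a spike; a rough sketch for pulling it with the Python client (the pod name below is a placeholder and the namespace is assumed):

```python
# Sketch only; "openunison-orchestra-xxxxxxxx-yyyyy" is a placeholder pod name
# and "openunison" an assumed namespace.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

log = v1.read_namespaced_pod_log(
    name="openunison-orchestra-xxxxxxxx-yyyyy",
    namespace="openunison",
    container="openunison-orchestra",
    previous=True,       # logs from the container instance that was restarted
    tail_lines=500,
)
print(log)
```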

ashish-dua commented 1 year ago

values.yaml for the affected cluster:

  deployment_data:
    liveness_probe_command:
    - /usr/local/openunison/bin/check_alive.py
    node_selectors: []
    pull_secret: jfrog-auth
    readiness_probe_command:
    - /usr/local/openunison/bin/check_alive.py
    - https://127.0.0.1:8443/auth/idp/k8sIdp/.well-known/openid-configuration
    - issuer
    resources:
      limits:
        cpu: 500m
        memory: 2.5Gi
      requests:
        cpu: 200m
        memory: 1024Mi
    tokenrequest_api:
      audience: api
      enabled: false
      expirationSeconds: 14400
  dest_secret: orchestra
  enable_activemq: false
  hosts:
  - annotations:
    - name: certmanager.k8s.io/cluster-issuer
      value: letsencrypt
    - name: kubernetes.io/ingress.class
      value: openunison
    ingress_name: openunison
    ingress_type: nginx
    names:
    - env_var: OU_HOST
      name: ***
    - env_var: K8S_DASHBOARD_HOST
      name: ***
    - env_var: K8S_API_HOST
      name: ***
      service_name: kube-oidc-proxy-orchestra
    secret_name: ou-tls-certificate
  image: ***
  key_store:
    key_pairs:
      create_keypair_template:
      - name: ou
        value: ***
      - name: o
        value: Stg
      - name: l
        value: ***
      - name: st
        value: North Virginia
      - name: c
        value: US
      keys:
      - create_data:
          ca_cert: true
          key_size: 2048
          server_name: openunison-orchestra.openunison.svc
          sign_by_k8s_ca: false
          subject_alternative_names:
          - ***
        import_into_ks: keypair
        name: unison-tls
      - create_data:
          ca_cert: true
          delete_pods_labels:
          - k8s-app=kubernetes-dashboard
          key_size: 2048
          secret_info:
            cert_name: dashboard.crt
            key_name: dashboard.key
            type_of_secret: Opaque
          server_name: kubernetes-dashboard.kubernetes-dashboard.svc
          sign_by_k8s_ca: false
          subject_alternative_names: []
          target_namespace: kubernetes-dashboard
        import_into_ks: certificate
        name: kubernetes-dashboard
        replace_if_exists: true
        tls_secret_name: kubernetes-dashboard-certs
      - create_data:
          ca_cert: true
          key_size: 2048
          server_name: unison-saml2-rp-sig
          sign_by_k8s_ca: false
          subject_alternative_names: []
        import_into_ks: keypair
        name: unison-saml2-rp-sig
    static_keys:
    - name: session-unison
      version: 1
    - name: lastmile-oidc
      version: 1
    trusted_certificates: []
    update_controller:
      days_to_expire: 10
      image: docker.io/tremolosecurity/kubernetes-artifact-deployment:1.1.0
      schedule: 0 2 * * *
  myvd_configmap: ""
  non_secret_data:
  - name: K8S_URL
    value: ***
  - name: SESSION_INACTIVITY_TIMEOUT_SECONDS
    value: "36000"
  - name: K8S_DASHBOARD_NAMESPACE
    value: kubernetes-dashboard
  - name: K8S_DASHBOARD_SERVICE
    value: kubernetes-dashboard
  - name: K8S_CLUSTER_NAME
    value: ***
  - name: K8S_IMPERSONATION
    value: "true"
  - name: PROMETHEUS_SERVICE_ACCOUNT
    value: system:serviceaccount:monitoring:prometheus-k8s
  - name: OIDC_CLIENT_ID
    value: 0oa7xnn4coSsw1dvD357
  - name: OIDC_IDP_AUTH_URL
    value: ***
  - name: OIDC_IDP_TOKEN_URL
    value: ***
  - name: OIDC_IDP_LIMIT_DOMAIN
    value: ""
  - name: SUB_CLAIM
    value: sub
  - name: EMAIL_CLAIM
    value: email
  - name: GIVEN_NAME_CLAIM
    value: given_name
  - name: FAMILY_NAME_CLAIM
    value: family_name
  - name: DISPLAY_NAME_CLAIM
    value: name
  - name: GROUPS_CLAIM
    value: groups
  - name: OIDC_USER_IN_IDTOKEN
    value: "false"
  - name: OIDC_IDP_USER_URL
    value: ***
  - name: OIDC_SCOPES
    value: openid email profile groups
  - name: OU_SVC_NAME
    value: openunison-orchestra.openunison.svc
  - name: K8S_TOKEN_TYPE
    value: legacy
  openunison_network_configuration:
    activemq_dir: /tmp/amq
    allowed_client_names: []
    ciphers:
    - *
    client_auth: none
    force_to_secure: true
    open_external_port: 80
    open_port: 8080
    path_to_deployment: /usr/local/openunison/work
    path_to_env_file: /etc/openunison/ou.env
    quartz_dir: /tmp/quartz
    secure_external_port: 443
    secure_key_alias: unison-tls
    secure_port: 8443
  replicas: 1
  secret_data:
  - K8S_DB_SECRET
  - unisonKeystorePassword
  - OIDC_CLIENT_SECRET
  source_secret: orchestra-secrets-source
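
Given the 2.5Gi limit in the resources block above, it can also help to track the pods' actual memory usage over time and compare it against that limit; a rough sketch reading the metrics.k8s.io API with the Python client (assumes metrics-server is installed and that the pods are in the "openunison" namespace):

```python
# Sketch only; requires metrics-server and assumes the "openunison" namespace.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

metrics = api.list_namespaced_custom_object(
    group="metrics.k8s.io", version="v1beta1",
    namespace="openunison", plural="pods",
)
for item in metrics["items"]:
    for c in item["containers"]:
        # usage["memory"] is a quantity string such as "1623412Ki";
        # compare it against the 2.5Gi limit from the values.yaml above
        print(item["metadata"]["name"], c["name"], c["usage"]["memory"])
```
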
mlbiam commented 1 year ago

Thanks for providing the config; nothing stands out. I still need to understand:

  1. How long has this cluster been running with openunison?
  2. What changed?
  3. Logs from when the spike occurred

Thanks

mlbiam commented 1 year ago

No new updates, closing.