AzBuilder / terrakube-helm-chart

Helm chart to install Terrakube in any Kubernetes cluster
Apache License 2.0

Azure AKS Helm Deployment Error #100

Closed ArkShocer closed 8 months ago

ArkShocer commented 8 months ago

Hi, first of all thank you to everyone for the amazing tool! Sadly, I haven't gotten it working yet, but it looks really promising to me. Maybe someone can help me find the error in my deployment. When deploying Terrakube with chart version 3.14.2 on my AKS cluster (Kubernetes 1.28.3) via Helm (using Argo CD, but it should behave the same), I get errors on the API and executor pods.

Error on api pod:

Caused by: java.net.UnknownHostException: terrakube-redis-master
Caused by: java.net.UnknownHostException: terrakube-redis-master
Caused by: java.net.UnknownHostException: terrakube-redis-master
Caused by: java.net.UnknownHostException: terrakube-redis-master
Caused by: java.net.UnknownHostException: terrakube-redis-master
Caused by: java.net.UnknownHostException: terrakube-redis-master
Caused by: java.net.UnknownHostException: terrakube-redis-master
Caused by: java.net.UnknownHostException: terrakube-redis-master
Caused by: java.net.UnknownHostException: terrakube-redis-master
Caused by: java.net.UnknownHostException: terrakube-redis-master
Caused by: java.net.UnknownHostException: terrakube-redis-master
2024-03-01 14:38:29.962 ERROR 1 --- [_MisfireHandler] o.s.s.quartz.LocalDataSourceJobStore     : MisfireHandler: Error handling misfires: Couldn't retrieve trigger: No record found for selection of Trigger with key: 'DEFAULT.TerrakubeV2_ModuleRefresh' and statement: SELECT * FROM QRTZ_CRON_TRIGGERS WHERE SCHED_NAME = 'schedulerFactoryBean' AND TRIGGER_NAME = ? AND TRIGGER_GROUP = ?
org.quartz.JobPersistenceException: Couldn't retrieve trigger: No record found for selection of Trigger with key: 'DEFAULT.TerrakubeV2_ModuleRefresh' and statement: SELECT * FROM QRTZ_CRON_TRIGGERS WHERE SCHED_NAME = 'schedulerFactoryBean' AND TRIGGER_NAME = ? AND TRIGGER_GROUP = ?
Caused by: java.lang.IllegalStateException: No record found for selection of Trigger with key: 'DEFAULT.TerrakubeV2_ModuleRefresh' and statement: SELECT * FROM QRTZ_CRON_TRIGGERS WHERE SCHED_NAME = 'schedulerFactoryBean' AND TRIGGER_NAME = ? AND TRIGGER_GROUP = ?
Caused by: java.net.UnknownHostException: terrakube-redis-master
Caused by: java.net.UnknownHostException: terrakube-redis-master
Caused by: java.net.UnknownHostException: terrakube-redis-master
Caused by: java.net.UnknownHostException: terrakube-redis-master
Caused by: java.net.UnknownHostException: terrakube-redis-master
2024-03-01 14:39:17.605  INFO 1 --- [ionShutdownHook] org.quartz.core.QuartzScheduler          : Scheduler schedulerFactoryBean_$_terrakube-api-59fb9845c8-nh7fm1709303669667 paused.
2024-03-01 14:39:17.720  INFO 1 --- [ionShutdownHook] org.quartz.core.QuartzScheduler          : Scheduler schedulerFactoryBean_$_terrakube-api-59fb9845c8-nh7fm1709303669667 shutting down.
2024-03-01 14:39:17.720  INFO 1 --- [ionShutdownHook] org.quartz.core.QuartzScheduler          : Scheduler schedulerFactoryBean_$_terrakube-api-59fb9845c8-nh7fm1709303669667 paused.
2024-03-01 14:39:17.721  INFO 1 --- [ionShutdownHook] org.quartz.core.QuartzScheduler          : Scheduler schedulerFactoryBean_$_terrakube-api-59fb9845c8-nh7fm1709303669667 shutdown complete.

My values.yaml looks like the following:

## Azure Active Directory Security
security:
  useOpenLDAP: false
  adminGroup: "aad_example"
  patSecret: "example"
  internalSecret: "example"
  dexClientId: "microsoft"
  dexClientScope: "email openid profile offline_access groups"
  dexIssuerUri: "http://terrakube-api.example.dev/dex"

## Dex
dex:
  enabled: true
  existingSecret: false
  config:
    issuer: http://terrakube-api.aks.example.dev/dex
    storage:
      type: memory
    web:
      http: 0.0.0.0:5556
      allowedOrigins: ["*"]
      skipApprovalScreen: true
    oauth2:
      responseTypes: ["code", "token", "id_token"]
    connectors:
      - type: microsoft
        id: microsoft
        name: microsoft
        config:
          clientID: "example"
          clientSecret: "example"
          redirectURI: "http://terrakube-api.aks.example.dev/dex/callback"
          tenant: "organizations"
    staticClients:
      - id: microsoft
        redirectURIs:
          - "http://terrakube-ui.aks.example.dev"
          - "/device/callback"
          - "http://localhost:10000/login"
          - "http://localhost:10001/login"
        name: "microsoft"
        public: true

## Terraform Storage
storage:
  defaultStorage: false
  azure:
    storageAccountName: "XXXXX"
    storageAccountResourceGroup: "XXXXX"
    storageAccountAccessKey: "XXXXXXXX"

## API properties
api:
  enabled: true
  replicaCount: "1"
  serviceType: "ClusterIP"
  defaultDatabase: false
  loadSampleData: false
  properties:
    databaseType: "SQL_AZURE"
    databaseHostname: "example.database.windows.net"
    databaseName: "example"
    databaseUser: "example"
    databasePassword: "Easy12365"

## The database port is only used for mysql databases

## SslMode values are disable, allow, prefer, require, verify-ca, verify-full. Default mode is "disable".
## Reference: https://jdbc.postgresql.org/documentation/publicapi/org/postgresql/PGProperty.html#SSL_MODE

## Executor properties
executor:
  enabled: true
  replicaCount: "1"
  serviceType: "ClusterIP"
  properties:
    toolsRepository: "https://github.com/AzBuilder/terrakube-extensions"
    toolsBranch: "main"

## Registry properties
registry:
  enabled: true
  replicaCount: "1"
  serviceType: "ClusterIP"

## UI Properties
ui:
  enabled: true
  replicaCount: "1"
  serviceType: "ClusterIP"

## Ingress properties
ingress:
  useTls: true
  ui:
    enabled: true
    domain: "terrakube-ui.aks.example.dev"
    path: "/"
    pathType: "Prefix"
    annotations:
      nginx.ingress.kubernetes.io/use-regex: "true"
      cert-manager.io/cluster-issuer: letsencrypt-prod
  api:
    enabled: true
    domain: "terrakube-api.aks.example.dev"
    path: "/"
    pathType: "Prefix"
    annotations:
      nginx.ingress.kubernetes.io/use-regex: "true"
      nginx.ingress.kubernetes.io/configuration-snippet: "proxy_set_header Authorization $http_authorization;"
      cert-manager.io/cluster-issuer: letsencrypt-prod
  registry:
    enabled: true
    domain: "terrakube-reg.aks.example.dev"
    path: "/"
    pathType: "Prefix"
    annotations:
      nginx.ingress.kubernetes.io/use-regex: "true"
      nginx.ingress.kubernetes.io/configuration-snippet: "proxy_set_header Authorization $http_authorization;"
      cert-manager.io/cluster-issuer: letsencrypt-prod
  dex:
    enabled: true
    path: "/dex/"
    pathType: "Prefix"
    annotations:
      nginx.ingress.kubernetes.io/use-regex: "true"
      nginx.ingress.kubernetes.io/configuration-snippet: "proxy_set_header Authorization $http_authorization;"
      cert-manager.io/cluster-issuer: letsencrypt-prod

If it helps I can also post the full logs of the api or executor pod.

alfespa17 commented 8 months ago

By default, the Helm chart deploys a standalone Redis instance called "terrakube-redis-master" in your namespace. Could you check whether Redis was deployed?

https://github.com/AzBuilder/terrakube-helm-chart/blob/55815e6f882f66d139c164f10451bc7a572f39df/charts/terrakube/values.yaml#L156

ArkShocer commented 8 months ago

Yes, I can confirm Redis was deployed as expected.

image

Logs of Redis:

1:C 01 Mar 2024 14:46:33.396 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 01 Mar 2024 14:46:33.396 # Redis version=7.0.11, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 01 Mar 2024 14:46:33.396 # Configuration loaded
1:M 01 Mar 2024 14:46:33.396 * monotonic clock: POSIX clock_gettime
1:M 01 Mar 2024 14:46:33.397 * Running mode=standalone, port=6379.
1:M 01 Mar 2024 14:46:33.401 # Server initialized
1:M 01 Mar 2024 14:46:33.417 * Reading RDB base file on AOF loading...
1:M 01 Mar 2024 14:46:33.417 * Loading RDB produced by version 7.0.11
1:M 01 Mar 2024 14:46:33.417 * RDB age 91225 seconds
1:M 01 Mar 2024 14:46:33.417 * RDB memory usage when created 0.82 Mb
1:M 01 Mar 2024 14:46:33.417 * RDB is base AOF
1:M 01 Mar 2024 14:46:33.417 * Done loading RDB, keys loaded: 0, keys expired: 0.
1:M 01 Mar 2024 14:46:33.417 * DB loaded from base file appendonly.aof.1.base.rdb: 0.009 seconds
1:M 01 Mar 2024 14:46:33.417 * DB loaded from append only file: 0.009 seconds
1:M 01 Mar 2024 14:46:33.417 * Opening AOF incr file appendonly.aof.1.incr.aof on server start
1:M 01 Mar 2024 14:46:33.417 * Ready to accept connections

alfespa17 commented 8 months ago

When the API is deployed, it creates a secret (terrakube-api-secrets) that holds the Redis connection settings.

https://github.com/AzBuilder/terrakube-helm-chart/blob/55815e6f882f66d139c164f10451bc7a572f39df/charts/terrakube/templates/secrets-api.yaml#L22

It looks like the API cannot resolve the Redis service. I'm not sure whether this is a connectivity issue between your pods.

The same will happen with the executor component.

https://github.com/AzBuilder/terrakube-helm-chart/blob/55815e6f882f66d139c164f10451bc7a572f39df/charts/terrakube/templates/secrets-executor.yaml#L22
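The `UnknownHostException` in the API log is a name-resolution failure, so one way to narrow it down is to try resolving the Redis service name from inside a pod in the same namespace. The following is a minimal sketch, assuming Python is available in a debug pod; `resolve_host` is a hypothetical helper and not part of Terrakube or the chart.

```python
import socket


def resolve_host(hostname, port=6379):
    """Return the sorted IP addresses that hostname resolves to, or [] on failure.

    Hypothetical helper for debugging; the default port 6379 is the
    non-TLS Redis port mentioned in this thread.
    """
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Same class of failure that Java reports as UnknownHostException.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve_host("terrakube-redis-master")` from a pod in the Terrakube namespace and getting an empty list would be consistent with the error in the API log and would point at cluster DNS or the service name rather than at Redis itself.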

ArkShocer commented 8 months ago

I checked the connectivity between pods of a different app in the same Argo project (meaning they have the same network policies), and I can't find any connectivity issue between that app's pods. I have also now deployed an Azure Redis cache to test with an external Redis, but both pods still degrade after some time, even though I can see server and memory load on the external Redis.

ArkShocer commented 8 months ago

After some more digging in the charts, I found that the Redis config defaults to the non-TLS port 6379. Azure Redis cache does not enable that port; it uses 6380 (SSL) instead. So I allowed non-TLS connections and all errors were gone within seconds (yay). The only things I'm still clueless about are why the in-cluster Redis connections were failing and why Terrakube can't create an ingress connection right now.
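To tell a closed port apart from a TLS mismatch, it can help to probe both ports directly. This is a rough sketch under the assumptions stated in the comment above (6379 for plain connections, 6380 for TLS); `can_connect` is a hypothetical helper and the host name passed to it would be your own cache endpoint.

```python
import socket
import ssl


def can_connect(host, port, use_tls=False, timeout=3.0):
    """Return True if a TCP connection (and optionally a TLS handshake) succeeds.

    Hypothetical debugging helper; it does not speak the Redis protocol,
    it only checks that the port accepts connections.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if use_tls:
                # Verify the server certificate against the system trust store.
                context = ssl.create_default_context()
                with context.wrap_socket(sock, server_hostname=host):
                    pass
            return True
    except OSError:
        # Covers refused connections, timeouts, DNS failures, and TLS errors.
        return False
```

For example, `can_connect(host, 6379)` returning False while `can_connect(host, 6380, use_tls=True)` returns True would match the situation described here, where only the TLS port is open.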

alfespa17 commented 8 months ago

> I checked the connectivity between pods of a different app in the same Argo project (meaning they have the same network policies), and I can't find any connectivity issue between that app's pods. I have also now deployed an Azure Redis cache to test with an external Redis, but both pods still degrade after some time, even though I can see server and memory load on the external Redis.

Maybe you can check the Redis service name. For example, if you deploy the default Redis, it will create a Redis service with the hostname "terrakube-redis-master" (the one that is used by default).

But if you are using an external Redis in another namespace, it should be "yourredisservice.namespace", if I remember correctly.
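The naming rule mentioned here follows the usual Kubernetes service DNS pattern: a bare service name resolves within the same namespace, `service.namespace` works across namespaces, and the fully qualified form adds `.svc.<cluster domain>`. A small sketch of that rule follows; `service_dns_name` is a hypothetical helper, and `cluster.local` is the common default cluster domain, which your cluster may override.

```python
def service_dns_name(service, namespace=None, cluster_domain="cluster.local",
                     fully_qualified=False):
    """Build the DNS name for a Kubernetes Service.

    Hypothetical helper illustrating the naming convention only.
    """
    if fully_qualified:
        if namespace is None:
            raise ValueError("a namespace is required for a fully qualified name")
        return f"{service}.{namespace}.svc.{cluster_domain}"
    if namespace is None:
        # Short name: only resolvable from pods in the same namespace.
        return service
    # Cross-namespace form, as described in the comment above.
    return f"{service}.{namespace}"
```

With the default chart deployment the short form `terrakube-redis-master` should be enough, because the API and executor pods run in the same namespace as the Redis service.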

image

ArkShocer commented 8 months ago

With everything deployed locally, it looks like the picture below for me. Yesterday, before the merge, it looked exactly the same but didn't work. I think the ingress config merge probably fixed the connection issue between the Terrakube pods, but I will check back on this later this week and close the issue if it's gone by then.

image

alfespa17 commented 8 months ago

> With everything deployed locally, it looks like the picture below for me. Yesterday, before the merge, it looked exactly the same but didn't work. I think the ingress config merge probably fixed the connection issue between the Terrakube pods, but I will check back on this later this week and close the issue if it's gone by then.

By the way, in AKS I think you need to use NodePort instead of ClusterIP.

I think ClusterIP only works with an ingress controller like nginx; for cloud providers like GKE, EKS, or AKS you need to use NodePort with the native cloud ingress.

ArkShocer commented 8 months ago

ClusterIP is fine for our configuration, since we use the nginx ingress controller on our AKS cluster without a public IP. If we used the native solutions, then yes, NodePort or LoadBalancer would be better.