cockroachdb / helm-charts

Helm charts for cockroachdb
Apache License 2.0

Failed to create CockroachDB cluster using Helm provider in Terraform #400

Closed lin-crl closed 4 months ago

lin-crl commented 4 months ago

Describe the problem

When using the Terraform (TF) Helm provider to deploy CockroachDB (CRDB) on AKS, the init job never starts, so the cluster is never initialized. Running Helm commands directly against the same AKS cluster with the same cockroachdb-values.yaml, however, deploys the cluster successfully.

To Reproduce

Steps to reproduce the behavior:

  1. Set up an AKS cluster. Feel free to refer to here to create a test cluster.

  2. Use Terraform and the Helm provider to deploy CRDB. Create a Terraform file like this:

    terraform {
      required_version = ">=1.0"
      required_providers {
        helm = {
          source  = "hashicorp/helm"
          version = "=2.14.0"
        }
      }
    }
    provider "helm" {
      debug = true
      kubernetes {
        config_path = "~/.kube/config"
      }
    }
    resource "helm_release" "cockroachdb" {
      name       = "cockroachdb"
      repository = "https://charts.cockroachdb.com/"
      chart      = "cockroachdb"

      values = [
        file("${path.module}/cockroachdb-values.yaml")
      ]
    }

    Here is the custom cockroachdb-values.yaml:

    image:
      tag: v23.2.5
    conf:
      cache: 25%
      max-sql-memory: 25%

  3. Look at the Kubernetes pods: there is no init job pod. This can be verified in the Azure UI as well.

    NAME                                  READY   STATUS      RESTARTS   AGE
    cockroachdb-0                         0/1     Running     0          3m57s
    cockroachdb-1                         0/1     Running     0          3m57s
    cockroachdb-2                         0/1     Running     0          3m57s

  4. See the error in kubectl logs cockroachdb-0:

    I240709 23:23:41.103749 14 server/init.go:228 ⋮ [T1,Vsystem,n?] 43  awaiting `cockroach init` or join with an already initialized node
    W240709 23:23:41.110554 123 server/init.go:404 ⋮ [T1,Vsystem,n?] 44  outgoing join rpc to ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup cockroachdb-1.cockroachdb.default.svc.cluster.local: no such host"›
    I240709 23:23:41.119094 123 server/init.go:402 ⋮ [T1,Vsystem,n?] 45  ‹cockroachdb-2.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    W240709 23:23:42.124047 123 server/init.go:450 ⋮ [T1,Vsystem,n?] 46  outgoing join rpc to ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup cockroachdb-1.cockroachdb.default.svc.cluster.local: no such host"›
    I240709 23:23:43.129048 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 47  ‹cockroachdb-2.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    W240709 23:23:44.123785 123 server/init.go:450 ⋮ [T1,Vsystem,n?] 48  outgoing join rpc to ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup cockroachdb-1.cockroachdb.default.svc.cluster.local: no such host"›
    I240709 23:23:45.127099 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 49  ‹cockroachdb-2.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    W240709 23:23:46.125795 123 server/init.go:450 ⋮ [T1,Vsystem,n?] 50  outgoing join rpc to ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup cockroachdb-1.cockroachdb.default.svc.cluster.local: no such host"›
    I240709 23:23:47.126533 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 51  ‹cockroachdb-2.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    W240709 23:23:48.123168 123 server/init.go:450 ⋮ [T1,Vsystem,n?] 52  outgoing join rpc to ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup cockroachdb-1.cockroachdb.default.svc.cluster.local: no such host"›
    I240709 23:23:49.126276 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 53  ‹cockroachdb-2.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    W240709 23:23:50.125580 123 server/init.go:450 ⋮ [T1,Vsystem,n?] 54  outgoing join rpc to ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup cockroachdb-1.cockroachdb.default.svc.cluster.local: no such host"›
    I240709 23:23:51.126462 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 55  ‹cockroachdb-2.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    W240709 23:23:52.126154 123 server/init.go:450 ⋮ [T1,Vsystem,n?] 56  outgoing join rpc to ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup cockroachdb-1.cockroachdb.default.svc.cluster.local: no such host"›
    I240709 23:23:53.137873 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 57  ‹cockroachdb-2.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    W240709 23:23:54.122485 123 server/init.go:450 ⋮ [T1,Vsystem,n?] 58  outgoing join rpc to ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup cockroachdb-1.cockroachdb.default.svc.cluster.local: no such host"›
    I240709 23:23:55.125788 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 59  ‹cockroachdb-2.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    W240709 23:23:56.124232 123 server/init.go:450 ⋮ [T1,Vsystem,n?] 60  outgoing join rpc to ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup cockroachdb-1.cockroachdb.default.svc.cluster.local: no such host"›
    I240709 23:23:57.126330 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 61  ‹cockroachdb-2.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    W240709 23:23:58.124488 123 server/init.go:450 ⋮ [T1,Vsystem,n?] 62  outgoing join rpc to ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup cockroachdb-1.cockroachdb.default.svc.cluster.local: no such host"›
    I240709 23:23:59.127139 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 63  ‹cockroachdb-2.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    W240709 23:24:00.122725 123 server/init.go:450 ⋮ [T1,Vsystem,n?] 64  outgoing join rpc to ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup cockroachdb-1.cockroachdb.default.svc.cluster.local: no such host"›
    I240709 23:24:01.126040 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 65  ‹cockroachdb-2.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    .... <removed same error>
    W240709 23:24:04.123512 123 server/init.go:450 ⋮ [T1,Vsystem,n?] 68  outgoing join rpc to ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup cockroachdb-1.cockroachdb.default.svc.cluster.local: no such host"›
    I240709 23:24:05.126184 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 69  ‹cockroachdb-2.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    W240709 23:24:06.123945 123 server/init.go:450 ⋮ [T1,Vsystem,n?] 70  outgoing join rpc to ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup cockroachdb-1.cockroachdb.default.svc.cluster.local: no such host"›
    I240709 23:24:07.126304 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 71  ‹cockroachdb-2.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    W240709 23:24:08.126469 123 server/init.go:450 ⋮ [T1,Vsystem,n?] 72  outgoing join rpc to ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup cockroachdb-1.cockroachdb.default.svc.cluster.local: no such host"›
    I240709 23:24:09.128530 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 73  ‹cockroachdb-2.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    W240709 23:24:10.123691 123 server/init.go:450 ⋮ [T1,Vsystem,n?] 74  outgoing join rpc to ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup cockroachdb-1.cockroachdb.default.svc.cluster.local: no such host"›
    W240709 23:24:11.089036 548 1@cli/start.go:604 ⋮ [T1,Vsystem,n?] 75  The server appears to be unable to contact the other nodes in the cluster. Please try:
    W240709 23:24:11.089036 548 1@cli/start.go:604 ⋮ [T1,Vsystem,n?] 75 +
    W240709 23:24:11.089036 548 1@cli/start.go:604 ⋮ [T1,Vsystem,n?] 75 +- starting the other nodes, if you haven't already;
    W240709 23:24:11.089036 548 1@cli/start.go:604 ⋮ [T1,Vsystem,n?] 75 +- double-checking that the '--join' and '--listen'/'--advertise' flags are set up correctly;
    W240709 23:24:11.089036 548 1@cli/start.go:604 ⋮ [T1,Vsystem,n?] 75 +- running the 'cockroach init' command if you are trying to initialize a new cluster.
    W240709 23:24:11.089036 548 1@cli/start.go:604 ⋮ [T1,Vsystem,n?] 75 +
    W240709 23:24:11.089036 548 1@cli/start.go:604 ⋮ [T1,Vsystem,n?] 75 +If problems persist, please see ‹https://www.cockroachlabs.com/docs/v23.2/cluster-setup-troubleshooting.html›.
    I240709 23:24:11.125989 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 76  ‹cockroachdb-2.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    I240709 23:24:12.129267 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 77  ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    I240709 23:24:13.145371 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 78  ‹cockroachdb-2.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    I240709 23:24:14.128494 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 79  ‹cockroachdb-1.cockroachdb.default.svc.cluster.local:26257› is itself waiting for init, will retry
    I240709 23:24:15.126806 123 server/init.go:448 ⋮ [T1,Vsystem,n?] 80  
    <removed same retry messages>

    I also used dnsutils, and it shows that the host name can be resolved:

    (⎈|lin-k8s:N/A)lin@crlMBP-C02DT57JMD6TMzAy cockroachlabs-openai % kubectl exec -i -t dnsutils -- nslookup cockroachdb-1.cockroachdb.default.svc.cluster.local
    Server:         10.0.0.10
    Address:        10.0.0.10#53

    Name:   cockroachdb-1.cockroachdb.default.svc.cluster.local
    Address: 192.168.10.15

On the Terraform side, the apply timed out with an error:

    helm_release.cockroachdb: Still creating... [5m0s elapsed]
    helm_release.cockroachdb: Still creating... [5m10s elapsed]
    ╷
    │ Warning: Helm release "cockroachdb" was created but has a failed status. Use the helm command to investigate the error, correct it, then run Terraform again.
    │
    │   with helm_release.cockroachdb,
    │   on cockroachdb.tf line 22, in resource "helm_release" "cockroachdb":
    │   22: resource "helm_release" "cockroachdb" {
    │
    ╵
    ╷
    │ Error: context deadline exceeded
    │
    │   with helm_release.cockroachdb,
    │   on cockroachdb.tf line 22, in resource "helm_release" "cockroachdb":
    │   22: resource "helm_release" "cockroachdb" {
    │



Expected behavior

Able to deploy CRDB using the Helm provider in Terraform.

prafull01 commented 4 months ago

We are hitting the following issue in the hashicorp/helm Terraform provider: when wait is enabled, the provider doesn't execute the post-install hooks for the Helm release. The chart's init job runs as one of those hooks, so it never starts. https://github.com/hashicorp/terraform-provider-helm/issues/683

You can use the following Terraform script to avoid this issue. The only change needed is adding wait = false to the helm_release resource.

terraform {
  required_version = ">=1.0"
  required_providers {
    helm = {
      source  = "hashicorp/helm"
      version = "=2.14.0"
    }
  }
}
provider "helm" {
  debug   = true
  kubernetes {
    config_path = "~/.kube/config"
  }
}
resource "helm_release" "cockroachdb" {
  name       = "cockroachdb"
  repository = "https://charts.cockroachdb.com/"
  chart      = "cockroachdb"
  wait       = false
  values = [
    file("${path.module}/cockroachdb-values.yaml")
  ]
}
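
One side effect of wait = false is that Terraform returns as soon as the release is submitted, before the CockroachDB pods are ready. If later resources depend on a running cluster, a readiness check can be added separately. The block below is only a sketch under several assumptions that are not part of this thread: it relies on the hashicorp/null provider, on kubectl being on the PATH with the same kubeconfig as the helm provider, on the StatefulSet being named cockroachdb in the default namespace (inferred from the pod and DNS names above), and on a 10-minute timeout chosen arbitrarily. The resource name wait_for_cockroachdb is hypothetical.

```hcl
# Sketch only: blocks until the CockroachDB StatefulSet reports a completed rollout.
# Assumes kubectl is installed locally and points at the same cluster as the helm provider.
resource "null_resource" "wait_for_cockroachdb" {
  # Re-run the check whenever the Helm release gets a new revision.
  # Referencing helm_release.cockroachdb also makes this resource wait for it.
  triggers = {
    release_revision = helm_release.cockroachdb.metadata[0].revision
  }

  provisioner "local-exec" {
    # StatefulSet name and namespace are assumptions based on the pod names in this issue.
    command = "kubectl rollout status statefulset/cockroachdb --namespace default --timeout=10m"
  }
}
```

Anything that needs a ready cluster could then declare depends_on = [null_resource.wait_for_cockroachdb] instead of depending on the helm_release directly.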
lin-crl commented 4 months ago

@prafull01 appreciate your help. I confirm the workaround above worked. We can close this issue.