cockroachdb / helm-charts

Helm charts for cockroachdb

CockroachDB doesn't start and init doesn't seem to launch #402

Open DelaunayAntoine opened 4 months ago

DelaunayAntoine commented 4 months ago

Hello everyone,

I would like to deploy CockroachDB using Helm, but the cluster can't start and this error keeps appearing:

Error I240718 13:09:06.023104 191 server/init.go:405 ⋮ [T1,Vsystem,n?] 37 ‹cockroachdb-1.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry

Can you help me by giving me some hints on how to fix the problem?

Here are the full log file and the values.yaml file: db.txt, values-cockroach.txt

I'm using CockroachDB version 24.1.1 with chart version 13.0.1.

What do you expect to see?

The cockroach cluster launching just fine

What happened

Error I240718 13:09:06.023104 191 server/init.go:405 ⋮ [T1,Vsystem,n?] 37 ‹cockroachdb-1.cockroachdb.cockroachdb.svc.cluster.local:26257› is itself waiting for init, will retry

lknite commented 2 months ago

I'm seeing this as well. Did you figure it out?

I see this in the log; it looks like it's trying the wrong URL for the pods:

W240826 16:03:30.251454 142 server/init.go:407 ⋮ [T1,Vsystem,n?] 37  outgoing join rpc to ‹keycloak-cockroachdb-1.keycloak-cockroachdb.keycloak.svc.cluster.local:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp: lookup keycloak-cockroachdb-1.keycloak-cockroachdb.keycloak.svc.cluster.local: no such host"›
I240826 16:03:30.258373 142 server/init.go:405 ⋮ [T1,Vsystem,n?] 38  ‹keycloak-cockroachdb-2.keycloak-cockroachdb.keycloak.svc.cluster.local:26257› is itself waiting for init, will retr

In my case it's adding 'keycloak-cockroachdb.' and in your case it's adding 'cockroachdb.', which looks like it shouldn't be there.
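For context, Kubernetes normally gives each StatefulSet pod a DNS name of the form `<pod>.<governing-service>.<namespace>.svc.<cluster-domain>`, so the segment in question may simply be the headless service name rather than a stray prefix. A minimal sketch of that composition follows; the `pod_fqdn` helper is hypothetical, and `cluster.local` is assumed as the cluster domain.

```shell
# Hypothetical helper: compose the DNS name Kubernetes assigns to a StatefulSet pod.
# Usage: pod_fqdn <pod> <headless-service> <namespace> [cluster-domain]
pod_fqdn() {
  printf '%s.%s.%s.svc.%s\n' "$1" "$2" "$3" "${4:-cluster.local}"
}

# Example with the names from the log above:
pod_fqdn keycloak-cockroachdb-1 keycloak-cockroachdb keycloak
```

The example prints the same host name that appears in the join RPC warning, which suggests the address format itself is the expected one.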

apavarnitsyn commented 1 month ago

I've got the same problem with the latest 14.0.3 chart. I suspect the cause is the Helm hook annotations on the init job template: https://github.com/cockroachdb/helm-charts/blob/master/cockroachdb/templates/job.init.yaml#L22 The post-install hook can't be triggered because the StatefulSet is not ready. As a workaround, you can deploy the init job manifest from the template manually.
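The workaround described above could look roughly like the following sketch. It assumes the chart repository was added under the alias `cockroachdb` (for example with `helm repo add cockroachdb https://charts.cockroachdb.com/`), that the init job still lives at `templates/job.init.yaml` in the chart version in use, and that the same values file as the original install is passed in. `apply_init_job` and its arguments are hypothetical names used for illustration.

```shell
# Hypothetical helper: render only the init job from the chart and apply it by hand.
# Usage: apply_init_job <release> <namespace> <chart-version> <values-file>
apply_init_job() {
  helm template "$1" cockroachdb/cockroachdb \
    --namespace "$2" \
    --version "$3" \
    --values "$4" \
    --show-only templates/job.init.yaml |
    kubectl apply --namespace "$2" -f -
}

# Example (placeholder names):
# apply_init_job my-release my-namespace 14.0.3 values.yaml
```

Because `kubectl` treats the `helm.sh/hook` annotations as ordinary annotations, the job is created right away instead of waiting for the hook. When the chart is pulled in as a dependency of another chart, the `--show-only` path would likely need a `charts/cockroachdb/` prefix.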

udnay commented 1 month ago

Are either of you able to share your values file? A redacted version is likely fine, just to see what overrides you have set. I have not been able to reproduce this with the default values.

lknite commented 1 week ago

@udnay, here ya go:

$ cat Chart.yaml 
apiVersion: v2
name: jellyfin
description: A Helm chart for Kubernetes

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
appVersion: "1.0"

dependencies:
- name: jellyfin
  version: 2.1.0
  repository: https://jellyfin.github.io/jellyfin-helm
- name: nats
  version: 1.1.10
  repository: https://nats-io.github.io/k8s/helm/charts/
- name: cockroachdb
  version: 14.0.5
  repository: https://charts.cockroachdb.com

$ cat values.yaml 
nats:

  natsBox:
    enabled: false

Result:

$ k --context prod-admin@prod -n jellyfin get ing,pvc,all
NAME                                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/datadir-jellyfin-cockroachdb-0   Bound    pvc-1098f4ee-2a61-4dc0-944e-d4af39b1e95a   100Gi      RWO            cephfs         <unset>                 11m
persistentvolumeclaim/datadir-jellyfin-cockroachdb-1   Bound    pvc-6daf0dfa-1f12-432d-9c22-8636433d1c82   100Gi      RWO            cephfs         <unset>                 11m
persistentvolumeclaim/datadir-jellyfin-cockroachdb-2   Bound    pvc-ff28ab8d-ea1a-46cd-9d07-ef8d6367899f   100Gi      RWO            cephfs         <unset>                 11m
persistentvolumeclaim/jellyfin-config                  Bound    pvc-de299b6e-a5b4-4926-a8da-e70f93c9fcfa   5Gi        RWO            cephfs         <unset>                 11m
persistentvolumeclaim/jellyfin-media                   Bound    pvc-f921cdea-c8a9-4c0c-a98b-fb46368fa90b   25Gi       RWO            cephfs         <unset>                 11m

NAME                            READY   STATUS    RESTARTS        AGE
pod/jellyfin-6898c4c4bf-m2jl6   1/1     Running   0               11m
pod/jellyfin-cockroachdb-0      0/1     Running   1 (5m3s ago)    11m
pod/jellyfin-cockroachdb-1      0/1     Running   1 (4m27s ago)   11m
pod/jellyfin-cockroachdb-2      0/1     Running   1 (4m25s ago)   11m
pod/jellyfin-nats-0             2/2     Running   0               11m

NAME                                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)              AGE
service/jellyfin                      ClusterIP   10.105.39.215   <none>        8096/TCP             11m
service/jellyfin-cockroachdb          ClusterIP   None            <none>        26257/TCP,8080/TCP   11m
service/jellyfin-cockroachdb-public   ClusterIP   10.110.42.45    <none>        26257/TCP,8080/TCP   11m
service/jellyfin-nats                 ClusterIP   10.97.28.92     <none>        4222/TCP             11m
service/jellyfin-nats-headless        ClusterIP   None            <none>        4222/TCP,8222/TCP    11m

Logs of each cockroachdb pod show:

$ k --context prod-admin@prod -n jellyfin logs -f jellyfin-cockroachdb-0 
Defaulted container "db" out of: db, copy-certs (init)
++ hostname
+ exec /cockroach/cockroach start --join=jellyfin-cockroachdb-0.jellyfin-cockroachdb.jellyfin.svc.cluster.local:26257,jellyfin-cockroachdb-1.jellyfin-cockroachdb.jellyfin.svc.cluster.local:26257,jellyfin-cockroachdb-2.jellyfin-cockroachdb.jellyfin.svc.cluster.local:26257 --advertise-host=jellyfin-cockroachdb-0.jellyfin-cockroachdb.jellyfin.svc.cluster.local --certs-dir=/cockroach/cockroach-certs/ --http-port=8080
*
* WARNING: Running a server without --sql-addr, with a combined RPC/SQL listener, is deprecated.
* This feature will be removed in a later version of CockroachDB.
*
*
* INFO: initial startup completed.
* Node will now attempt to join a running cluster, or wait for `cockroach init`.
* Client connections will be accepted after this completes successfully.
* Check the log file(s) for progress. 
*
*
* WARNING: The server appears to be unable to contact the other nodes in the cluster. Please try:
* 
* - starting the other nodes, if you haven't already;
* - double-checking that the '--join' and '--listen'/'--advertise' flags are set up correctly;
* - running the 'cockroach init' command if you are trying to initialize a new cluster.
* 
* If problems persist, please see https://www.cockroachlabs.com/docs/v24.2/cluster-setup-troubleshooting.html.
*
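
Since the log itself points at `cockroach init`, one thing worth checking is whether the chart's init job was ever created in the namespace. A small sketch using only standard `kubectl` commands follows; `show_init_state` is a hypothetical helper, and no job name is assumed, so it simply lists every job.

```shell
# Hypothetical helper: list jobs and pods in a namespace so a missing init job
# or pods stuck at 0/1 READY are easy to spot.
# Usage: show_init_state <context> <namespace>
show_init_state() {
  kubectl --context "$1" --namespace "$2" get jobs
  kubectl --context "$1" --namespace "$2" get pods
}

# Example with the context and namespace used above:
# show_init_state prod-admin@prod jellyfin
```

If no init job appears while the pods stay at 0/1, that would be consistent with the hook theory raised earlier in the thread.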