bitnami / charts

Bitnami Helm Charts

mongodb-sharded: pods crashing #3646

Closed: fgsalomon closed this issue 4 years ago

fgsalomon commented 4 years ago

Which chart: mongodb-sharded-2.1.1

Describe the bug: Pods keep crashing. It seems connections to the pods are being refused.

Mongos:

Liveness probe failed: MongoDB shell version v4.4.1
connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error connecting to 127.0.0.1:27017 :: caused by :: Connection refused :
connect@src/mongo/shell/mongo.js:374:17
@(connect):2:6
exception: connect failed
exiting with code 1

In the mongos logs:

2020-09-10T08:45:01.693819099Z time="2020-09-10T08:45:01Z" level=error msg="Problem gathering the mongo server version: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: localhost:27017, Type: Unknown, State: Connected, Average RTT: 0, Last error: connection() : dial tcp 127.0.0.1:27017: connect: connection refused }, ] }" source="mongodb_collector.go:195"
2020-09-10T08:45:01.693682609Z time="2020-09-10T08:45:01Z" level=error msg="Could not get MongoDB BuildInfo: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: localhost:27017, Type: Unknown, State: Connected, Average RTT: 0, Last error: connection() : dial tcp 127.0.0.1:27017: connect: connection refused }, ] }!" source="connection.go:84"

Arbiter:

Readiness probes are failing with: `Readiness probe failed: dial tcp 10.16.8.6:27017: connect: connection refused`

Shard:

connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error connecting to 127.0.0.1:27017 :: caused by :: Connection refused :
connect@src/mongo/shell/mongo.js:374:17
@(connect):2:6
exception: connect failed
exiting with code 1

To Reproduce

Expected behavior: MongoDB is working.

Version of Helm and Kubernetes:

Helm:

version.BuildInfo{Version:"v3.0.2", GitCommit:"19e47ee3283ae98139d98460de796c1be1e3975f", GitTreeState:"clean", GoVersion:"go1.13.5"}

Kubernetes:

Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.10", GitCommit:"575467a0eaf3ca1f20eb86215b3bde40a5ae617a", GitTreeState:"clean", BuildDate:"2019-12-11T12:41:00Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.42", GitCommit:"42bef28c2031a74fc68840fce56834ff7ea08518", GitTreeState:"clean", BuildDate:"2020-06-02T16:07:00Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
fgsalomon commented 4 years ago

If I use the default values (just setting the node selectors) with:

helm install mongodb bitnami/mongodb-sharded --namespace mongodb --set shardsvr.dataNode.nodeSelector.cloud\\.google\\.com/gke-nodepool=pool-mongodb,configsvr.nodeSelector.cloud\\.google\\.com/gke-nodepool=pool-mongodb,mongos.nodeSelector.cloud\\.google\\.com/gke-nodepool=pool-mongodb,shardsvr.arbiter.nodeSelector.cloud\\.google\\.com/gke-nodepool=pool-mongodb

I get the same results.
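
For reference, the same node selectors could also be provided through a values file instead of the escaped --set flags; a minimal sketch, assuming the value paths used in the command above (the file name is illustrative):

```
# Hypothetical values file reproducing the --set flags from the install command above.
cat > mongodb-sharded-values.yaml <<'EOF'
shardsvr:
  dataNode:
    nodeSelector:
      cloud.google.com/gke-nodepool: pool-mongodb
  arbiter:
    nodeSelector:
      cloud.google.com/gke-nodepool: pool-mongodb
configsvr:
  nodeSelector:
    cloud.google.com/gke-nodepool: pool-mongodb
mongos:
  nodeSelector:
    cloud.google.com/gke-nodepool: pool-mongodb
EOF

helm install mongodb bitnami/mongodb-sharded --namespace mongodb -f mongodb-sharded-values.yaml
```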

carrodher commented 4 years ago

Hi, are you able to reproduce the issue without setting the nodeSelector? Just helm install mongodb bitnami/mongodb-sharded.

With your current deployment, what is the result of kubectl get pods --namespace mongodb? It seems like a "networking" issue, because some nodes (or the application itself) are not reachable or are producing connection errors.

fgsalomon commented 4 years ago

> Hi, are you able to reproduce the issue without setting the nodeSelector? Just helm install mongodb bitnami/mongodb-sharded.
>
> With your current deployment, what is the result of kubectl get pods --namespace mongodb? It seems like a "networking" issue, because some nodes (or the application itself) are not reachable or are producing connection errors.

Without setting the nodeSelector it works:

kubectl get pods --namespace mongodbsharded                                                                            
NAME                                                     READY   STATUS    RESTARTS   AGE
mongodbsharded-mongodb-sharded-configsvr-0               1/1     Running   0          4m59s
mongodbsharded-mongodb-sharded-mongos-79cfc64446-xjl9s   1/1     Running   0          4m59s
mongodbsharded-mongodb-sharded-shard0-data-0             1/1     Running   0          4m59s
mongodbsharded-mongodb-sharded-shard1-data-0             1/1     Running   0          4m59s
carrodher commented 4 years ago

It's weird; can you double-check whether there is any issue with the nodeSelector syntax or the way you're setting it?

You can use kubectl get nodes --show-labels and compare the node labels with the nodeSelector values you are setting.
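
For example, a quick way to cross-check (a sketch, assuming the pool-mongodb label used in the install command above):

```
# Nodes that actually carry the label the nodeSelector expects; an empty list
# would mean no node can satisfy the selector.
kubectl get nodes -l cloud.google.com/gke-nodepool=pool-mongodb

# Full label sets for manual comparison.
kubectl get nodes --show-labels
```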

Can you connect to any pod and manually execute the readiness command (mongo --eval "db.adminCommand('ping')")? Is the above kubectl get pods output from a deployment with or without nodeSelector?

fgsalomon commented 4 years ago

> Is the above kubectl get pods output from a deployment with or without nodeSelector?

The above was without setting nodeSelector.

If I delete the chart and install it again with the nodeSelector set, kubectl get pods returns:

NAME                                                     READY   STATUS    RESTARTS   AGE
mongodbsharded-mongodb-sharded-configsvr-0               1/1     Running   0          6m50s
mongodbsharded-mongodb-sharded-mongos-7f56876864-8wn47   0/1     Running   2          6m50s
mongodbsharded-mongodb-sharded-shard0-data-0             0/1     Running   2          6m50s
mongodbsharded-mongodb-sharded-shard1-data-0             0/1     Running   2          6m50s

Only the config server is ready; the other pods keep crashing.

> Can you connect to any pod and manually execute the readiness command (mongo --eval "db.adminCommand('ping')")?

The config server returns this:

$ mongo --eval "db.adminCommand('ping')"
MongoDB shell version v4.4.1
connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("13cda720-ab2d-48b8-9b20-bdd771586608") }
MongoDB server version: 4.4.1
{
    "ok" : 1,
    "$gleStats" : {
        "lastOpTime" : Timestamp(0, 0),
        "electionId" : ObjectId("7fffffff0000000000000004")
    },
    "lastCommittedOpTime" : Timestamp(1599826672, 1),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1599826672, 1),
        "signature" : {
            "hash" : BinData(0,"JWicsLnKm3kg1ipTTP0myzhkxB0="),
            "keyId" : NumberLong("6871092725999992854")
        }
    },
    "operationTime" : Timestamp(1599826672, 1)
}

The shard and mongos pods both return:

$ mongo --eval "db.adminCommand('ping')"
MongoDB shell version v4.4.1
connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error connecting to 127.0.0.1:27017 :: caused by :: Connection refused :
connect@src/mongo/shell/mongo.js:374:17
@(connect):2:6
exception: connect failed
exiting with code 1

The labels on the nodes seem right:

gke-my-cluster-staging-pool-mongodb-59d757f0-2pmg    Ready    <none>   28h   v1.14.10-gke.42   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/fluentd-ds-ready=true,beta.kubernetes.io/instance-type=n1-standard-1,beta.kubernetes.io/os=linux,cloud.google.com/gke-nodepool=pool-mongodb,cloud.google.com/gke-os-distribution=cos,failure-domain.beta.kubernetes.io/region=europe-west1,failure-domain.beta.kubernetes.io/zone=europe-west1-b,kubernetes.io/arch=amd64,kubernetes.io/hostname=gke-my-cluster-staging-pool-mongodb-59d757f0-2pmg,kubernetes.io/os=linux

The node selectors on the pods also seem correct:

kubectl -n mongodbsharded describe pod mongodbsharded-mongodb-sharded-shard1-data-0 | grep Node-Selectors
Node-Selectors:  cloud.google.com/gke-nodepool=pool-mongodb
kubectl -n mongodbsharded describe pod mongodbsharded-mongodb-sharded-shard0-data-0 | grep Node-Selectors          
Node-Selectors:  cloud.google.com/gke-nodepool=pool-mongodb
kubectl -n mongodbsharded describe pod mongodbsharded-mongodb-sharded-mongos-7f56876864-8wn47 | grep Node-Selectors
Node-Selectors:  cloud.google.com/gke-nodepool=pool-mongodb
kubectl -n mongodbsharded describe pod mongodbsharded-mongodb-sharded-configsvr-0 | grep Node-Selectors            
Node-Selectors:  cloud.google.com/gke-nodepool=pool-mongodb

The output of the helm template ... command also matches:


      nodeSelector:
        cloud.google.com/gke-nodepool: pool-mongodb
      affinity:
        {}
      tolerations:
--
      nodeSelector:
        cloud.google.com/gke-nodepool: pool-mongodb
      affinity:
        {}
      tolerations:
--
      nodeSelector:
        cloud.google.com/gke-nodepool: pool-mongodb
      affinity:
        {}
      tolerations:
--
      nodeSelector:
        cloud.google.com/gke-nodepool: pool-mongodb
      affinity:
        {}
      tolerations:
carrodher commented 4 years ago

It's weird: as per your outputs, the nodeSelector is set in the same way in all the pods. At the same time, the ConfigServer is fully operational, its status is READY, and the readiness command works when run manually.

On the other hand, the rest of the pods are being restarted even with the same nodeSelector. The issue is probably the same in all of them, so let's pick one and run some commands:

kubectl describe pod PODNAME
kubectl logs -f PODNAME

where PODNAME is the name of one of the failing ones, for example mongodbsharded-mongodb-sharded-shard0-data-0.

As the pods are in a RUNNING state (but not ready), I guess the issue is only related to the probes. If it were something related to the label, the pods wouldn't be in a RUNNING state but in PENDING, and then the kubectl describe pod command would show something like:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  30s (x2 over 30s)  default-scheduler  0/3 nodes are available: 3 node(s) didn't match node selector.

But that is not the case here. We need to find out why the probes work on one pod but not on the rest; let's see if the describe or log commands give us more information.

fgsalomon commented 4 years ago

kubectl -n mongodbsharded describe pod mongodbsharded-mongodb-sharded-shard0-data-0

Output:

```
Name:               mongodbsharded-mongodb-sharded-shard0-data-0
Namespace:          mongodbsharded
Priority:           0
PriorityClassName:
Node:               gke-mycluster-pool-mongodb-59d757f0-8zrg/10.132.15.209
Start Time:         Mon, 14 Sep 2020 08:57:23 +0200
Labels:             app.kubernetes.io/component=shardsvr
                    app.kubernetes.io/instance=mongodbsharded
                    app.kubernetes.io/managed-by=Helm
                    app.kubernetes.io/name=mongodb-sharded
                    controller-revision-hash=mongodbsharded-mongodb-sharded-shard0-data-cc99fbdd
                    helm.sh/chart=mongodb-sharded-2.1.1
                    shard=0
                    statefulset.kubernetes.io/pod-name=mongodbsharded-mongodb-sharded-shard0-data-0
Annotations:
Status:             Running
IP:                 10.16.6.3
Controlled By:      StatefulSet/mongodbsharded-mongodb-sharded-shard0-data
Containers:
  mongodb:
    Container ID:  docker://89b76bc6848084e66f8741ec4d0c8251f402fa8259860035d41cba0f2f3e1e2d
    Image:         docker.io/bitnami/mongodb-sharded:4.4.1-debian-10-r0
    Image ID:      docker-pullable://bitnami/mongodb-sharded@sha256:54efa9d89cf66670a02ff1403f1af6977ca9a44085dc76d48d543020a9b82da0
    Port:          27017/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
      /entrypoint/replicaset-entrypoint.sh
    State:          Running
      Started:      Mon, 14 Sep 2020 08:57:40 +0200
    Ready:          False
    Restart Count:  0
    Liveness:       exec [pgrep mongod] delay=60s timeout=5s period=10s #success=1 #failure=6
    Readiness:      exec [mongo --eval db.adminCommand('ping')] delay=60s timeout=5s period=10s #success=1 #failure=6
    Environment:
      MONGODB_SYSTEM_LOG_VERBOSITY:     0
      MONGODB_MAX_TIMEOUT:              120
      MONGODB_DISABLE_SYSTEM_LOG:       no
      MONGODB_SHARDING_MODE:            shardsvr
      MONGODB_POD_NAME:                 mongodbsharded-mongodb-sharded-shard0-data-0 (v1:metadata.name)
      MONGODB_MONGOS_HOST:              mongodbsharded-mongodb-sharded
      MONGODB_INITIAL_PRIMARY_HOST:     mongodbsharded-mongodb-sharded-shard0-data-0.mongodbsharded-mongodb-sharded-headless.mongodbsharded.svc.cluster.local
      MONGODB_REPLICA_SET_NAME:         mongodbsharded-mongodb-sharded-shard-0
      MONGODB_ADVERTISED_HOSTNAME:      $(MONGODB_POD_NAME).mongodbsharded-mongodb-sharded-headless.mongodbsharded.svc.cluster.local
      MONGODB_ROOT_PASSWORD:            Optional: false
      MONGODB_REPLICA_SET_KEY:          Optional: false
      MONGODB_ENABLE_IPV6:              no
      MONGODB_ENABLE_DIRECTORY_PER_DB:  no
    Mounts:
      /bitnami/mongodb from datadir (rw)
      /entrypoint from replicaset-entrypoint-configmap (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-w2fwq (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  datadir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  datadir-mongodbsharded-mongodb-sharded-shard0-data-0
    ReadOnly:   false
  replicaset-entrypoint-configmap:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      mongodbsharded-mongodb-sharded-replicaset-entrypoint
    Optional:  false
  default-token-w2fwq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-w2fwq
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  cloud.google.com/gke-nodepool=pool-mongodb
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                From                                               Message
  ----     ------                  ----               ----                                               -------
  Normal   Scheduled               2m5s               default-scheduler                                  Successfully assigned mongodbsharded/mongodbsharded-mongodb-sharded-shard0-data-0 to gke-mycluster-pool-mongodb-59d757f0-8zrg
  Normal   SuccessfulAttachVolume  119s               attachdetach-controller                            AttachVolume.Attach succeeded for volume "pvc-d78c3b84-f3ec-11ea-b314-42010a840022"
  Normal   Pulled                  108s               kubelet, gke-mycluster-pool-mongodb-59d757f0-8zrg  Container image "docker.io/bitnami/mongodb-sharded:4.4.1-debian-10-r0" already present on machine
  Normal   Created                 108s               kubelet, gke-mycluster-pool-mongodb-59d757f0-8zrg  Created container mongodb
  Normal   Started                 108s               kubelet, gke-mycluster-pool-mongodb-59d757f0-8zrg  Started container mongodb
  Warning  Unhealthy               8s (x5 over 48s)   kubelet, gke-mycluster-pool-mongodb-59d757f0-8zrg  Liveness probe failed:
  Warning  Unhealthy               6s (x5 over 46s)   kubelet, gke-mycluster-pool-mongodb-59d757f0-8zrg  Readiness probe failed: MongoDB shell version v4.4.1
           connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
           Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error connecting to 127.0.0.1:27017 :: caused by :: Connection refused :
           connect@src/mongo/shell/mongo.js:374:17
           @(connect):2:6
           exception: connect failed
           exiting with code 1
```
kubectl -n mongodbsharded logs mongodbsharded-mongodb-sharded-shard0-data-0

outputs:

 07:04:45.87 INFO  ==> Setting node as primary
mongodb 07:04:45.90 
mongodb 07:04:45.90 Welcome to the Bitnami mongodb-sharded container
mongodb 07:04:45.90 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-mongodb-sharded
mongodb 07:04:45.90 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-mongodb-sharded/issues
mongodb 07:04:45.90 
mongodb 07:04:45.90 INFO  ==> ** Starting MongoDB Sharded setup **
mongodb 07:04:45.93 INFO  ==> Validating settings in MONGODB_* env vars...
mongodb 07:04:45.94 INFO  ==> Initializing MongoDB Sharded...
mongodb 07:04:45.96 INFO  ==> Writing keyfile for replica set authentication...
mongodb 07:04:45.97 INFO  ==> Enabling authentication...
mongodb 07:04:45.97 INFO  ==> Deploying MongoDB Sharded with persisted data...
mongodb 07:04:45.99 INFO  ==> Trying to connect to MongoDB server mongodbsharded-mongodb-sharded...
timeout reached before the port went into state "inuse"

I've done some tests these past days (installing/deleting the chart, increasing/decreasing the number of nodes, etc.) and I can no longer get MongoDB working, even without setting the node selectors, so it seems that wasn't the issue. Anyway, the outputs above are from a chart installed with the node selector set, since I need that for the production cluster.
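
The last log line suggests the entrypoint timed out while waiting for the mongos service (MONGODB_MONGOS_HOST in the describe output above) to accept connections. A hedged sketch of a manual check from inside the failing pod, reusing the pod and service names that appear earlier in this thread:

```
# Can the shard pod reach the mongos service it is waiting for? (ping does not require auth)
kubectl -n mongodbsharded exec -it mongodbsharded-mongodb-sharded-shard0-data-0 -- \
  mongo --host mongodbsharded-mongodb-sharded --eval "db.adminCommand('ping')"

# Is mongod running locally inside the shard pod at all? (same check as the liveness probe)
kubectl -n mongodbsharded exec mongodbsharded-mongodb-sharded-shard0-data-0 -- pgrep mongod
```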

carrodher commented 4 years ago

Ok, it is clear that the rest of the pods are failing because of some kind of unreachability. Although everything seems properly configured in terms of nodeSelector, I have been trying to reproduce the issue, but without luck.

I installed the chart following two different approaches: first with a nodeSelector pinning all the pods to the same node, and then without any nodeSelector.

Then, by running kubectl describe pod POD_NAME | grep 'Node' on all the pods, I was able to see that the assigned node (Node: ...) and the desired node (Node-Selectors: ...) match, so the pods are scheduled on the desired node:

kubectl describe pod mongodb-mongodb-sharded-configsvr-0 | grep 'Node' && \
kubectl describe pod mongodb-mongodb-sharded-mongos-7f965f859-chs2q | grep 'Node' && \
kubectl describe pod mongodb-mongodb-sharded-shard0-data-0 | grep 'Node' && \
kubectl describe pod mongodb-mongodb-sharded-shard1-data-0 | grep 'Node'

Node:         gke-carlos-cluster-default-pool-fa254d68-v9x6/10.142.0.44
Node-Selectors:  kubernetes.io/hostname=gke-carlos-cluster-default-pool-fa254d68-v9x6
Node:         gke-carlos-cluster-default-pool-fa254d68-v9x6/10.142.0.44
Node-Selectors:  kubernetes.io/hostname=gke-carlos-cluster-default-pool-fa254d68-v9x6
Node:         gke-carlos-cluster-default-pool-fa254d68-v9x6/10.142.0.44
Node-Selectors:  kubernetes.io/hostname=gke-carlos-cluster-default-pool-fa254d68-v9x6
Node:         gke-carlos-cluster-default-pool-fa254d68-v9x6/10.142.0.44
Node-Selectors:  kubernetes.io/hostname=gke-carlos-cluster-default-pool-fa254d68-v9x6

In theory, this is the scenario you are looking for, but in this case, everything is up and running:

$ kubectl get pods
NAME                                              READY   STATUS    RESTARTS   AGE
mongodb-mongodb-sharded-configsvr-0               1/1     Running   0          10m
mongodb-mongodb-sharded-mongos-7f965f859-chs2q    1/1     Running   0          10m
mongodb-mongodb-sharded-shard0-data-0             1/1     Running   0          10m
mongodb-mongodb-sharded-shard1-data-0             1/1     Running   0          10m

Then I installed the chart without setting any nodeSelector:

$ helm install mongodb2 bitnami/mongodb-sharded

As expected, each pod may (or may not) be scheduled on a different node:

kubectl describe pod mongodb2-mongodb-sharded-configsvr-0 | grep 'Node' && \
kubectl describe pod mongodb2-mongodb-sharded-mongos-c8cc5f68b-w6hsv | grep 'Node' && \
kubectl describe pod mongodb2-mongodb-sharded-shard0-data-0 | grep 'Node' && \
kubectl describe pod mongodb2-mongodb-sharded-shard1-data-0 | grep 'Node'

Node:         gke-carlos-cluster-default-pool-fa254d68-v9x6/10.142.0.44
Node-Selectors:  <none>
Node:         gke-carlos-cluster-default-pool-fa254d68-9se9/10.142.0.46
Node-Selectors:  <none>
Node:         gke-carlos-cluster-default-pool-fa254d68-v9x6/10.142.0.44
Node-Selectors:  <none>
Node:         gke-carlos-cluster-default-pool-fa254d68-ak6i/10.142.0.47
Node-Selectors:  <none>

In this case, Node-Selectors is empty, as I didn't specify anything, so the pods are spread across different nodes.

Also in this case everything is up and running:

$ kubectl get pods
NAME                                              READY   STATUS    RESTARTS   AGE
mongodb2-mongodb-sharded-configsvr-0              1/1     Running   0          17m
mongodb2-mongodb-sharded-mongos-c8cc5f68b-w6hsv   1/1     Running   0          17m
mongodb2-mongodb-sharded-shard0-data-0            1/1     Running   0          17m
mongodb2-mongodb-sharded-shard1-data-0            1/1     Running   0          17m
fgsalomon commented 4 years ago

> Ok, it is clear that the rest of the pods are failing because of some kind of unreachability. Although everything seems properly configured in terms of nodeSelector, I have been trying to reproduce the issue, but without luck.

How can I find out what causes this unreachability? I'm new to MongoDB and don't know where to start looking. What baffles me is that the behavior is not deterministic: I've deployed the chart many times this morning with different combinations and a couple of times it did work. I guess the issue has to be related to the state of my cluster, but I don't see how.

carrodher commented 4 years ago

Are you able to connect to your MongoDB by following the instructions that appear in the installation notes? You can see those instructions at any time by running helm get notes NAME:

$ helm get notes mongodb
NOTES:
** Please be patient while the chart is being deployed **

The MongoDB Sharded cluster can be accessed via the Mongos instances in port 27017 on the following DNS name from within your cluster:

    mongodb-mongodb-sharded.default.svc.cluster.local

To get the root password run:

    export MONGODB_ROOT_PASSWORD=$(kubectl get secret --namespace default mongodb-mongodb-sharded -o jsonpath="{.data.mongodb-root-password}" | base64 --decode)

To connect to your database run the following command:

    kubectl run --namespace default mongodb-mongodb-sharded-client --rm --tty -i --restart='Never' --image docker.io/bitnami/mongodb-sharded:4.4.1-debian-10-r0 --command -- mongo admin --host mongodb-mongodb-sharded

To connect to your database from outside the cluster execute the following commands:

    kubectl port-forward --namespace default svc/mongodb-mongodb-sharded 27017:27017 &
    mongo --host 127.0.0.1 --authenticationDatabase admin -p $MONGODB_ROOT_PASSWORD

Maybe kubectl get events can help to see if there is something else in the cluster that prevents the chart from being fully operational.
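
For example (namespace taken from the outputs above; the sort flag is only a convenience):

```
# Recent events for the release namespace, oldest first, to spot scheduling, volume or probe problems.
kubectl get events --namespace mongodbsharded --sort-by='.metadata.creationTimestamp'
```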

Apart from that, I would check whether you have any kind of networking restriction in your cluster. For this case it may be more useful to check the GKE support pages or forums, as the issue does not seem related to the chart itself.
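
One quick check on that front (a sketch, assuming NetworkPolicies or a similar restriction could be involved):

```
# Any NetworkPolicy in the namespace could be blocking pod-to-pod traffic on port 27017.
kubectl get networkpolicies --namespace mongodbsharded
```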

fgsalomon commented 4 years ago

> mongo --host 127.0.0.1 --authenticationDatabase admin -p $MONGODB_ROOT_PASSWORD

Yes, on the rare occasions when the chart deploys successfully, I can connect to MongoDB. I think you are right and the issue is not related to the chart itself, so I will check the GKE cluster. Thank you very much for your help, @carrodher!