apache / openwhisk-deploy-kube

The Apache OpenWhisk Kubernetes Deployment repository supports deploying the Apache OpenWhisk system on Kubernetes and OpenShift clusters.
https://openwhisk.apache.org/
Apache License 2.0

couchdb not available #678

Closed jpchev closed 3 years ago

jpchev commented 3 years ago

Hello, I'm struggling to deploy OpenWhisk on a Kubernetes cluster. It seems the owdev-init-couchdb pod doesn't complete, as it keeps waiting for CouchDB:

kubectl get pod -n openwhisk -w
NAME                                   READY   STATUS      RESTARTS   AGE
owdev-alarmprovider-687f79859b-rmmzq   0/1     Init:0/1    0          5m22s
owdev-apigateway-bccbbcd67-tt7s5       1/1     Running     0          5m22s
owdev-controller-0                     0/1     Init:1/2    0          5m22s
owdev-couchdb-595b88565-qlh62          1/1     Running     0          5m22s
owdev-gen-certs-5p8rd                  0/1     Completed   0          5m22s
owdev-init-couchdb-xvs4d               0/1     Completed   0          5m22s
owdev-install-packages-862s4           0/1     Init:0/1    0          5m22s
owdev-invoker-0                        0/1     Init:0/1    0          5m16s
owdev-kafka-0                          1/1     Running     2          5m16s
owdev-kafkaprovider-5574d4bf5f-stlxj   0/1     Init:0/1    0          5m22s
owdev-nginx-86749d59cb-5dkln           0/1     Init:0/1    0          5m22s
owdev-redis-5dc8d75b55-wvpdh           1/1     Running     0          5m22s
owdev-zookeeper-0                      1/1     Running     0          4m

The last line of kubectl describe pod owdev-controller-0 -n openwhisk says: Normal Started 3m23s kubelet Started container wait-for-couchdb

I have disabled persistence in mycluster.yaml.
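
For reference, a quick way to see what the blocked init containers print (container and pod names are taken from the outputs above) would be something like:

# init container of the controller that is stuck in Init:1/2
kubectl logs owdev-controller-0 -n openwhisk -c wait-for-couchdb
# the CouchDB init job pod
kubectl logs owdev-init-couchdb-xvs4d -n openwhisk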

jpchev commented 3 years ago

It seems that this line waits forever:

https://github.com/apache/openwhisk-deploy-kube/blob/1.0.0/helm/openwhisk/configMapFiles/initCouchDB/initdb.sh#L52

Why is it that CouchDB is not found?

This behaviour is random; sometimes the error doesn't occur. I've reduced the number of k8s nodes to one, and now I can create and call actions.
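
For context, the loop at that line is essentially a readiness poll along these lines (a sketch inferred from the curl it runs, quoted later in this thread, not a verbatim copy of initdb.sh):

# sketch: poll CouchDB's /_utils endpoint until it answers
# note: the curl has no timeout, which is why it can appear to hang forever
while ! curl --output /dev/null --silent "$DB_PROTOCOL://$DB_HOST:$DB_PORT/_utils"; do
  echo "waiting for CouchDB at $DB_HOST:$DB_PORT"
  sleep 2
done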

rabbah commented 3 years ago

Try kubectl describe pod on the couch pod to see some details?
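
For example, something like this (selecting by the name=owdev-couchdb label shown in the describe output below, so the generated pod suffix doesn't matter):

kubectl describe pod -n openwhisk -l name=owdev-couchdb
kubectl logs -n openwhisk -l name=owdev-couchdb --tail=20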

jpchev commented 3 years ago

kubectl describe pod owdev-couchdb-595b88565-8dtkj -n openwhisk

Name:         owdev-couchdb-595b88565-8dtkj
Namespace:    openwhisk
Priority:     0
Node:         xxx-w1-vm/192.168.1.37
Start Time:   Fri, 05 Mar 2021 15:09:45 +0000
Labels:       app=owdev-openwhisk
              chart=openwhisk-1.0.0
              heritage=Helm
              name=owdev-couchdb
              pod-template-hash=595b88565
              release=owdev
Annotations:  <none>
Status:       Running
IP:           10.44.0.6
IPs:
  IP:           10.44.0.6
Controlled By:  ReplicaSet/owdev-couchdb-595b88565
Containers:
  couchdb:
    Container ID:   docker://158ea6fce985a5de40a38cf2e21b877f04e2794a5961d5c7e93004b0f31f358c
    Image:          apache/couchdb:2.3
    Image ID:       docker-pullable://apache/couchdb@sha256:9f895c8ae371cb895541e53100e039ac6ae5d30f6f0b199e8470d81d523537ad
    Port:           5984/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Fri, 05 Mar 2021 15:11:09 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      COUCHDB_USER:      <set to the key 'db_username' in secret 'owdev-db.auth'>  Optional: false
      COUCHDB_PASSWORD:  <set to the key 'db_password' in secret 'owdev-db.auth'>  Optional: false
      NODENAME:          couchdb0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-trpk6 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-trpk6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-trpk6
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 openwhisk-role=core:NoSchedule
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  3m43s  default-scheduler  Successfully assigned openwhisk/owdev-couchdb-595b88565-8dtkj to xxx-w1-vm
  Normal  Pulling    3m29s  kubelet            Pulling image "apache/couchdb:2.3"
  Normal  Pulled     2m28s  kubelet            Successfully pulled image "apache/couchdb:2.3" in 1m1.078872118s
  Normal  Created    2m19s  kubelet            Created container couchdb
  Normal  Started    2m19s  kubelet            Started container couchdb
dgrove-oss commented 3 years ago

I would investigate whether there is something misconfigured that is resulting in the curl command at https://github.com/apache/openwhisk-deploy-kube/blob/1.0.0/helm/openwhisk/configMapFiles/initCouchDB/initdb.sh#L51 not succeeding. Try doing a kubectl exec into the pod for the couchdb init job and executing that same curl manually.
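
A sketch of that check (the pod and service names are the ones from the listings above; the exec only works while the init pod is still running):

kubectl exec -it owdev-init-couchdb-xvs4d -n openwhisk -- sh
# then, inside the pod:
curl -v http://owdev-couchdb.openwhisk.svc.cluster.local:5984/_utils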

jpchev commented 3 years ago

From a bash shell inside the CouchDB init pod, the following command (corresponding to curl --output /dev/null --silent $DB_PROTOCOL://$DB_HOST:$DB_PORT/_utils) hangs:

curl --output /dev/null http://owdev-couchdb.openwhisk.svc.cluster.local:5984/_utils

whereas the same command works when using the IP of the CouchDB pod directly: curl --output /dev/null http://10.44.0.4:5984/_utils

So it seems that traffic to the couchdb service (which owdev-couchdb.openwhisk.svc.cluster.local resolves to) does not reach the pod behind it.

This seems to happen randomly after installing OpenWhisk: sometimes the CouchDB init pod completes (after which I have other availability problems when creating functions, but that's for another issue...).

I'm not sure, but I would say this issue doesn't happen when the CouchDB pod and the CouchDB init pod are deployed on the same Kubernetes node.
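
One way to separate name resolution from routing here (the service name is the one above; the cluster IP appears in the iptables rules below) might be:

# does the service name resolve?
nslookup owdev-couchdb.openwhisk.svc.cluster.local
# does the resolved cluster IP answer, bypassing DNS?
curl -v --max-time 5 http://10.99.12.131:5984/_utils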

I've checked the iptables rules for the couchdb service; the routing seems to be in place:

sudo iptables-save | grep couchdb
-A KUBE-SEP-SNMSYAE7RYOIOXBL -s 10.44.0.4/32 -m comment --comment "openwhisk/owdev-couchdb:couchdb" -j KUBE-MARK-MASQ
-A KUBE-SEP-SNMSYAE7RYOIOXBL -p tcp -m comment --comment "openwhisk/owdev-couchdb:couchdb" -m tcp -j DNAT --to-destination 10.44.0.4:5984
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.99.12.131/32 -p tcp -m comment --comment "openwhisk/owdev-couchdb:couchdb cluster IP" -m tcp --dport 5984 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.99.12.131/32 -p tcp -m comment --comment "openwhisk/owdev-couchdb:couchdb cluster IP" -m tcp --dport 5984 -j KUBE-SVC-4DTNEZVJQQUBR77C
-A KUBE-SVC-4DTNEZVJQQUBR77C -m comment --comment "openwhisk/owdev-couchdb:couchdb" -j KUBE-SEP-SNMSYAE7RYOIOXBL

and the endpoints are in place

kubectl get endpoints -n openwhisk
NAME               ENDPOINTS                                      AGE
owdev-apigateway   10.44.0.5:9000,10.44.0.5:8080                  29m
owdev-controller                                                  29m
owdev-couchdb      10.44.0.4:5984                                 29m
owdev-kafka        10.36.0.3:9092                                 29m
owdev-nginx                                                       29m
owdev-redis        10.36.0.1:6379                                 29m
owdev-zookeeper    10.44.0.7:2181,10.44.0.7:2888,10.44.0.7:3888   29m
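
One additional check (assuming kube-proxy runs in iptables mode, as the rules above suggest) would be to look at the packet counters on those chains, to see whether traffic from the init pod's node ever reaches the DNAT rule:

# -v shows packet/byte counters per rule; the chain names are from the iptables-save output above
sudo iptables -t nat -L KUBE-SVC-4DTNEZVJQQUBR77C -n -v
sudo iptables -t nat -L KUBE-SEP-SNMSYAE7RYOIOXBL -n -v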

Do you know where I should look to get the couchdb cluster IP service routed to the backing pod?

Do you have any idea?

Thanks

jpchev commented 3 years ago

It seems the problem got fixed after I replaced the Kubernetes CNI layer: I was using Weave, and I switched to Calico.
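
For anyone hitting the same thing, a way to check which CNI pods are deployed and whether they (and kube-proxy) are healthy on every node might be the following (the k8s-app=kube-proxy label assumes a standard kubeadm-style cluster):

kubectl get pods -n kube-system -o wide | grep -Ei 'weave|calico|kube-proxy'
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=20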

satwikk commented 3 years ago

Strange how this is an issue for the openwhisk namespace only. We faced this issue and found that the init-db job could not connect to couchdb.openwhisk.svc.cluster.local, and curl http://couchdb.openwhisk.svc.cluster.local:5984 timed out when run from another pod such as wskadmin. We could resolve DNS properly, but connections to couchdb failed. When we deploy OpenWhisk using the Helm charts on our own Kubernetes setup (on VMs), it requires other services such as dynamic volumes (which we provide using Rook or NFS), and those services have no issues with service communication. I guess we need to look into this more deeply, as replacing Weave with Calico might not always be an option.
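
A sketch of how to separate DNS resolution from raw connectivity here (the service name and pod IP are taken from the outputs further down in this thread):

nslookup fn-couchdb.fn.svc.cluster.local
# via the service name
wget -T 5 --spider http://fn-couchdb.fn.svc.cluster.local:5984/
# via the pod IP directly, bypassing the service
wget -T 5 --spider http://10.32.0.14:5984/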

rabbah commented 3 years ago

Is the port open/accessible?

satwikk commented 3 years ago

Yes, they were accessible in a way: internally within their own pod, and between pods on the same node. We were running K8s 1.18.18 with Weave 2.8.1 on CentOS 7 (kernel 3.10.0-1160.24.1).

After reproducing this issue, we went back to square one; this might not be an issue with the OpenWhisk deployment or the ow namespace only.

Sharing a case below, showing where we can connect (telnet) to pods and where we cannot:

[root@rook-ceph-tools-9wbw2 /]# nslookup rook-ceph-mgr.rook-ceph.svc.cluster.local
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   rook-ceph-mgr.rook-ceph.svc.cluster.local
Address: 10.101.148.210

[root@rook-ceph-tools-9wbw2 /]# telnet rook-ceph-mgr.rook-ceph.svc.cluster.local 9283
Trying 10.101.148.210...
Connected to rook-ceph-mgr.rook-ceph.svc.cluster.local.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
[root@rook-ceph-tools-9wbw2 /]# nslookup fn-couchdb.fn.svc.cluster.local
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   fn-couchdb.fn.svc.cluster.local
Address: 10.99.20.83

[root@rook-ceph-tools-9wbw2 /]# telnet fn-couchdb.fn.svc.cluster.local 5984
Trying 10.99.20.83...

^C

root@fn-couchdb-848f8bb7c9-dswkt:/# cat /etc/hosts
# Kubernetes-managed hosts file.
127.0.0.1       localhost
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.32.0.14      fn-couchdb-848f8bb7c9-dswkt
root@fn-couchdb-848f8bb7c9-dswkt:/# netstat -tupln
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:5984            0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:9100            0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:4369            0.0.0.0:*               LISTEN      -
tcp6       0      0 :::4369                 :::*                    LISTEN      -
root@fn-couchdb-848f8bb7c9-dswkt:/# telnet localhost 5984
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

root@enfn-wskadmin:/# READINESS_URL=http://enfn-couchdb.fn.svc.cluster.local:5984/ow_kube_couchdb_initialized_marker
root@enfn-wskadmin:/# while true; do echo 'checking CouchDB readiness'; wget -T 5 --spider $READINESS_URL --header="Authorization: Basic d2hpc2tfYWRtaW46TVRreU9UWXpNelJp"; result=$?; if [ $result -eq 0 ]; then echo 'Success: CouchDB is ready!'; break; fi; echo '...not ready yet; sleeping 3 seconds before retry'; sleep 3; done;
checking CouchDB readiness
Spider mode enabled. Check if remote file exists.
--2021-04-27 06:25:26--  http://enfn-couchdb.fn.svc.cluster.local:5984/ow_kube_couchdb_initialized_marker
Resolving enfn-couchdb.fn.svc.cluster.local (enfn-couchdb.fn.svc.cluster.local)... 10.99.20.83
Connecting to enfn-couchdb.fn.svc.cluster.local (enfn-couchdb.fn.svc.cluster.local)|10.99.20.83|:5984... connected.
HTTP request sent, awaiting response... 200 OK
Length: 442 [application/json]
Remote file exists.

Success: CouchDB is ready!
root@enfn-wskadmin:/# telnet enfn-couchdb.fn.svc.cluster.local 5984
Trying 10.99.20.83...
Connected to enfn-couchdb.fn.svc.cluster.local.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
root@enfn-wskadmin:/#

root@enfn-couchdb-848f8bb7c9-dswkt:/# READINESS_URL=http://enfn-couchdb.fn.svc.cluster.local:5984/ow_kube_couchdb_initialized_marker
root@enfn-couchdb-848f8bb7c9-dswkt:/# while true; do echo 'checking CouchDB readiness'; wget -T 5 --spider $READINESS_URL --header="Authorization: Basic d2hpc2tfYWRtaW46TVRreU9UWXpNelJp"; result=$?; if [ $result -eq 0 ]; then echo 'Success: CouchDB is ready!'; break; fi; echo '...not ready yet; sleeping 3 seconds before retry'; sleep 3; done;
checking CouchDB readiness
Spider mode enabled. Check if remote file exists.
--2021-04-27 06:23:40--  http://enfn-couchdb.fn.svc.cluster.local:5984/ow_kube_couchdb_initialized_marker
Resolving enfn-couchdb.fn.svc.cluster.local (enfn-couchdb.fn.svc.cluster.local)... 10.99.20.83
Connecting to enfn-couchdb.fn.svc.cluster.local (enfn-couchdb.fn.svc.cluster.local)|10.99.20.83|:5984... connected.
HTTP request sent, awaiting response... 200 OK
Length: 442 [application/json]
Remote file exists.

Success: CouchDB is ready!

[root@rook-ceph-tools-9wbw2 /]# READINESS_URL=http://enfn-couchdb.fn.svc.cluster.local:5984/ow_kube_couchdb_initialized_marker
[root@rook-ceph-tools-9wbw2 /]# while true; do echo 'checking CouchDB readiness'; wget -T 5 --spider $READINESS_URL --header="Authorization: Basic d2hpc2tfYWRtaW46TVRreU9UWXpNelJp"; result=$?; if [ $result -eq 0 ]; then echo 'Success: CouchDB is ready!'; break; fi; echo '...not ready yet; sleeping 3 seconds before retry'; sleep 3; done;
checking CouchDB readiness
Spider mode enabled. Check if remote file exists.
--2021-04-27 06:24:21--  http://enfn-couchdb.fn.svc.cluster.local:5984/ow_kube_couchdb_initialized_marker
Resolving enfn-couchdb.fn.svc.cluster.local (enfn-couchdb.fn.svc.cluster.local)... 10.99.20.83
Connecting to enfn-couchdb.fn.svc.cluster.local (enfn-couchdb.fn.svc.cluster.local)|10.99.20.83|:5984... failed: Connection timed out.
Retrying.

Spider mode enabled. Check if remote file exists.
--2021-04-27 06:24:27--  (try: 2)  http://enfn-couchdb.fn.svc.cluster.local:5984/ow_kube_couchdb_initialized_marker
Connecting to enfn-couchdb.fn.svc.cluster.local (enfn-couchdb.fn.svc.cluster.local)|10.99.20.83|:5984... failed: Connection timed out.
Retrying.

Spider mode enabled. Check if remote file exists.
--2021-04-27 06:24:34--  (try: 3)  http://enfn-couchdb.fn.svc.cluster.local:5984/ow_kube_couchdb_initialized_marker
Connecting to enfn-couchdb.fn.svc.cluster.local (enfn-couchdb.fn.svc.cluster.local)|10.99.20.83|:5984... ^C

In the case above, the wskadmin and couchdb pods were on the same node, whereas in other cases they were not (or might not have been) on the same node.

We will look into this more as time permits and update accordingly.

It took me some time to get back and recreate the issue. Our test environment was made up of three nodes:

NAME             STATUS   ROLES    AGE   VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME   LABELS
bm-k8s-master    Ready    master   22h   v1.18.18   10.99.97.118   <none>        CentOS Linux 7 (Core)   3.10.0-1160.24.1.el7.x86_64   docker://19.3.15    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=bm-k8s-master,kubernetes.io/os=linux,node-role.kubernetes.io/master=,openwhisk-role=invoker
bm-k8s-slave-2   Ready    <none>   18h   v1.18.18   10.99.97.116   <none>        CentOS Linux 7 (Core)   3.10.0-1160.24.1.el7.x86_64   docker://19.3.15    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=bm-k8s-slave-2,kubernetes.io/os=linux,openwhisk-role=invoker
bm-k8s-slave-3   Ready    <none>   18h   v1.18.18   10.99.97.115   <none>        CentOS Linux 7 (Core)   3.10.0-1160.24.1.el7.x86_64   docker://19.3.15    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=bm-k8s-slave-3,kubernetes.io/os=linux,openwhisk-role=invoker

The OpenWhisk landscape looks as follows:

kubectl -n fn get pod,svc,pvc -o wide

NAME                                      READY   STATUS      RESTARTS   AGE   IP           NODE             NOMINATED NODE   READINESS GATES
pod/fn-alarmprovider-5b9c7b9b8d-sh8nm   0/1     Init:0/1    0          65m   10.36.0.10   bm-k8s-slave-3   <none>           <none>
pod/fn-apigateway-74b487d8cb-hn5sb      1/1     Running     0          65m   10.44.0.8    bm-k8s-slave-2   <none>           <none>
pod/fn-controller-0                     0/1     Init:1/2    0          65m   10.44.0.9    bm-k8s-slave-2   <none>           <none>
pod/fn-couchdb-848f8bb7c9-dswkt         1/1     Running     0          65m   10.32.0.14   bm-k8s-master    <none>           <none>
pod/fn-grafana-78dc6fcdff-smvm8         1/1     Running     0          65m   10.36.0.9    bm-k8s-slave-3   <none>           <none>
pod/fn-init-couchdb-fjn8s               0/1     Completed   0          65m   10.32.0.10   bm-k8s-master    <none>           <none>
pod/fn-install-packages-qgplt           0/1     Init:0/1    0          65m   10.32.0.12   bm-k8s-master    <none>           <none>
pod/fn-invoker-0                        0/1     Init:0/1    0          65m   10.32.0.13   bm-k8s-master    <none>           <none>
pod/fn-kafka-0                          1/1     Running     0          65m   10.44.0.10   bm-k8s-slave-2   <none>           <none>
pod/fn-nginx-5d7f747b95-25f7q           0/1     Init:0/1    0          65m   10.44.0.7    bm-k8s-slave-2   <none>           <none>
pod/fn-prometheus-server-0              1/1     Running     0          65m   10.36.0.11   bm-k8s-slave-3   <none>           <none>
pod/fn-redis-6d9f5f56b5-5gbrq           1/1     Running     0          65m   10.32.0.15   bm-k8s-master    <none>           <none>
pod/fn-user-events-7bf9665968-rpgsl     1/1     Running     1          65m   10.32.0.11   bm-k8s-master    <none>           <none>
pod/fn-wskadmin                         1/1     Running     0          65m   10.32.0.9    bm-k8s-master    <none>           <none>
pod/fn-zookeeper-0                      1/1     Running     0          65m   10.36.0.12   bm-k8s-slave-3   <none>           <none>

NAME                             TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE   SELECTOR
service/fn-apigateway          ClusterIP      10.97.7.233      <none>        8080/TCP,9000/TCP            65m   name=fn-apigateway
service/fn-controller          ClusterIP      10.97.171.152    <none>        8080/TCP                     65m   name=fn-controller
service/fn-couchdb             ClusterIP      10.99.20.83      <none>        5984/TCP                     65m   name=fn-couchdb
service/fn-grafana             ClusterIP      10.109.145.44    <none>        3000/TCP                     65m   name=fn-grafana
service/fn-kafka               ClusterIP      None             <none>        9092/TCP                     65m   name=fn-kafka
service/fn-nginx               LoadBalancer   10.102.125.90    10.99.97.5    80:31887/TCP,443:31425/TCP   65m   name=fn-nginx
service/fn-prometheus-server   ClusterIP      10.103.139.29    <none>        9090/TCP                     65m   name=fn-prometheus-server
service/fn-redis               ClusterIP      10.106.167.149   <none>        6379/TCP                     65m   name=fn-redis
service/fn-user-events         ClusterIP      10.108.197.26    <none>        9095/TCP                     65m   name=fn-user-events
service/fn-zookeeper           ClusterIP      None             <none>        2181/TCP,2888/TCP,3888/TCP   65m   name=fn-zookeeper

NAME                                               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE   VOLUMEMODE
persistentvolumeclaim/fn-alarmprovider-pvc       Bound    pvc-9372adbd-473c-47e6-98aa-1be2d9d24468   1Gi        RWO            rook-ceph-block   65m   Filesystem
persistentvolumeclaim/fn-couchdb-pvc             Bound    pvc-120389ef-5906-4965-a715-ffee967abcae   2Gi        RWO            rook-ceph-block   65m   Filesystem
persistentvolumeclaim/fn-kafka-pvc               Bound    pvc-156c82de-36b4-4f68-94cb-49e72698ab51   512Mi      RWO            rook-ceph-block   65m   Filesystem
persistentvolumeclaim/fn-prometheus-pvc          Bound    pvc-391c4265-0896-4e59-9c1f-539b479cbbeb   1Gi        RWO            rook-ceph-block   65m   Filesystem
persistentvolumeclaim/fn-redis-pvc               Bound    pvc-f05dc3f5-602e-49a4-960d-3ddbff668773   256Mi      RWO            rook-ceph-block   65m   Filesystem
persistentvolumeclaim/fn-zookeeper-pvc-data      Bound    pvc-311fe633-1100-4ba9-b4e1-64b73b232617   256Mi      RWO            rook-ceph-block   65m   Filesystem
persistentvolumeclaim/fn-zookeeper-pvc-datalog   Bound    pvc-066dd5e7-e301-48a7-b690-7b4f289df7b4   256Mi      RWO            rook-ceph-block   65m   Filesystem

Nonetheless, we can confirm that after switching to Calico we could successfully deploy OpenWhisk in the same environment.
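
A minimal spot check after switching the CNI (assuming a throwaway busybox pod is acceptable) could be to hit the CouchDB service from a fresh pod:

kubectl run nettest --rm -it --restart=Never --image=busybox -- \
  wget -T 5 -qO- http://fn-couchdb.fn.svc.cluster.local:5984/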