It seems that this line waits forever. Why is CouchDB not found?
This behaviour is random; sometimes the error doesn't occur. After reducing the number of Kubernetes nodes to one, I can create and invoke actions.
Try kubectl describe pod on the CouchDB pod to see some details:
kubectl describe pod owdev-couchdb-595b88565-8dtkj -n openwhisk
Name: owdev-couchdb-595b88565-8dtkj
Namespace: openwhisk
Priority: 0
Node: xxx-w1-vm/192.168.1.37
Start Time: Fri, 05 Mar 2021 15:09:45 +0000
Labels: app=owdev-openwhisk
chart=openwhisk-1.0.0
heritage=Helm
name=owdev-couchdb
pod-template-hash=595b88565
release=owdev
Annotations: <none>
Status: Running
IP: 10.44.0.6
IPs:
IP: 10.44.0.6
Controlled By: ReplicaSet/owdev-couchdb-595b88565
Containers:
couchdb:
Container ID: docker://158ea6fce985a5de40a38cf2e21b877f04e2794a5961d5c7e93004b0f31f358c
Image: apache/couchdb:2.3
Image ID: docker-pullable://apache/couchdb@sha256:9f895c8ae371cb895541e53100e039ac6ae5d30f6f0b199e8470d81d523537ad
Port: 5984/TCP
Host Port: 0/TCP
State: Running
Started: Fri, 05 Mar 2021 15:11:09 +0000
Ready: True
Restart Count: 0
Environment:
COUCHDB_USER: <set to the key 'db_username' in secret 'owdev-db.auth'> Optional: false
COUCHDB_PASSWORD: <set to the key 'db_password' in secret 'owdev-db.auth'> Optional: false
NODENAME: couchdb0
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-trpk6 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-trpk6:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-trpk6
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
openwhisk-role=core:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m43s default-scheduler Successfully assigned openwhisk/owdev-couchdb-595b88565-8dtkj to xxx-w1-vm
Normal Pulling 3m29s kubelet Pulling image "apache/couchdb:2.3"
Normal Pulled 2m28s kubelet Successfully pulled image "apache/couchdb:2.3" in 1m1.078872118s
Normal Created 2m19s kubelet Created container couchdb
Normal Started 2m19s kubelet Started container couchdb
I would investigate whether something is misconfigured that is causing the curl command at https://github.com/apache/openwhisk-deploy-kube/blob/1.0.0/helm/openwhisk/configMapFiles/initCouchDB/initdb.sh#L51 to fail. Try doing a kubectl exec into the pod for the CouchDB init job and running that same curl manually.
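A minimal version of that check, run from inside the init pod (the pod name suffix here is a placeholder; the init image should ship curl, since the script itself uses it):
kubectl exec -it owdev-init-couchdb-xxxxx -n openwhisk -- sh
curl --verbose --output /dev/null http://owdev-couchdb.openwhisk.svc.cluster.local:5984/_utils
Using --verbose instead of --silent shows whether it hangs on DNS resolution or on the TCP connect.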
From a shell inside the CouchDB init pod, the following command (corresponding to curl --output /dev/null --silent $DB_PROTOCOL://$DB_HOST:$DB_PORT/_utils) hangs:
curl --output /dev/null http://owdev-couchdb.openwhisk.svc.cluster.local:5984/_utils
whereas the same command, using the IP of the CouchDB pod, works:
curl --output /dev/null http://10.44.0.4:5984/_utils
So it seems traffic to the CouchDB service's cluster IP (which owdev-couchdb.openwhisk.svc.cluster.local resolves to) is not routed to the backing pod.
This seems to happen randomly after installing OpenWhisk: sometimes the CouchDB init pod completes (after which I have other availability problems when creating functions, but that's for another issue...).
I'm not sure, but I suspect this issue doesn't occur when the CouchDB pod and the CouchDB init pod are scheduled on the same Kubernetes node.
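One way to separate DNS from service routing (a sketch; <cluster-ip> is a placeholder for the value the first command prints) is to curl the service's cluster IP directly. If the cluster IP hangs while the pod IP works, DNS is fine and the problem is in kube-proxy/CNI service routing:
kubectl get svc owdev-couchdb -n openwhisk -o jsonpath='{.spec.clusterIP}'
curl --output /dev/null http://<cluster-ip>:5984/_utils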
I've checked the iptables rules for the CouchDB service; the routing seems to be in place:
sudo iptables-save | grep couchdb
-A KUBE-SEP-SNMSYAE7RYOIOXBL -s 10.44.0.4/32 -m comment --comment "openwhisk/owdev-couchdb:couchdb" -j KUBE-MARK-MASQ
-A KUBE-SEP-SNMSYAE7RYOIOXBL -p tcp -m comment --comment "openwhisk/owdev-couchdb:couchdb" -m tcp -j DNAT --to-destination 10.44.0.4:5984
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.99.12.131/32 -p tcp -m comment --comment "openwhisk/owdev-couchdb:couchdb cluster IP" -m tcp --dport 5984 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.99.12.131/32 -p tcp -m comment --comment "openwhisk/owdev-couchdb:couchdb cluster IP" -m tcp --dport 5984 -j KUBE-SVC-4DTNEZVJQQUBR77C
-A KUBE-SVC-4DTNEZVJQQUBR77C -m comment --comment "openwhisk/owdev-couchdb:couchdb" -j KUBE-SEP-SNMSYAE7RYOIOXBL
and the endpoints are in place:
kubectl get endpoints -n openwhisk
NAME ENDPOINTS AGE
owdev-apigateway 10.44.0.5:9000,10.44.0.5:8080 29m
owdev-controller 29m
owdev-couchdb 10.44.0.4:5984 29m
owdev-kafka 10.36.0.3:9092 29m
owdev-nginx 29m
owdev-redis 10.36.0.1:6379 29m
owdev-zookeeper 10.44.0.7:2181,10.44.0.7:2888,10.44.0.7:3888 29m
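Since this cluster uses Weave as the CNI, Weave's own view of the mesh also seems worth checking (assuming the standard weave-net DaemonSet in kube-system and its documented status command; the pod name suffix is a placeholder):
kubectl get pods -n kube-system -l name=weave-net -o wide
kubectl exec -n kube-system weave-net-xxxxx -c weave -- /home/weave/weave --local status connections
Broken or missing peer connections there would explain why only cross-node service traffic hangs.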
Do you know where to look to get the CouchDB cluster IP service routed to the backing pod? Any ideas?
Thanks
It seems the problem was fixed after I replaced the CNI layer of Kubernetes: I was using Weave, and I switched to Calico.
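For anyone attempting the same swap, it amounted to roughly the following (a rough sketch, not a tested recipe: the manifest URLs are the ones from the Weave and Calico docs of that era, the pod CIDR may need to match your kubeadm setup, and nodes typically need stale CNI config removed from /etc/cni/net.d plus a kubelet restart or reboot):
kubectl delete -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml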
Strange how this is an issue for the openwhisk namespace only. We faced this issue and found that the init-db job could not connect to couchdb.openwhisk.svc.cluster.local, and curl http://couchdb.openwhisk.svc.cluster.local:5984 timed out when run from another pod such as wskadmin. We could resolve the DNS properly, but connections to CouchDB failed. When we deploy OpenWhisk using the Helm charts on Kubernetes (our own setup on VMs), it requires other services such as dynamic volumes (which we provide using Rook or NFS), and those services do not face this service-communication issue. I guess we need to look into this more deeply, as replacing Weave with Calico might not always be an option.
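A quick way to probe the cross-node hypothesis (a sketch: <node-name> is a placeholder for one of the worker nodes, busybox's built-in wget is assumed, and --overrides pins the pod via spec.nodeName) is to run a throwaway pod on a chosen node and probe the service from it:
kubectl run nettest -n openwhisk --image=busybox:1.33 --restart=Never --overrides='{"spec":{"nodeName":"<node-name>"}}' -- wget -T 5 --spider http://couchdb.openwhisk.svc.cluster.local:5984/
kubectl logs -n openwhisk nettest
Running it once pinned to the node hosting CouchDB and once pinned to another node should show whether only cross-node connections time out.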
Is the port open/accessible?
Yes, they were accessible in a way: internally within the pod, and between pods on the same node. Sharing a case below, where we were running K8s 1.18.18 with Weave 2.8.1 on CentOS 7 (kernel 3.10.0-1160.24.1).
After reproducing this issue, we went back to square one: this might not be an issue with the OpenWhisk deployment or the ow namespace only. Sharing a case below showing where we can connect (telnet) to pods and where we cannot:
[root@rook-ceph-tools-9wbw2 /]# nslookup rook-ceph-mgr.rook-ceph.svc.cluster.local
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: rook-ceph-mgr.rook-ceph.svc.cluster.local
Address: 10.101.148.210
[root@rook-ceph-tools-9wbw2 /]# telnet rook-ceph-mgr.rook-ceph.svc.cluster.local 9283
Trying 10.101.148.210...
Connected to rook-ceph-mgr.rook-ceph.svc.cluster.local.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
[root@rook-ceph-tools-9wbw2 /]# nslookup fn-couchdb.fn.svc.cluster.local
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: fn-couchdb.fn.svc.cluster.local
Address: 10.99.20.83
[root@rook-ceph-tools-9wbw2 /]# telnet fn-couchdb.fn.svc.cluster.local 5984
Trying 10.99.20.83...
^C
root@fn-couchdb-848f8bb7c9-dswkt:/# cat /etc/hosts
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.32.0.14 fn-couchdb-848f8bb7c9-dswkt
root@fn-couchdb-848f8bb7c9-dswkt:/# netstat -tupln
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:5984 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:9100 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:4369 0.0.0.0:* LISTEN -
tcp6 0 0 :::4369 :::* LISTEN -
root@fn-couchdb-848f8bb7c9-dswkt:/# telnet localhost 5984
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
root@enfn-wskadmin:/# READINESS_URL=http://enfn-couchdb.fn.svc.cluster.local:5984/ow_kube_couchdb_initialized_marker
root@enfn-wskadmin:/# while true; do echo 'checking CouchDB readiness'; wget -T 5 --spider $READINESS_URL --header="Authorization: Basic d2hpc2tfYWRtaW46TVRreU9UWXpNelJp"; result=$?; if [ $result -eq 0 ]; then echo 'Success: CouchDB is ready!'; break; fi; echo '...not ready yet; sleeping 3 seconds before retry'; sleep 3; done;
checking CouchDB readiness
Spider mode enabled. Check if remote file exists.
--2021-04-27 06:25:26-- http://enfn-couchdb.fn.svc.cluster.local:5984/ow_kube_couchdb_initialized_marker
Resolving enfn-couchdb.fn.svc.cluster.local (enfn-couchdb.fn.svc.cluster.local)... 10.99.20.83
Connecting to enfn-couchdb.fn.svc.cluster.local (enfn-couchdb.fn.svc.cluster.local)|10.99.20.83|:5984... connected.
HTTP request sent, awaiting response... 200 OK
Length: 442 [application/json]
Remote file exists.
Success: CouchDB is ready!
root@enfn-wskadmin:/# telnet enfn-couchdb.fn.svc.cluster.local 5984
Trying 10.99.20.83...
Connected to enfn-couchdb.fn.svc.cluster.local.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
root@enfn-wskadmin:/#
root@enfn-couchdb-848f8bb7c9-dswkt:/# READINESS_URL=http://enfn-couchdb.fn.svc.cluster.local:5984/ow_kube_couchdb_initialized_marker
root@enfn-couchdb-848f8bb7c9-dswkt:/# while true; do echo 'checking CouchDB readiness'; wget -T 5 --spider $READINESS_URL --header="Authorization: Basic d2hpc2tfYWRtaW46TVRreU9UWXpNelJp"; result=$?; if [ $result -eq 0 ]; then echo 'Success: CouchDB is ready!'; break; fi; echo '...not ready yet; sleeping 3 seconds before retry'; sleep 3; done;
checking CouchDB readiness
Spider mode enabled. Check if remote file exists.
--2021-04-27 06:23:40-- http://enfn-couchdb.fn.svc.cluster.local:5984/ow_kube_couchdb_initialized_marker
Resolving enfn-couchdb.fn.svc.cluster.local (enfn-couchdb.fn.svc.cluster.local)... 10.99.20.83
Connecting to enfn-couchdb.fn.svc.cluster.local (enfn-couchdb.fn.svc.cluster.local)|10.99.20.83|:5984... connected.
HTTP request sent, awaiting response... 200 OK
Length: 442 [application/json]
Remote file exists.
Success: CouchDB is ready!
[root@rook-ceph-tools-9wbw2 /]# READINESS_URL=http://enfn-couchdb.fn.svc.cluster.local:5984/ow_kube_couchdb_initialized_marker
[root@rook-ceph-tools-9wbw2 /]# while true; do echo 'checking CouchDB readiness'; wget -T 5 --spider $READINESS_URL --header="Authorization: Basic d2hpc2tfYWRtaW46TVRreU9UWXpNelJp"; result=$?; if [ $result -eq 0 ]; then echo 'Success: CouchDB is ready!'; break; fi; echo '...not ready yet; sleeping 3 seconds before retry'; sleep 3; done;
checking CouchDB readiness
Spider mode enabled. Check if remote file exists.
--2021-04-27 06:24:21-- http://enfn-couchdb.fn.svc.cluster.local:5984/ow_kube_couchdb_initialized_marker
Resolving enfn-couchdb.fn.svc.cluster.local (enfn-couchdb.fn.svc.cluster.local)... 10.99.20.83
Connecting to enfn-couchdb.fn.svc.cluster.local (enfn-couchdb.fn.svc.cluster.local)|10.99.20.83|:5984... failed: Connection timed out.
Retrying.
Spider mode enabled. Check if remote file exists.
--2021-04-27 06:24:27-- (try: 2) http://enfn-couchdb.fn.svc.cluster.local:5984/ow_kube_couchdb_initialized_marker
Connecting to enfn-couchdb.fn.svc.cluster.local (enfn-couchdb.fn.svc.cluster.local)|10.99.20.83|:5984... failed: Connection timed out.
Retrying.
Spider mode enabled. Check if remote file exists.
--2021-04-27 06:24:34-- (try: 3) http://enfn-couchdb.fn.svc.cluster.local:5984/ow_kube_couchdb_initialized_marker
Connecting to enfn-couchdb.fn.svc.cluster.local (enfn-couchdb.fn.svc.cluster.local)|10.99.20.83|:5984... ^C
In the case above, the wskadmin and couchdb pods were on the same node, whereas in other cases they were not (they might not have been on the same node).
We will look into this more as time permits and update accordingly.
It took me some time to get back and recreate the issue. Our test environment was made up of three nodes:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME LABELS
bm-k8s-master Ready master 22h v1.18.18 10.99.97.118 <none> CentOS Linux 7 (Core) 3.10.0-1160.24.1.el7.x86_64 docker://19.3.15 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=bm-k8s-master,kubernetes.io/os=linux,node-role.kubernetes.io/master=,openwhisk-role=invoker
bm-k8s-slave-2 Ready <none> 18h v1.18.18 10.99.97.116 <none> CentOS Linux 7 (Core) 3.10.0-1160.24.1.el7.x86_64 docker://19.3.15 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=bm-k8s-slave-2,kubernetes.io/os=linux,openwhisk-role=invoker
bm-k8s-slave-3 Ready <none> 18h v1.18.18 10.99.97.115 <none> CentOS Linux 7 (Core) 3.10.0-1160.24.1.el7.x86_64 docker://19.3.15 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=bm-k8s-slave-3,kubernetes.io/os=linux,openwhisk-role=invoker
The OpenWhisk landscape looks as follows:
kubectl -n fn get pod,svc,pvc -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/fn-alarmprovider-5b9c7b9b8d-sh8nm 0/1 Init:0/1 0 65m 10.36.0.10 bm-k8s-slave-3 <none> <none>
pod/fn-apigateway-74b487d8cb-hn5sb 1/1 Running 0 65m 10.44.0.8 bm-k8s-slave-2 <none> <none>
pod/fn-controller-0 0/1 Init:1/2 0 65m 10.44.0.9 bm-k8s-slave-2 <none> <none>
pod/fn-couchdb-848f8bb7c9-dswkt 1/1 Running 0 65m 10.32.0.14 bm-k8s-master <none> <none>
pod/fn-grafana-78dc6fcdff-smvm8 1/1 Running 0 65m 10.36.0.9 bm-k8s-slave-3 <none> <none>
pod/fn-init-couchdb-fjn8s 0/1 Completed 0 65m 10.32.0.10 bm-k8s-master <none> <none>
pod/fn-install-packages-qgplt 0/1 Init:0/1 0 65m 10.32.0.12 bm-k8s-master <none> <none>
pod/fn-invoker-0 0/1 Init:0/1 0 65m 10.32.0.13 bm-k8s-master <none> <none>
pod/fn-kafka-0 1/1 Running 0 65m 10.44.0.10 bm-k8s-slave-2 <none> <none>
pod/fn-nginx-5d7f747b95-25f7q 0/1 Init:0/1 0 65m 10.44.0.7 bm-k8s-slave-2 <none> <none>
pod/fn-prometheus-server-0 1/1 Running 0 65m 10.36.0.11 bm-k8s-slave-3 <none> <none>
pod/fn-redis-6d9f5f56b5-5gbrq 1/1 Running 0 65m 10.32.0.15 bm-k8s-master <none> <none>
pod/fn-user-events-7bf9665968-rpgsl 1/1 Running 1 65m 10.32.0.11 bm-k8s-master <none> <none>
pod/fn-wskadmin 1/1 Running 0 65m 10.32.0.9 bm-k8s-master <none> <none>
pod/fn-zookeeper-0 1/1 Running 0 65m 10.36.0.12 bm-k8s-slave-3 <none> <none>
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/fn-apigateway ClusterIP 10.97.7.233 <none> 8080/TCP,9000/TCP 65m name=fn-apigateway
service/fn-controller ClusterIP 10.97.171.152 <none> 8080/TCP 65m name=fn-controller
service/fn-couchdb ClusterIP 10.99.20.83 <none> 5984/TCP 65m name=fn-couchdb
service/fn-grafana ClusterIP 10.109.145.44 <none> 3000/TCP 65m name=fn-grafana
service/fn-kafka ClusterIP None <none> 9092/TCP 65m name=fn-kafka
service/fn-nginx LoadBalancer 10.102.125.90 10.99.97.5 80:31887/TCP,443:31425/TCP 65m name=fn-nginx
service/fn-prometheus-server ClusterIP 10.103.139.29 <none> 9090/TCP 65m name=fn-prometheus-server
service/fn-redis ClusterIP 10.106.167.149 <none> 6379/TCP 65m name=fn-redis
service/fn-user-events ClusterIP 10.108.197.26 <none> 9095/TCP 65m name=fn-user-events
service/fn-zookeeper ClusterIP None <none> 2181/TCP,2888/TCP,3888/TCP 65m name=fn-zookeeper
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE VOLUMEMODE
persistentvolumeclaim/fn-alarmprovider-pvc Bound pvc-9372adbd-473c-47e6-98aa-1be2d9d24468 1Gi RWO rook-ceph-block 65m Filesystem
persistentvolumeclaim/fn-couchdb-pvc Bound pvc-120389ef-5906-4965-a715-ffee967abcae 2Gi RWO rook-ceph-block 65m Filesystem
persistentvolumeclaim/fn-kafka-pvc Bound pvc-156c82de-36b4-4f68-94cb-49e72698ab51 512Mi RWO rook-ceph-block 65m Filesystem
persistentvolumeclaim/fn-prometheus-pvc Bound pvc-391c4265-0896-4e59-9c1f-539b479cbbeb 1Gi RWO rook-ceph-block 65m Filesystem
persistentvolumeclaim/fn-redis-pvc Bound pvc-f05dc3f5-602e-49a4-960d-3ddbff668773 256Mi RWO rook-ceph-block 65m Filesystem
persistentvolumeclaim/fn-zookeeper-pvc-data Bound pvc-311fe633-1100-4ba9-b4e1-64b73b232617 256Mi RWO rook-ceph-block 65m Filesystem
persistentvolumeclaim/fn-zookeeper-pvc-datalog Bound pvc-066dd5e7-e301-48a7-b690-7b4f289df7b4 256Mi RWO rook-ceph-block 65m Filesystem
Nonetheless, we can confirm that after switching to Calico we could successfully deploy OpenWhisk in the same environment.
Hello, I'm struggling with deploying OpenWhisk on a Kubernetes cluster. It seems the owdev-init-couchdb pod doesn't complete, as it waits for CouchDB.
The last line of
kubectl describe pod owdev-controller-0 -n openwhisk
says
Normal Started 3m23s kubelet Started container wait-for-couchdb
I disabled persistence in mycluster.yaml.
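For reference, I believe that corresponds to the following in mycluster.yaml (as I read the 1.0.0 chart's values; with this set, the chart should not create PersistentVolumeClaims):
k8s:
  persistence:
    enabled: false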