apache / openwhisk-deploy-kube

The Apache OpenWhisk Kubernetes Deployment repository supports deploying the Apache OpenWhisk system on Kubernetes and OpenShift clusters.
https://openwhisk.apache.org/
Apache License 2.0

Cannot deploy on kind because of couchdb-init failure #673

Closed: grussorusso closed this issue 3 years ago

grussorusso commented 3 years ago

I am trying to deploy OpenWhisk on my Arch Linux machine using kind. The cluster has 2 worker nodes, which I have labelled according to the official guide. I deploy OpenWhisk using the official helm chart.

This is the output of kubectl get pods -n openwhisk:

NAME                                   READY   STATUS      RESTARTS   AGE
owdev-alarmprovider-5b86cb64ff-rm498   0/1     Init:0/1    0          22m
owdev-apigateway-bccbbcd67-pd79z       1/1     Running     0          22m
owdev-controller-0                     0/1     Init:1/2    0          22m
owdev-couchdb-584676b956-vctzv         1/1     Running     0          22m
owdev-gen-certs-xmxh7                  0/1     Completed   0          22m
owdev-init-couchdb-7fwnl               0/1     Error       0          20m
owdev-init-couchdb-d5dsv               0/1     Error       0          17m
owdev-init-couchdb-sqqhp               0/1     Error       0          22m
owdev-init-couchdb-wmcfj               0/1     Error       0          19m
owdev-install-packages-hqsg5           0/1     Init:0/1    0          22m
owdev-invoker-0                        0/1     Init:0/1    0          22m
owdev-kafka-0                          1/1     Running     0          22m
owdev-kafkaprovider-5574d4bf5f-ghdtk   0/1     Init:0/1    0          22m
owdev-nginx-86749d59cb-54c6l           0/1     Init:0/1    0          22m
owdev-redis-d65649c5b-xg6gh            1/1     Running     0          22m
owdev-wskadmin                         1/1     Running     0          22m
owdev-zookeeper-0                      1/1     Running     0          22m

and kubectl logs -n openwhisk owdev-init-couchdb-sqqhp:

Cloning into '/openwhisk'...
/openwhisk /
Note: checking out '1.0.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 2c621c07 fix start.sh to work on macos (#5019)
/
/openwhisk/ansible /
 [WARNING]: Unable to parse /openwhisk/ansible/environments/local as an
inventory source
 [WARNING]: No inventory was parsed, only implicit localhost is available
 [WARNING]: provided hosts list is empty, only localhost is available. Note
that the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Gathering Facts] *********************************************************
Friday 12 February 2021  14:41:10 +0000 (0:00:00.120)       0:00:00.120 ******* 
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: TimeoutError: Timer expired after 60 seconds
fatal: [localhost]: FAILED! => {"changed": false, "cmd": "/bin/findmnt --list --noheadings --notruncate", "msg": "Timer expired after 60 seconds", "rc": 257}

[FAILED]
> /bin/findmnt --list --noheadings --notruncate
Timer expired after 60 seconds

PLAY RECAP *********************************************************************
localhost                  : ok=0    changed=0    unreachable=0    failed=1   

Friday 12 February 2021  14:42:10 +0000 (0:01:00.469)       0:01:00.589 ******* 
=============================================================================== 
Gathering Facts -------------------------------------------------------- 60.47s

I verified that the issue does not appear when using Minikube (using both Docker and containerd as container runtime), so I think the issue is somehow related to kind.

My whisk.yml configuration is identical to that shown in the guide for deploying OW on kind (except for the apiHostName, which I set as indicated).

Thanks in advance for any hints.

dgrove-oss commented 3 years ago

Hi. What Kubernetes version (v1.18, v1.17, etc.) are you running with kind? Our automated testing currently covers v1.16, v1.17, and v1.18. We need to enable testing for v1.19 and v1.20 in travis-ci, but haven't gotten around to it yet...

grussorusso commented 3 years ago

Thanks for your reply. Indeed, kind automatically picked v1.20. Unfortunately, I get the same error with v1.18.15.

dgrove-oss commented 3 years ago

It worked for me last night using kind 0.10 on macOS Docker Desktop (a.k.a. my laptop) with Kubernetes v1.18.5. But I realized that I had deployed the latest chart from git, not the 1.0.0 chart from the helm repo. I will try that later tonight just to make sure it isn't some problem with the chart itself.

dgrove-oss commented 3 years ago

Probably not surprising, but on my macOS / Docker Desktop setup, installing the 1.0.0 helm chart on kind 0.10 works. Here's the beginning of the log from the init-couchdb job.

Daves-MacBook-Pro:kar dgrove$ kubectl logs jobs/owdev-init-couchdb -n openwhisk
Cloning into '/openwhisk'...
/openwhisk /
Note: checking out '1.0.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 2c621c07 fix start.sh to work on macos (#5019)
/
/openwhisk/ansible /
 [WARNING]: Unable to parse /openwhisk/ansible/environments/local as an
inventory source
 [WARNING]: No inventory was parsed, only implicit localhost is available
 [WARNING]: provided hosts list is empty, only localhost is available. Note
that the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Gathering Facts] *********************************************************
Wednesday 17 February 2021  01:37:08 +0000 (0:00:00.175)       0:00:00.176 **** 
ok: [localhost]

TASK [gen hosts if 'local' env is used] ****************************************
Wednesday 17 February 2021  01:37:09 +0000 (0:00:01.093)       0:00:01.269 **** 
changed: [localhost -> localhost]

TASK [find the ip of docker-machine] *******************************************
Wednesday 17 February 2021  01:37:09 +0000 (0:00:00.752)       0:00:02.022 **** 
skipping: [localhost]

TASK [get the docker-machine ip] ***********************************************
Wednesday 17 February 2021  01:37:09 +0000 (0:00:00.053)       0:00:02.076 **** 
skipping: [localhost]

TASK [gen hosts for docker-machine] ********************************************
Wednesday 17 February 2021  01:37:10 +0000 (0:00:00.068)       0:00:02.144 **** 
skipping: [localhost]

TASK [gen hosts for Jenkins] ***************************************************
Wednesday 17 February 2021  01:37:10 +0000 (0:00:00.082)       0:00:02.226 **** 
skipping: [localhost]

TASK [check if db_local.ini exists?] *******************************************
Wednesday 17 February 2021  01:37:10 +0000 (0:00:00.084)       0:00:02.311 **** 
ok: [localhost]

I'm not sure exactly what Ansible does in its initial Gathering Facts stage, but the thing to do is probably to run that pod interactively, execute the commands manually, and see if you can get a better error message.
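The interactive check suggested above could look roughly like this (the pod name is taken from the listing earlier in the thread and will differ per deployment; this is a sketch, not a transcript):

```shell
# From the host, open a shell in a failing init pod (name will differ per run):
#   kubectl exec -it -n openwhisk owdev-init-couchdb-sqqhp -- /bin/bash

# Inside the container, try the exact command Ansible timed out on:
command -v findmnt >/dev/null && timeout 10 findmnt --list --noheadings --notruncate | head -n 3

# The file-descriptor limit turns out to be the key clue (see below in the
# thread): Ansible's fact gathering hangs when this value is huge.
ulimit -n
```

If `findmnt` itself returns promptly here, as it did for the reporter, the timeout is happening somewhere else in the stack.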

grussorusso commented 3 years ago

Thanks again for checking. I followed your suggestion and tried executing the command on which Ansible fails (/bin/findmnt --list --noheadings --notruncate) in the container via kubectl exec, and it works without issues... However, the pod still eventually enters the Failed state with the same output.

At this point, I am even more confused about the problem. It is probably related to my own environment (maybe the Docker version? I am using Docker 20.10.3 on Linux). I will verify whether the same thing happens on a different Linux machine as soon as I have some time. Anyway, although annoying, the issue is not blocking for me, as I managed to deploy OpenWhisk on Minikube.

grussorusso commented 3 years ago

I confirm everything works on a different Linux machine. So the issue is caused by something in my own configuration, though I haven't figured out what exactly.

s117 commented 3 years ago

> Thanks again for checking. I followed your suggestion and tried executing the command on which Ansible fails (/bin/findmnt --list --noheadings --notruncate) on the container via kubectl exec. And it works without issues... However, the pod eventually enters the Failed state with the same output.
>
> At this point, I am even more confused about the problem. It is probably related to my own environment (maybe Docker version? I am using Docker 20.10.3 on Linux). I will verify if the same thing happens on a different Linux machine, as soon as I have some time to do so. Anyway, although annoying, the issue is not blocking for me as I managed to deploy OpenWhisk on Minikube.

I just came across the same problem. It turns out the timeout is caused by Python rather than /bin/findmnt. Some related upstream tickets:
https://github.com/ansible/ansible/issues/24228#issuecomment-409693926
https://bugs.python.org/issue1663329
https://bugs.python.org/issue11284

My system also runs Arch Linux, and inside the container ulimit -n returns a very large value. My workaround is to modify helm/openwhisk/configMapFiles/initCouchDB/initdb.sh so that it applies this patch to /usr/local/lib/python2.7/dist-packages/ansible/plugins/shell/__init__.py before running Ansible.
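The linked tickets boil down to Python 2's subprocess machinery attempting to close every possible file descriptor up to the fd limit, one call at a time, before exec. A back-of-the-envelope sketch of why that blows Ansible's 60-second timer (the 0.5 microsecond per-attempt cost and the ~1e9 limit are illustrative assumptions; the exact value `ulimit -n` reports varies by Docker and host configuration):

```python
import resource

def fd_close_loop_seconds(fd_limit, per_attempt_us=0.5):
    """Rough lower bound on the time Python 2's pre-exec loop spends
    attempting to close fd_limit file descriptors, one call each.
    The per-attempt cost is an assumed illustrative figure."""
    return fd_limit * per_attempt_us / 1e6

# The soft RLIMIT_NOFILE is the upper bound of that close loop.
soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft fd limit here:", soft)

print(fd_close_loop_seconds(1024))           # typical limit: well under 1 ms
print(fd_close_loop_seconds(1_073_741_816))  # a ~1e9 limit: hundreds of seconds
```

With a sane limit the loop is negligible; with the billion-plus limits some Docker-on-Arch setups hand out, the loop alone exceeds the 60-second fact-gathering timeout, which is why capping the limit (or patching the plugin) makes the error disappear.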

Reylak commented 1 year ago

Part of the "bug" is also the use of Python 2: OpenWhisk uses the CouchDB 2.3.1 Docker image, which is based on Debian Buster (slim), where the only available variant of Python is Python 2...

I won't try to run OpenWhisk with CouchDB 3 because I have no idea what that would imply.

However, it means another valid workaround is to "fix" the environment where Ansible runs (i.e., the CouchDB container when initializing the DB) by adding ulimit -n 4096 to the init script at "helm/openwhisk/configMapFiles/initCouchDB/initdb.sh". Tested and approved on a roughly up-to-date Arch Linux.
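The workaround can be as small as a one-line addition near the top of the init script; a sketch (the exact placement inside initdb.sh is an assumption):

```shell
# Hypothetical excerpt of helm/openwhisk/configMapFiles/initCouchDB/initdb.sh
# with the workaround applied: cap the soft fd limit before Ansible runs, so
# Python 2's per-fd close loop stays short.
ulimit -S -n 4096 2>/dev/null || true  # thread uses `ulimit -n 4096`; -S touches only the soft limit

echo "fd limit for this shell: $(ulimit -n)"

# ... the rest of initdb.sh (git checkout, ansible-playbook calls, etc.) continues here ...
```

Because `ulimit` only affects the current shell and its children, the cap applies to the ansible-playbook invocations later in the script without touching anything else in the container.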

Do you think this could be a valid fix that could be merged, as a way to deal with a wart from the obsolete Python 2? I find it cleaner, clearer, and easier to implement than patching an Ansible plugin file.

rabbah commented 1 year ago

If we're still using Python 2, it's definitely better to move to v3 instead.

Reylak commented 1 year ago

Sure, but as I said, the use of Python 2 comes from the CouchDB 2.3.1 Docker image. I don't know whether OpenWhisk can work with v3, which hopefully is based on a more up-to-date Debian image.

Samxamnom commented 1 year ago

> However, it means another valid workaround is to "fix" the environment where Ansible run (i.e., the CouchDB container when initializing the DB) by adding ulimit -n 4096 to the init script at "helm/openwhisk/configMapFiles/initCouchDB/initdb.sh". Tested and approved on roughly up-to-date Arch Linux.
>
> Do you think this could be a valid fix to this problem, that could be merged? As a way to deal with a wart from the obsolete Python 2. I find it cleaner, clearer, and easier to implement than to patch some Ansible plugin file

I came across the same issue on Arch Linux; this fix also worked for me.