contiv / install

Contiv Installer
https://contiv.github.io

Contiv installer - Intermittent Install failures seen w/ latest 1.1.7 installer bits #340

Open rkharya opened 6 years ago

rkharya commented 6 years ago

Description

v2Plugin installation fails intermittently; the failure has been seen multiple times on 2 different setups. The error messages differ between Contiv master and Contiv worker nodes.

Expected Behavior

Contiv install should succeed on all Master/Worker Nodes w/o any errors.

Observed Behavior

The issue is intermittent, but one pattern is consistent: after a complete clean-up of the Contiv bits from the Docker Swarm cluster, the first installation attempt fails and a subsequent retry eventually succeeds in installing Contiv. This behaviour is seen only with the latest code changes made about 20 days ago on the 1.1.7 release. We did not see this issue during the CVD validation cycle, up to the CVD release on Dec 18th, 2017.
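Since a retry eventually succeeds, a stopgap while the root cause is investigated is to wrap the failing step in a retry loop. This is a minimal sketch only: `run_with_retry` is a hypothetical helper, not part of the Contiv installer, and the attempt count and delay are arbitrary placeholders.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=30.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits 0 or the attempts are exhausted.

    Hypothetical helper, not part of the Contiv installer.
    Returns (attempt_number, returncode) for the last run.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt, 0
        if attempt < attempts:
            # Wait before the next try; the right delay is unknown here.
            sleep(delay)
    return attempts, result.returncode
```

`cmd` is an argument list such as the installer invocation from the reproduction steps. The `runner` and `sleep` parameters exist only so the loop can be exercised without touching a real cluster.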

Master Node install failures -

TASK [contiv_network : install v2plugin on master nodes] ***
fatal: [node2]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.122.63 control_url=10.65.122.63:9999 vxlan_port=8472 iflist=eno6 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=master fwd_mode=bridge", "delta": "0:06:11.601524", "end": "2018-01-22 15:11:25.034534", "failed": true, "rc": 1, "start": "2018-01-22 15:05:13.433010", "stderr": "Error response from daemon: dial unix /run/docker/plugins/330e5e6cb7025e7c40805912541ff706fad4d35eb4bb34b877ea5004dfcf8511/netplugin.sock: connect: connection refused", "stderr_lines": ["Error response from daemon: dial unix /run/docker/plugins/330e5e6cb7025e7c40805912541ff706fad4d35eb4bb34b877ea5004dfcf8511/netplugin.sock: connect: connection refused"], "stdout": "1.1.7: Pulling from contiv/v2plugin\n1ba3fc0d8c93: Verifying Checksum\n1ba3fc0d8c93: Download complete\nDigest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30\nStatus: Downloaded newer image for contiv/v2plugin:1.1.7", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin", "1ba3fc0d8c93: Verifying Checksum", "1ba3fc0d8c93: Download complete", "Digest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30", "Status: Downloaded newer image for contiv/v2plugin:1.1.7"]}
fatal: [node1]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.122.61 control_url=10.65.122.61:9999 vxlan_port=8472 iflist=eno6 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=master fwd_mode=bridge", "delta": "0:06:12.083192", "end": "2018-01-22 15:11:25.836960", "failed": true, "rc": 1, "start": "2018-01-22 15:05:13.753768", "stderr": "Error response from daemon: dial unix /run/docker/plugins/6f11c1b2fea19a72d9aa2ef95c0e85c224891f982826f815ff8a556dc640e48c/netplugin.sock: connect: no such file or directory", "stderr_lines": ["Error response from daemon: dial unix /run/docker/plugins/6f11c1b2fea19a72d9aa2ef95c0e85c224891f982826f815ff8a556dc640e48c/netplugin.sock: connect: no such file or directory"], "stdout": "1.1.7: Pulling from contiv/v2plugin\n1ba3fc0d8c93: Verifying Checksum\n1ba3fc0d8c93: Download complete\nDigest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30\nStatus: Downloaded newer image for contiv/v2plugin:1.1.7", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin", "1ba3fc0d8c93: Verifying Checksum", "1ba3fc0d8c93: Download complete", "Digest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30", "Status: Downloaded newer image for contiv/v2plugin:1.1.7"]}
fatal: [node3]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.122.62 control_url=10.65.122.62:9999 vxlan_port=8472 iflist=eno6 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=master fwd_mode=bridge", "delta": "0:06:12.404043", "end": "2018-01-22 15:11:25.136644", "failed": true, "rc": 1, "start": "2018-01-22 15:05:12.732601", "stderr": "Error response from daemon: dial unix /run/docker/plugins/9c15133fdbe9ee55f4054b0f3af7fbd9be9ae8efc0bfd72d70b791f3ecfb27fd/netplugin.sock: connect: no such file or directory", "stderr_lines": ["Error response from daemon: dial unix /run/docker/plugins/9c15133fdbe9ee55f4054b0f3af7fbd9be9ae8efc0bfd72d70b791f3ecfb27fd/netplugin.sock: connect: no such file or directory"], "stdout": "1.1.7: Pulling from contiv/v2plugin\n1ba3fc0d8c93: Verifying Checksum\n1ba3fc0d8c93: Download complete\nDigest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30\nStatus: Downloaded newer image for contiv/v2plugin:1.1.7", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin", "1ba3fc0d8c93: Verifying Checksum", "1ba3fc0d8c93: Download complete", "Digest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30", "Status: Downloaded newer image for contiv/v2plugin:1.1.7"]}
to retry, use: --limit @/ansible/install_plays.retry
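For triage, the failing hosts and their `stderr` can be pulled out of output like the above with a short script. This is a minimal sketch, assuming the log keeps the `fatal: [host]: FAILED! => {...}` shape shown here; `parse_failures` is a hypothetical helper and not part of the installer.

```python
import json
import re

# Matches the prefix Ansible prints before the JSON result of a failed task.
FATAL_RE = re.compile(r"fatal: \[(?P<host>[^\]]+)\]: FAILED!\s*=>\s*")


def parse_failures(log_text):
    """Return one dict per 'fatal: [host]: FAILED! => {...}' entry."""
    decoder = json.JSONDecoder()
    failures = []
    for match in FATAL_RE.finditer(log_text):
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever follows it on the line.
            payload, _ = decoder.raw_decode(log_text, match.end())
        except json.JSONDecodeError:
            continue  # payload truncated or not JSON
        if not isinstance(payload, dict):
            continue
        failures.append({
            "host": match.group("host"),
            "rc": payload.get("rc"),
            "delta": payload.get("delta"),
            "stderr": payload.get("stderr", ""),
        })
    return failures
```

Grouping the result by `stderr` separates the `connection refused` and `no such file or directory` cases without reading each entry by hand.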

PLAY RECAP *****
node1 : ok=17 changed=9 unreachable=0 failed=1
node2 : ok=17 changed=9 unreachable=0 failed=1
node3 : ok=17 changed=9 unreachable=0 failed=1
node4 : ok=9 changed=4 unreachable=0 failed=0
node5 : ok=9 changed=4 unreachable=0 failed=0
node6 : ok=9 changed=4 unreachable=0 failed=0
node7 : ok=9 changed=4 unreachable=0 failed=0
node8 : ok=9 changed=4 unreachable=0 failed=0
node9 : ok=9 changed=4 unreachable=0 failed=0
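The recap can be reduced to the list of hosts that failed, which is convenient when comparing runs. This is a sketch, assuming the `host : ok=N changed=N unreachable=N failed=N` layout shown above; `failed_hosts` is a hypothetical helper name.

```python
import re

# One recap entry: host name followed by the four counters shown above.
RECAP_RE = re.compile(
    r"(?P<host>\S+)\s*:\s*ok=(?P<ok>\d+)\s+changed=(?P<changed>\d+)\s+"
    r"unreachable=(?P<unreachable>\d+)\s+failed=(?P<failed>\d+)"
)


def failed_hosts(recap_text):
    """Return hosts with failed > 0, in the order they appear."""
    return [
        m.group("host")
        for m in RECAP_RE.finditer(recap_text)
        if int(m.group("failed")) > 0
    ]
```

The regular expression does not depend on line breaks, so it works on the recap whether each host is on its own line or the whole recap is run together.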

Worker Node install failures -

TASK [contiv_network : install v2plugin on worker nodes] ***
fatal: [node6]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.140 control_url=10.65.121.140:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:04:51.934836", "end": "2018-01-25 11:38:37.231374", "failed": true, "rc": 1, "start": "2018-01-25 11:33:45.296538", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node7]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.141 control_url=10.65.121.141:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:04:52.343379", "end": "2018-01-25 11:38:44.770569", "failed": true, "rc": 1, "start": "2018-01-25 11:33:52.427190", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node4]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.142 control_url=10.65.121.142:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:04:52.475222", "end": "2018-01-25 11:38:46.382501", "failed": true, "rc": 1, "start": "2018-01-25 11:33:53.907279", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node8]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.130 control_url=10.65.121.130:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:04:54.685860", "end": "2018-01-25 11:38:48.099427", "failed": true, "rc": 1, "start": "2018-01-25 11:33:53.413567", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node5]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.143 control_url=10.65.121.143:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:04:55.817107", "end": "2018-01-25 11:38:49.210135", "failed": true, "rc": 1, "start": "2018-01-25 11:33:53.393028", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node12]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.129 control_url=10.65.121.129:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:01:54.202116", "end": "2018-01-25 11:40:35.330632", "failed": true, "rc": 1, "start": "2018-01-25 11:38:41.128516", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node11]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.128 control_url=10.65.121.128:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:01:56.424311", "end": "2018-01-25 11:40:43.263658", "failed": true, "rc": 1, "start": "2018-01-25 11:38:46.839347", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node9]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.124 control_url=10.65.121.124:9999 vxlan_port=8472 iflist=eno6 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:02:54.790835", "end": "2018-01-25 11:41:46.656811", "failed": true, "rc": 1, "start": "2018-01-25 11:38:51.865976", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
changed: [node10]

PLAY RECAP *********************************************************************
node1 : ok=38 changed=19 unreachable=0 failed=0
node10 : ok=23 changed=14 unreachable=0 failed=0
node11 : ok=16 changed=9 unreachable=0 failed=1
node12 : ok=16 changed=9 unreachable=0 failed=1
node2 : ok=37 changed=18 unreachable=0 failed=0
node3 : ok=37 changed=18 unreachable=0 failed=0
node4 : ok=16 changed=9 unreachable=0 failed=1
node5 : ok=16 changed=9 unreachable=0 failed=1
node6 : ok=16 changed=9 unreachable=0 failed=1
node7 : ok=16 changed=9 unreachable=0 failed=1
node8 : ok=16 changed=9 unreachable=0 failed=1
node9 : ok=16 changed=9 unreachable=0 failed=1

## Worker node failure key error message -

`failed": true, "rc": 1, "start": "2018-01-25 11:33:45.296538", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}`

## Master node failure key error message -

`"stderr": "Error response from daemon: dial unix /run/docker/plugins/330e5e6cb7025e7c40805912541ff706fad4d35eb4bb34b877ea5004dfcf8511/netplugin.sock: connect: connection refused", "stderr_lines": ["Error response from daemon: dial unix /run/docker/plugins/330e5e6cb7025e7c40805912541ff706fad4d35eb4bb34b877ea5004dfcf8511/netplugin.sock: connect: connection refused"], "stdout": "1.1.7: Pulling from contiv/v2plugin\n1ba3fc0d8c93: Verifying Checksum\n1ba3fc0d8c93: Download complete\nDigest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30\nStatus: Downloaded newer image for contiv/v2plugin:1.1.7", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin", "1ba3fc0d8c93: Verifying Checksum", "1ba3fc0d8c93: Download complete", "Digest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30", "Status: Downloaded newer image for contiv/v2plugin:1.1.7"]}`

## Steps to Reproduce (for bugs)

1. Create DEE swarm mode cluster setup with 3 master and couple of worker nodes
2. Download latest Contiv Installer bits version 1.1.7 from Contiv Github Install release location for full install
3. Modify cfg.yml and env.json to suit your cluster environment
4. Issue command for installation - `./install/ansible/install_swarm.sh -f install/ansible/cfg.yml -u root -e ~/.ssh/id_rsa -p`

## Your Environment

* netctl version - 1.1.7/v2Plugin
* Orchestrator version (e.g. kubernetes, mesos, swarm): Swarm/UCP2.2.4/Docker Engine17.06.2-ee-6
* Operating System and version: RHEL7.3

## Installation logs are attached herewith -

[contiv_install_01-22-2018.09-34-14.UTC.log](https://github.com/contiv/install/files/1662774/contiv_install_01-22-2018.09-34-14.UTC.log)
[contiv_install_01-25-2018.05-56-47.UTC.log](https://github.com/contiv/install/files/1662775/contiv_install_01-25-2018.05-56-47.UTC.log)

vhosakot commented 6 years ago

Looking at the attached logs contiv_install_01-22-2018.09-34-14.UTC.log and contiv_install_01-25-2018.05-56-47.UTC.log, I see failures when the contiv docker v2plugin was installed.

The following command failed on both master and worker nodes in the logs:

/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=<IP> control_url=<IP>:9999 vxlan_port=8472 iflist=<interface> plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=[master|worker] fwd_mode=bridge
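To replay that command by hand on a single node, the argument list can be assembled from the per-node values. This is a sketch only: `v2plugin_install_cmd` is a hypothetical helper, and its defaults (version, port 9999, `vxlan_port`, `cluster_store`, `fwd_mode`) are copied from the failing commands in the logs rather than from any installer documentation.

```python
def v2plugin_install_cmd(ctrl_ip, iface, role, version="1.1.7",
                         cluster_store="etcd://localhost:2379",
                         vxlan_port=8472, fwd_mode="bridge"):
    """Build the argv for the plugin install step seen in the logs.

    Hypothetical helper; defaults mirror the values in the failing runs.
    """
    if role not in ("master", "worker"):
        raise ValueError("role must be 'master' or 'worker'")
    plugin = f"contiv/v2plugin:{version}"
    return [
        "/usr/bin/docker", "plugin", "install",
        "--grant-all-permissions", plugin,
        f"ctrl_ip={ctrl_ip}",
        f"control_url={ctrl_ip}:9999",
        f"vxlan_port={vxlan_port}",
        f"iflist={iface}",
        f"plugin_name={plugin}",
        f"cluster_store={cluster_store}",
        f"plugin_role={role}",
        f"fwd_mode={fwd_mode}",
    ]
```

For example, `v2plugin_install_cmd("10.65.122.63", "eno6", "master")` yields the same arguments as the node2 command in the master log.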

Can you send the logs in /var/log/contiv/ and /var/log/contiv*.log from the master and worker nodes that saw this issue?

rkharya commented 6 years ago

Worker node install failures - worker nodes don't have /var/log/contiv/ folder or any other contiv logs. So attaching logs from corresponding master nodes in the same cluster - contiv-master-logs-workerfailure.tar.gz

Master node install failures - (as observed on 2nd cluster) - contiv-master-node-logs.tar.gz

In this case the master nodes don't have netctl installed, though netplugin booted up cleanly -

[root@DEE-Ctrl-1 contiv]# cat plugin_bootup.log
2018-01-22T09:41:03Z|00001|vlog|INFO|opened log file /var/log/contiv/ovs-db.log
2018-01-22T09:41:03Z|00001|vlog|INFO|opened log file /var/log/contiv/ovs-vswitchd.log
Waiting for netmaster to be ready for connections
Netmaster ready for connections, setting forward mode to bridge
Forward mode is set
n-if=eno6 -cluster-store=etcd://localhost:2379 -ctrl-ip=10.65.122.61
/netmaster -plugin-name=contiv/v2plugin:1.1.7 -cluster-mode=swarm-mode -cluster-store=etcd://localhost:2379 -control-url=10.65.122.61:9999

Also docker plugin ls doesn't list Contiv -

[root@DEE-Ctrl-1 contiv]# docker plugin ls
ID NAME DESCRIPTION ENABLED
631d379403b4 docker/telemetry:1.0.0.linux-x86_64-stable Docker Inc. metrics exporter false
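The same check can be scripted when many nodes need inspecting. This is a minimal sketch, assuming the text comes from `docker plugin ls` in its default table form (header row, then one plugin per line with the ENABLED value last); `plugin_status` is a hypothetical helper name.

```python
def plugin_status(ls_output, name_prefix="contiv/v2plugin"):
    """Return True/False for the ENABLED column, or None if not listed."""
    for line in ls_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        # ID is first, NAME second, ENABLED last; DESCRIPTION may hold spaces.
        name, enabled = fields[1], fields[-1]
        if name.startswith(name_prefix):
            return enabled == "true"
    return None
```

For the listing above it returns `None`, since only `docker/telemetry` is present.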

unclejack commented 6 years ago

@rkharya: Have you reproduced this on CentOS or on another distribution?

rkharya commented 6 years ago

@unclejack: Reproducible on RHEL7.3 environments - BareMetal and BareMetal with VMs