contiv / netplugin

Container networking for various use cases
Apache License 2.0

Container deployment issues seen on a Swarm mode cluster with rebooted workers running on VMs #1112

Open sisudhir opened 6 years ago

sisudhir commented 6 years ago

Description

In a mixed Swarm mode cluster (baremetal and VMs) running Contiv 1.1.7, docker service scale issues are seen when the worker VMs are rebooted. Before the reboot, the cluster had containers running on all of the nodes (baremetal and VMs) using the Contiv network and policy framework.

Expected Behavior

Rebooting the worker VMs should not affect container deployment or scaling performance on the Contiv network.

Observed Behavior

On rebooting the VMs that were running containers, the containers moved successfully to the surviving worker nodes. However, docker service scale takes an unusually long time, and connection errors are seen in the netmaster log such as: Error dial tcp 10.65.121.129:9002: getsockopt: no route to host connecting to 10.65.121.129:%!s(uint16=9002). Retrying..
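
As an aside, the "%!s(uint16=9002)" fragment in that message is not part of the address; it is Go's fmt package flagging a verb/type mismatch (a uint16 port value formatted with %s). A minimal standalone sketch, independent of the netplugin code, that reproduces the same output:

package main

import "fmt"

func main() {
	var port uint16 = 9002
	host := "10.65.121.129"

	// Formatting a uint16 with %s makes fmt emit %!s(uint16=9002),
	// the exact fragment seen in the netmaster log.
	fmt.Printf("connecting to %s:%s\n", host, port)

	// Formatting it with %d prints the port as intended.
	fmt.Printf("connecting to %s:%d\n", host, port)
}

The first Printf prints "connecting to 10.65.121.129:%!s(uint16=9002)"; the second prints "connecting to 10.65.121.129:9002".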

Steps to Reproduce (for bugs)

  1. Created a DEE 17.06 cluster in Swarm mode with a mixed topology of baremetal and VM nodes. Master nodes are on baremetal and worker nodes are on VMs.
  2. Installed Contiv 1.1.7, created the back-end Contiv network and policies, applied the policies via a group with a Contiv tag, and created the corresponding Docker network.
  3. Created a Docker service using the Contiv network as the backend and checked network endpoint connectivity between the containers and the SVIs. All worked as expected.
  4. Rebooted 2 worker VMs; the containers running on them moved successfully to the surviving nodes.
  5. Tried scaling the same Docker service to add 5 more containers on the same Contiv network.
  6. The service scale took unusually long, more than 30 minutes, to add the 5 containers (a timing sketch is shown after these steps).
  7. Saw connection errors to the rebooted worker VMs in the netmaster logs.
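
One way to quantify step 6 is to poll the service's tasks after the scale command and record how long they take to converge. The sketch below is only illustrative: it assumes the Docker Go SDK (github.com/docker/docker/client), and the service name and target replica count are hypothetical placeholders.

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/api/types/swarm"
	"github.com/docker/docker/client"
)

func main() {
	serviceName := "web" // hypothetical service name
	wantRunning := 10    // hypothetical target replica count

	// Reads DOCKER_HOST etc. from the environment.
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()
	start := time.Now()

	for {
		// List only the tasks that belong to the scaled service.
		f := filters.NewArgs()
		f.Add("service", serviceName)
		tasks, err := cli.TaskList(ctx, types.TaskListOptions{Filters: f})
		if err != nil {
			panic(err)
		}
		running := 0
		for _, t := range tasks {
			if t.Status.State == swarm.TaskStateRunning {
				running++
			}
		}
		fmt.Printf("%s elapsed: %d of %d tasks running\n",
			time.Since(start).Round(time.Second), running, wantRunning)
		if running >= wantRunning {
			break
		}
		time.Sleep(10 * time.Second)
	}
	fmt.Printf("scale converged after %s\n", time.Since(start).Round(time.Second))
}

Run it immediately after issuing docker service scale; on a healthy cluster it should report convergence within a minute or two rather than the 30 minutes observed here.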

Your Environment

vhosakot commented 6 years ago

Looking at the logs in contiv-logs.tar.gz, it looks like an RPC issue when netmaster connects to Ofnet:

netmaster.log has:

time="Jan 18 08:36:21.576831134" level=warning msg="Error dial tcp 10.65.121.129:9002: getsockopt: no route to host connecting to 10.65.121.129:%!s(uint16=9002). Retrying.."
time="Jan 18 08:36:22.578994895" level=error msg="Failed to connect to Rpc server 10.65.121.129:9002"
time="Jan 18 08:36:22.579084442" level=error msg="Error calling RPC: OfnetAgent.AddMaster. Could not connect to server"
time="Jan 18 08:36:22.579133952" level=error msg="Error calling AddMaster rpc call on node {10.65.121.129 9002}. Err: Could not connect to server"
time="Jan 18 08:36:22.579152875" level=error msg="Error adding node {10.65.121.129 9002}. Err: Could not connect to server"

Can you send the docker daemon's logs when you see this issue?

blaksmit commented 6 years ago

On today's call, there was an ask to see if this is an issue on K8s as well or just in Docker Swarm mode.

vhosakot commented 6 years ago

@blaksmit This issue is seen in Docker Swarm mode. I'm pretty sure it cannot be seen in k8s, as k8s does not even have the docker service scale command that exposes this issue.

blaksmit commented 6 years ago

@vhosakot the comment was to see whether a similar VM scale issue is seen with K8s.

vhosakot commented 6 years ago

@blaksmit I see, got it. We could test if this issue is seen when kubectl scale is done (k8s equivalent of docker service scale).

sisudhir commented 6 years ago

Please note the changed title. This is not just a scale issue, as we are also seeing it when deploying a new service with just a single container.

g1rana commented 6 years ago

@sisudhir, is this issue seen at every iteration of your failure test? Is it possible for you to share your setup with me? I can take a look at the setup when the errors occur.