Compose on K8 Crashes when Azure Kubernetes (AKS) Node Restarts

goodinfoconsulting commented 5 years ago

We followed the guide here to install Compose on Kubernetes on a 1-node Azure AKS cluster: https://github.com/docker/compose-on-kubernetes/blob/master/docs/install-on-aks.md

We followed these instructions to a T, including installing an etcd cluster separate from the default etcd instance that ships with Kubernetes.

Everything works great on first install.

As advertised, we are able to run docker stack deploy successfully, and deploy our containers and services using our compose YAML files.

Problem

However, when we restart the AKS Node, the Compose and Compose API deployments fail to start with the following errors:

Compose

Liveness probe failed: Get http://10.240.0.27:8080/healthz: dial tcp 10.240.0.27:8080: connect: connection refused

Compose API

Liveness probe failed: Get http://10.240.0.11:8080/healthz: dial tcp 10.240.0.11:8080: connect: connection refused
Back-off restarting failed container

The pods fail to start, with the following error:

Waiting: CrashLoopBackOff

Deleting the pods does not help. New Pods throw the same error.

Also, trying to run any docker stack command when the compose containers are in this state throws the following error:

$ docker stack ls --orchestrator=kubernetes

the server is currently unable to handle the request (get stacks.compose.docker.com)

Deleting and re-installing the Compose API using installer-windows.exe -namespace=compose -etcd-servers=http://compose-etcd-client:2379 -tag=v0.4.18 gets the pods to start again, but the service remains broken: any docker stack command still fails with the same error, the server is currently unable to handle the request (get stacks.compose.docker.com). This is despite all pods and deployments now being in a green state.

In short, restarting the AKS node completely breaks the Compose API.

The only way we've found so far to restore the API is to completely delete the AKS cluster and create a new one. Not a tenable production solution.

Expected behavior:

Restarting AKS nodes should bring all components of Compose on Kubernetes back online automatically, and developers should be able to run docker stack as soon as the node is back online, without further intervention.

simonferquel commented 5 years ago

I'll run some tests. I think the etcd instance restart might be causing the issue.

goodinfoconsulting commented 5 years ago

Hi @simonferquel , any updates on this?

simonferquel commented 5 years ago

I have not had time to investigate the issue yet; I plan to do that early next week.

diabloxenon commented 5 years ago

After following the installation guide exactly (https://github.com/docker/compose-on-kubernetes/blob/master/docs/install-on-microk8s.md), I am seeing the same problem on MicroK8s.

simonferquel commented 5 years ago

I just had a look at it and found the root cause:

The default etcd-operator configuration uses EmptyDir instead of your default StorageProvider as its persistent storage driver, so the etcd data does not survive the node restart. I suppose this is to make deploying test clusters as fast as possible, but the default configuration should be tweaked a bit.
There are 2 things you can do to make things better:
- https://github.com/coreos/etcd-operator/blob/master/doc/user/spec_examples.md#custom-persistentvolumeclaim-definition: explicitly set a persistent volume claim spec (using either "default" or "managed-premium" as the storage class, for a standard Azure disk or a managed premium Azure disk respectively)
- If you have at least 3 nodes, specify an anti-affinity to make sure each etcd member runs on a separate node: https://github.com/coreos/etcd-operator/blob/master/doc/user/spec_examples.md#custom-persistentvolumeclaim-definition
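
For readers who want to see what these two changes look like together, here is a minimal sketch of an EtcdCluster resource, not a tested configuration. The cluster name compose-etcd is inferred from the compose-etcd-client service used in the installer command earlier in the thread. The namespace, cluster size, disk size, and the etcd_cluster pod label used in the selector are assumptions and should be checked against your own install and the etcd-operator documentation.

```yaml
# Sketch only: EtcdCluster with persistent storage and pod anti-affinity.
apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdCluster
metadata:
  name: compose-etcd          # assumed: matches the compose-etcd-client service
  namespace: compose          # assumed: same namespace as -namespace=compose
spec:
  size: 3                     # assumed: one member per node on a 3-node cluster
  pod:
    # Back each member with an Azure disk instead of EmptyDir.
    persistentVolumeClaimSpec:
      storageClassName: default   # or managed-premium
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi            # assumed size, adjust as needed
    # Keep members on separate nodes (requires at least 3 nodes).
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                etcd_cluster: compose-etcd   # assumed label set by the operator
            topologyKey: kubernetes.io/hostname
```

On a single-node cluster like the one in the original report, the affinity block would leave members unschedulable, so it would have to be dropped and the size reduced; only the persistent volume claim part applies there.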

Additionally, for production readiness, it is strongly recommended that you use mutual TLS to connect to etcd, as described here: https://github.com/coreos/etcd-operator/blob/master/doc/user/cluster_tls.md
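
As a rough illustration of the TLS recommendation, the fragment below shows the static TLS section that the linked etcd-operator document describes for an EtcdCluster spec. The three secret names are hypothetical placeholders: the secrets must be created beforehand with the certificates and keys laid out as that document requires, and the Compose API side must also be given matching client credentials, which is not shown here.

```yaml
# Sketch only: static TLS section of an EtcdCluster spec.
# Secret names are hypothetical; create them as described in cluster_tls.md.
spec:
  TLS:
    static:
      member:
        peerSecret: etcd-peer-tls       # certs for member-to-member traffic
        serverSecret: etcd-server-tls   # certs presented to clients
      operatorSecret: etcd-client-tls   # client certs used by the operator
```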

simonferquel commented 5 years ago

I will write a PR about proper etcd-operator usage.

Compose on K8 Crashes when Azure Kubernetes (AKS) Node Restarts #71