canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

INTERNAL_ERROR when running kubectl commands against cluster deployed with 1.25/candidate snap channel and stable charm. #3401

Closed asbalderson closed 2 years ago

asbalderson commented 2 years ago

Summary

While testing microk8s 1.25/candidate (v1.25.0-rc.1) using the stable charm for a 3-unit cluster, all 3 units came up active/idle, but I was unable to run juju add-k8s against the kube.conf from the cluster. After some inspection I found I was unable to run any commands against the cluster; kubectl --kubeconfig=kube.conf get po, for example, returned:

Unable to connect to the server: stream error: stream ID 1; INTERNAL_ERROR; received from peer

After inspecting the syslog on the units, I saw many messages relating to context deadline exceeded for etcd.

Aug 22 18:40:50 microk8s6-1 microk8s.daemon-k8s-dqlite[10987]: time="2022-08-22T18:40:50Z" level=error msg="error while range on /registry/pods/kube-system/calico-kube-controllers-7bf8546cfb-j2rtr : query (try: 0): context deadline exceeded"
Aug 22 18:40:50 microk8s6-1 microk8s.daemon-kubelite[9647]: {"level":"warn","ts":"2022-08-22T18:40:50.923Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001d84000//var/snap/microk8s/3686/var/kubernetes/backend/kine.sock:12379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Aug 22 18:40:50 microk8s6-1 microk8s.daemon-kubelite[9647]: E0822 18:40:50.924876    9647 status.go:71] apiserver received an error that is not an metav1.Status: context.deadlineExceededError{}: context deadline exceeded
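To gauge how frequent these timeouts are, the journal can be filtered down to the deadline errors. A minimal sketch, using a copy of two of the lines quoted above so the filter is self-contained; on a live unit you would point grep at /var/log/syslog instead:

```shell
# Embed a sample of the journal lines quoted above (illustration only;
# on a unit, grep /var/log/syslog directly).
cat > /tmp/unit-syslog-sample <<'EOF'
Aug 22 18:40:50 microk8s6-1 microk8s.daemon-k8s-dqlite[10987]: time="2022-08-22T18:40:50Z" level=error msg="error while range on /registry/pods/kube-system/calico-kube-controllers-7bf8546cfb-j2rtr : query (try: 0): context deadline exceeded"
Aug 22 18:40:50 microk8s6-1 microk8s.daemon-kubelite[9647]: E0822 18:40:50.924876    9647 status.go:71] apiserver received an error that is not an metav1.Status: context.deadlineExceededError{}: context deadline exceeded
EOF

# Count datastore timeouts; a steadily growing count points at
# k8s-dqlite rather than the API server itself.
grep -c 'context deadline exceeded' /tmp/unit-syslog-sample
```

A high and growing count across all units would suggest the datastore (k8s-dqlite/kine), not kube-apiserver, is where requests are stalling.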

What Should Happen Instead?

I would expect that when deploying 1.25 with the stable charm, I would get a working cluster against which I could run juju add-k8s or query regular information. kubectl get po, for example, would return the running pods.

Reproduction Steps

  1. Deploy a bundle for microk8s; mine is below. Note that 10.246.164.2 is our MAAS server, which handles DNS.
    applications:
      microk8s:
        bindings:
          ? ''
          : oam-space
          cluster: internal-space
        charm: microk8s
        expose: true
        num_units: 3
        options:
          addons: dns ingress storage
          channel: 1.25/candidate
          containerd_env: |
            HTTPS_PROXY=http://squid.internal:3128
            NO_PROXY=10.1.0.0/16,10.152.183.0/24
            ulimit -n 65536 || true
            ulimit -l 16384 || true
          coredns_config: |
            .:53 {
                errors
                health {
                    lameduck 5s
                }
                ready
                log . {
                    class error
                }
                kubernetes cluster.local in-addr.arpa ip6.arpa {
                    pods insecure
                    fallthrough in-addr.arpa ip6.arpa
                }
                prometheus :9153
                forward . 10.246.164.2
                cache 30
                loop
                reload
                loadbalance
            }
        to:
        - '0'
        - '1'
        - '2'
    machines:
      '0':
        constraints: tags=microk8s,silo3 zones=zone1
      '1':
        constraints: tags=microk8s,silo3 zones=zone2
      '2':
        constraints: tags=microk8s,silo3 zones=zone3
    relations: []
    series: focal
  2. Grab the kube.conf from the leader unit (juju exec microk8s/leader microk8s config) and save it to a file (kube.conf).
  3. Run kubectl --kubeconfig=kube.conf get po
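Steps 2 and 3 above can be sketched as shell commands. This is a hedged sketch: the juju exec invocation follows the report's wording, and the exact syntax for targeting the leader unit can differ between Juju versions. The guard keeps the sketch safe to paste on a machine without the juju client.

```shell
# Fetch the kubeconfig from the leader unit, then query the cluster.
# Guarded so nothing runs if the juju client is not installed here.
if command -v juju >/dev/null 2>&1; then
  juju exec microk8s/leader microk8s config > kube.conf
  kubectl --kubeconfig=kube.conf get po
else
  echo "juju client not found; run this from the Juju client machine"
fi
```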

Introspection Report

inspection-report-20220822_191338.tar.gz

asbalderson commented 2 years ago

I should also note that when trying to add-k8s with juju, I get the following output:

$ KUBECONFIG=/home/ubuntu/project/generated/microk8s/kube.conf juju add-k8s microk8s_cloud --controller foundations-maas
ERROR making juju admin credentials in cluster: ensuring cluster role "juju-credential-bf5f2498" in namespace "kube-system": the server was unable to return a response in the time allotted, but may still be processing the request (get clusterroles.rbac.authorization.k8s.io juju-credential-bf5f2498)
asbalderson commented 2 years ago

Attaching logs from the other 2 units (0 and 2): inspection-report-20220822_193734.tar.gz inspection-report-20220822_193451.tar.gz

neoaggelos commented 2 years ago

Apologies for not replying to this issue sooner.

I was unable to reproduce this issue in any of our development environments. Looking at the error messages, along with log lines filled with timeouts and slow disk operations, I think it might just be transient networking issues or resource limits (e.g. open files).
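For anyone who hits this again, the open-file-limit theory is quick to check on each unit. A minimal sketch, assuming a Linux unit and the kubelite daemon name used by the snap (the /proc path and pgrep pattern are assumptions, not from this issue):

```shell
# Shell-level open-file limit; the bundle's containerd_env above tries
# to raise this with `ulimit -n 65536 || true`.
ulimit -n

# Per-process limit for the running kubelite daemon, if present.
pid=$(pgrep -f kubelite | head -n1 || true)
if [ -n "$pid" ]; then
  grep 'Max open files' "/proc/$pid/limits"
else
  echo "kubelite not running on this machine"
fi
```

If the per-process limit is far below the configured 65536, the ulimit lines in containerd_env are not taking effect, which would fit the resource-limit explanation.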

Closing this issue for now, please reopen if this occurs again.