canonical / k8s-snap

Canonical Kubernetes is an opinionated and CNCF conformant Kubernetes operated by Snaps and Charms, which come together to bring simplified operations and an enhanced security posture on any infrastructure.

Getting started - context deadline exceeded #317

Closed. VariableDeclared closed this issue 6 months ago.

VariableDeclared commented 7 months ago

Please describe the question or issue you're facing with "Getting started - Canonical Kubernetes documentation".

Hello,

Following the Getting Started guide on an environment connected to the internet, bootstrap fails to bring up the node and the pods remain in Pending:

root@k8s-test:~# snap install k8s --edge --classic
k8s (edge) v1.29.3 from Canonical✓ installed
root@k8s-test:~# sudo k8s bootstrap
Bootstrapping the cluster. This may take a few seconds, please wait.
Bootstrapped a new Kubernetes cluster with node address "192.168.3.36:6400".
The node will be 'Ready' to host workloads after the CNI is deployed successfully.

root@k8s-test:~# sudo k8s status
status: not ready
high-availability: no
datastore:
  type: k8s-dqlite
  voter-nodes:
    - 192.168.3.36:6400
  standby-nodes: none
  spare-nodes: none
network:
  enabled: true
dns:
  enabled: true
  cluster-domain: cluster.local
  service-ip: 10.152.183.160
  upstream-nameservers:
  - /etc/resolv.conf
ingress:
  enabled: false
  default-tls-secret: ""
  enable-proxy-protocol: false
load-balancer:
  enabled: false
  cidrs: []
  l2-mode: false
  l2-interfaces: []
  bgp-mode: false
  bgp-local-asn: 0
  bgp-peer-address: ""
  bgp-peer-asn: 0
  bgp-peer-port: 0
local-storage:
  enabled: false
  local-path: /var/snap/k8s/common/rawfile-storage
  reclaim-policy: Delete
  set-default: true
gateway:
  enabled: true
metrics-server:
  enabled: true

root@k8s-test:~# sudo k8s kubectl get pods -n kube-system
NAME                               READY   STATUS    RESTARTS   AGE
cilium-operator-5f76fdbf9c-kbllv   0/1     Pending   0          6s
coredns-66579b5b88-mmxzh           0/1     Pending   0          4s
metrics-server-57db9dfb7b-r5mls    0/1     Pending   0          6s
root@k8s-test:~# sudo k8s kubectl get nodes
No resources found
root@k8s-test:~# sudo k8s kubectl get nodes -A
No resources found
root@k8s-test:~# sudo k8s kubectl get nodes -A
No resources found
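
As a starting point for digging into why the pods stay Pending and whether the kubelet ever registered the node, the cluster events and the snap's services can be checked with standard kubectl and snap commands. This is only a sketch; the pod name below is taken from the output above and will differ on other machines:

sudo k8s kubectl get events -n kube-system --sort-by='.lastTimestamp'
sudo k8s kubectl describe pod -n kube-system coredns-66579b5b88-mmxzh
sudo snap services k8s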

Could this be due to me running as root? Is this a known limitation?

Thank you, Peter


Reported from: https://documentation.ubuntu.com/canonical-kubernetes/latest/tutorial/getting-started/

VariableDeclared commented 7 months ago

Also getting context deadline exceeded with the ubuntu user on two different VMs, each with 8 GiB RAM, a 100 GB disk, and 4 vCPUs:

ubuntu@k8s-test:~$ sudo k8s bootstrap
Bootstrapping the cluster. This may take a few seconds, please wait.
Error: Failed to bootstrap the cluster.

The error was: failed to bootstrap new cluster using POST /k8sd/cluster: failed to bootstrap new cluster: Post "http://control.socket/cluster/control": context deadline exceeded


Removing lxc constraint
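
While the inspect script mentioned below is not available yet, one way to see what timed out is to look at the k8sd logs around the failure. snapd names the snap's units snap.k8s.<service>, so something along these lines should work (standard snap and journalctl options; adjust the time window as needed):

sudo snap logs k8s -n 200
sudo journalctl -u "snap.k8s.*" --since "15 min ago" --no-pager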

bschimke95 commented 6 months ago

Hey Peter,

Thanks for raising this. The VM specs should be fine. You could try to extend the timeout

sudo k8s bootstrap --timeout 10m

but I think the problem is on our side.

Could you add the output of

journalctl -f --lines 2000

(We are working on an inspect script this pulse, which will automate the collection of debug info.)
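
To attach that output here, it can be redirected into a file first (the file name is only an example):

journalctl --lines 2000 --no-pager > k8s-journal.log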

VariableDeclared commented 6 months ago

Hello @bschimke95! Indeed, after using the 10m timeout, bootstrap now passed on one of the nodes. Is there an NVMe requirement for the bootstrap? These VMs are backed by spinning disks.
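
As a rough check of whether slow disks are the bottleneck, a simple direct-I/O write test can be run on the VM (the path and size are only examples):

dd if=/dev/zero of=/var/tmp/k8s-ddtest bs=1M count=512 oflag=direct status=progress
rm /var/tmp/k8s-ddtest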

bschimke95 commented 6 months ago

Hey Peter,

Not in particular. Also, the suggestion I provided might just have worked by luck. We see this issue as well in #321 and #277. This basically happens because of an internal timeout in microcluster that we cannot work around yet. It is addressed in https://github.com/canonical/microcluster/pull/105.

On our side, we are also working towards reducing the overall time the commands need to finish, which eventually also "fixes" this issue. A first effort was made in #339, with a follow-up PR coming soon to make those commands even faster by moving the last pieces into an asynchronous approach.

I will close this issue in favour of #321.