NixOS / nixpkgs


Kubernetes on arm64: fix coredns #130759

Closed LucaFulchir closed 2 years ago

LucaFulchir commented 3 years ago

Describe the bug

I am trying to run a test cluster on a Raspberry Pi 4.

coredns does not start because nixpkgs hardcodes an amd64 image.

To Reproduce

Steps to reproduce the behavior:

  1. Try to follow the Kubernetes deployment guide: https://nixos.wiki/wiki/Kubernetes
  2. Realize you are missing a ton of other things on the Raspberry Pi, so:
    • disable swap (maybe it's enough to modify the kubernetes units with MemorySwapMax=0? I have not tried yet)
    • hugetlb is missing, so add the kernel patch:
      kernelPatches = [ {
        name = "HUGETLB";
        patch = null;
        extraConfig = ''
          ARCH_ENABLE_HUGEPAGE_MIGRATION y
          ARCH_WANT_GENERAL_HUGETLB y
          CGROUP_HUGETLB y
          HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD y
          HUGETLBFS y
          HUGETLB_PAGE y
          TRANSPARENT_HUGEPAGE_MADVISE y
        '';
      } ];
    • add to boot.kernelParams: "cgroup_enable=memory" "swapaccount=1" "hugepagesz=2M" "hugepages=512"
    • (unknown if needed) add to environment.systemPackages some additional dependencies for debugging and for supporting kubernetes addons:
      podman
      cri-o
      cri-tools
      ebtables
      ethtool
      socat
      conntrack-tools
      containerd
    • remember to disable virtualisation.podman.enable (docker.enable works): it is not compatible, and nixos-rebuild will fail when linking files for etc/cni/net.d/...
    • etcd does not really support ARM, so we need to add:
      systemd.services = {
        "etcd" = {
          environment = {
            ETCD_UNSUPPORTED_ARCH = "arm64";
          };
        };
      };
  3. Check the kubernetes nodes: kubectl get nodes -A. If everything went fine, the node should be Ready. If it isn't, reboot. Somehow kubernetes starts properly only on system boot; otherwise the container network is not set up.
  4. If the node is Ready, check the coredns container logs: kubectl logs --namespace=kube-system -l k8s-app=kube-dns --timestamps -f
    The output should be: standard_init_linux.go:228: exec user process caused: exec format error. This usually means "wrong architecture for this container".
  5. Realize that in services.kubernetes.addons.dns.coredns the image SHA is hardcoded and does not depend on the architecture.
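
The host-level workarounds from step 2 could be gathered into one NixOS module, roughly as sketched here. This is untested and rests on a few assumptions: that swap is disabled by clearing `swapDevices`, and that docker is the runtime left enabled. The HUGETLB kernel patch and the extra packages are left out for brevity.

```nix
{ lib, ... }:
{
  # Assumption: clearing swapDevices is how swap gets disabled on this host.
  swapDevices = lib.mkForce [ ];

  # Kernel parameters quoted in step 2.
  boot.kernelParams = [
    "cgroup_enable=memory"
    "swapaccount=1"
    "hugepagesz=2M"
    "hugepages=512"
  ];

  # As reported in step 2, enabling podman breaks nixos-rebuild; docker works.
  virtualisation.podman.enable = false;
  virtualisation.docker.enable = true;

  # As reported in step 2, etcd needs this variable to run on arm64.
  systemd.services.etcd.environment.ETCD_UNSUPPORTED_ARCH = "arm64";
}
```

Setting the etcd variable through the dotted attribute path is equivalent to the nested form shown in step 2.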

Expected behavior

Kubernetes services start correctly

Fix found

Tell NixOS to force a different coredns image (note: do not upgrade to 1.8.X, it won't work):

    # use coredns
    addons.dns = {
      enable = true;
      coredns = {
        finalImageTag = "1.7.1";
        imageDigest = "sha256:4a6e0769130686518325b21b0c1d0688b54e7c79244d48e1b15634e98e40c6ef";
        imageName = "coredns/coredns";
        sha256 = "16fx2p48kbq9f5b59wcaad5ypi6gx2qcvihkasn7j3n3ri1cd1d8";
      };
    };

...can we integrate some of this (at least an arch-dependent coredns) in nixpkgs? I would keep the manual ETCD_UNSUPPORTED_ARCH, but it should be better documented in the wiki, too.
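
As a rough idea of what an arch-dependent override might look like on the user side, here is an untested sketch. It reuses only the image values quoted above. The `pkgs.stdenv.hostPlatform.isAarch64` condition is an assumption about how one could select them, and it relies on the `coredns` option falling back to its module default when the condition is false.

```nix
{ lib, pkgs, ... }:
{
  services.kubernetes.addons.dns = {
    enable = true;
    # Apply the override only on aarch64; other platforms keep the module default.
    coredns = lib.mkIf pkgs.stdenv.hostPlatform.isAarch64 {
      finalImageTag = "1.7.1";
      imageDigest = "sha256:4a6e0769130686518325b21b0c1d0688b54e7c79244d48e1b15634e98e40c6ef";
      imageName = "coredns/coredns";
      sha256 = "16fx2p48kbq9f5b59wcaad5ypi6gx2qcvihkasn7j3n3ri1cd1d8";
    };
  };
}
```

A fix inside nixpkgs itself would more likely change the default of the option, so this is only a stopgap for individual configurations.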

Notify maintainers

uhm... I can't find references to any maintainers

jali-clarke commented 3 years ago

I was running into this same issue a while back and put up a PR for this: https://github.com/NixOS/nixpkgs/pull/116409/files (merged)

Can you try it out and see if it resolves this issue?

stale[bot] commented 2 years ago

I marked this as stale due to inactivity.

superherointj commented 2 years ago

@LucaFulchir Can you confirm if issue is solved?

LucaFulchir commented 2 years ago

Not anymore, sorry.
I gave up after a bit more tinkering and other problems; kubernetes also had a bigger overhead than I liked, so I moved to managing rootless podman with a nix config.
This means my raspi4 is in "production" now, so I cannot test much on it.

But jali-clarke is right, that other merge should have fixed this problem, so I'll close this.
Sorry for forgetting about this.

superherointj commented 2 years ago

@LucaFulchir I'm using K3s to run a Kubernetes cluster on NixOS across 8 Raspberry Pis, and it is working great. It works on a single node as well.