akash-network / support

Akash Support and Issue Tracking
Apache License 2.0

k3s production use considerations (and validation) #217

Closed andy108369 closed 4 months ago

andy108369 commented 7 months ago

@chainzero created the k3s method of provider installation, described here https://akashengineers.xyz/provider-build-scripts

Before getting this to the Production use the following points must be considered, addressed/verified to be supported with the k3s K8s cluster deployment method:

Additionally/Ideally

jigar-arc10 commented 6 months ago

Here is what we found so far from our testing.

We will continue testing further and will report new findings.

chainzero commented 6 months ago

@jigar-arc10 - thank you for the additional testing.

Thoughts on some of the points raised above:

Current Akash Provider documentation and install process assumes install is being run as root as stated here:

https://akash.network/docs/providers/build-a-cloud-provider/kubernetes-cluster-for-akash-providers/kubernetes-cluster-for-akash-providers/#step-2---install-ansible

As this is part of pre-existing methodologies - do not view this as an issue - but please let us know if you feel otherwise and/or if it will provoke issues in Praetor use.

Current Akash Provider > Helm install based instructions recommend/assume Ubuntu use as stated here:

https://akash.network/docs/providers/build-a-cloud-provider/kubernetes-cluster-for-akash-providers/kubernetes-cluster-for-akash-providers/#kubernetes-cluster-softwarehardware-requirements-and-recommendations

Based on this being part of the pre-existing standard - do not believe this is an issue but please let us know if you feel otherwise and/or if this may cause issues for Praetor users.

Will look into this issue further. Initial testing of scaling down procedure only tested the ability to scale down K3s nodes. Have not yet tested scaling down with Akash provider and related operators installed. Will test those scenarios ASAP.

jigar-arc10 commented 6 months ago

@chainzero - Thanks for the response.

As this is part of pre-existing methodologies - do not view this as an issue - but please let us know if you feel otherwise and/or if it will provoke issues in Praetor use.

After deep consideration, we agree that root user access should be required as it also helps with GPU driver installation steps.

Based on this being part of the pre-existing standard - do not believe this is an issue but please let us know if you feel otherwise and/or if this may cause issues for Praetor users.

It's a non-issue.

Will look into this issue further. Initial testing of scaling down procedure only tested the ability to scale down K3s nodes. Have not yet tested scaling down with Akash provider and related operators installed. Will test those scenarios ASAP.

After many iterations of testing regarding node removal with updated scripts, the issue about operator-inventory-hardware is gone, and the node was successfully removed.

devalpatel67 commented 4 months ago

Here are the considerations to keep in mind when using k3s instead of k8s.

CNI plugins/calico

CNI plugins/calico: consider an installation scenario where one would want to specify K8s internal networking as well, primarily for performance's sake (for internal K8s services/apps communication, including Rook-Ceph persistent storage, which can be really heavy on traffic; if it is not routed over the internal network, this will lead to significant performance lag, and to a large bill if the provider's traffic is metered)

In the k3s setup, we use the Calico CNI plugin (installed in place of the Flannel CNI that k3s ships by default) to ensure high performance for internal networking. This configuration is essential for optimizing communication between Kubernetes services and applications, especially for high-traffic services like Rook-Ceph, to prevent significant performance lag and avoid metered external traffic costs.

  1. We verify that Calico is installed and running in our k3s cluster.

    root@node1:~# kubectl get pods -n kube-system -l k8s-app=calico-node
    NAME                READY   STATUS    RESTARTS   AGE
    calico-node-plt4k   1/1     Running   0          4h57m
  2. To define an IP pool for internal networking and ensure efficient internal communication, we use the following configuration:

    root@node1:~# kubectl get ippool
    NAME                  AGE
    default-ipv4-ippool   9h
    
    root@node1:~# kubectl describe ippool default-ipv4-ippool
    Name:         default-ipv4-ippool
    Namespace:
    Labels:       <none>
    Annotations:  projectcalico.org/metadata: {"uid":"cf9f2f1f-c77e-463e-8574-d9b6ea72d055","creationTimestamp":"2024-07-16T16:56:14Z"}
    API Version:  crd.projectcalico.org/v1
    Kind:         IPPool
    Metadata:
     Creation Timestamp:  2024-07-16T16:56:14Z
     Generation:          1
     Resource Version:    712
     UID:                 b3def60d-9f8b-46d8-9ff8-42c1de61412a
    Spec:
     Allowed Uses:
       Workload
       Tunnel
     Block Size:     26
     Cidr:           192.168.0.0/16
     Ipip Mode:      Always
     Nat Outgoing:   true
     Node Selector:  all()
     Vxlan Mode:     Never
    Events:           <none>
  3. Define Network Policies (if needed):

    We create network policies to manage traffic flow and ensure internal communication is optimized for performance.

    kubectl apply -f - <<EOF
    apiVersion: projectcalico.org/v3
    kind: NetworkPolicy
    metadata:
     name: allow-rook-ceph
     namespace: rook-ceph
    spec:
     selector: all()
     ingress:
     - action: Allow
       source:
         namespaceSelector: has(role)
         selector: app == 'rook-ceph'
     egress:
     - action: Allow
       destination:
         namespaceSelector: has(role)
         selector: app == 'rook-ceph'
    EOF
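One tuning knob worth considering: the IPPool above uses `Ipip Mode: Always`, which encapsulates all pod-to-pod traffic. When all nodes share the same L2 subnet, `CrossSubnet` mode skips encapsulation for same-subnet traffic and reduces overhead. A sketch of the adjusted pool (same pool as above; verify the field names against your installed Calico version before applying):

```yaml
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 192.168.0.0/16
  blockSize: 26
  ipipMode: CrossSubnet   # IP-in-IP encapsulation only when crossing subnets
  vxlanMode: Never
  natOutgoing: true
  nodeSelector: all()
```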

customize nodefs & imagefs locations

customize nodefs & imagefs locations: similarly to how it's described here

To manage storage effectively, we can customize the locations for nodefs and imagefs in k3s. This involves setting custom data directories and configuring containerd, the container runtime used by k3s.

Suppose we have created a RAID0 array over 2 NVMe drives using the following commands:

root@node1:~# lsblk
NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
loop0          7:0    0 63.9M  1 loop /snap/core20/2318
loop1          7:1    0 25.2M  1 loop /snap/amazon-ssm-agent/7993
loop2          7:2    0   87M  1 loop /snap/lxd/28373
loop3          7:3    0 55.7M  1 loop /snap/core18/2829
loop4          7:4    0 38.8M  1 loop /snap/snapd/21759
nvme0n1      259:0    0   80G  0 disk
├─nvme0n1p1  259:1    0 79.9G  0 part /
├─nvme0n1p14 259:2    0    4M  0 part
└─nvme0n1p15 259:3    0  106M  0 part /boot/efi
nvme1n1      259:4    0  100G  0 disk
nvme2n1      259:5    0  100G  0 disk

root@node1:~# mdadm --create /dev/md0 --level=raid0 --metadata=1.2 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
mdadm: array /dev/md0 started.

root@node1:~# cat /proc/mdstat
Personalities : [raid0]
md0 : active raid0 nvme2n1[1] nvme1n1[0]
      209582080 blocks super 1.2 512k chunks
unused devices: <none>

root@node1:~# mkfs.ext4 /dev/md0
mke2fs 1.46.5 (30-Dec-2021)
Creating filesystem with 52395520 4k blocks and 13099008 inodes
Filesystem UUID: b1ea6725-0d38-42d2-a9c8-3071d8c7c5de
Superblock backups stored on blocks:
  32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
  4096000, 7962624, 11239424, 20480000, 23887872
Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done

root@node1:~# cp -p /etc/fstab /etc/fstab.1

root@node1:~# cat >> /etc/fstab << EOF
UUID="$(blkid /dev/md0 -s UUID -o value)"  /data        ext4   defaults,discard  0 0
EOF

root@node1:~# diff -Nur /etc/fstab.1 /etc/fstab
--- /etc/fstab.1    2024-07-01 15:42:56.210521795 +0000
+++ /etc/fstab  2024-07-17 04:07:18.985153190 +0000
@@ -1,2 +1,3 @@
LABEL=cloudimg-rootfs   /    ext4   discard,errors=remount-ro   0 1
LABEL=UEFI  /boot/efi   vfat    umask=0077  0 1
+UUID="28b606d9-6e43-4a0b-be60-c7cda95b71e4"  /data        ext4   defaults,discard  0 0

root@node1:~# mkdir /data
root@node1:~# mount /data

root@node1:~# df -Ph /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        196G   28K  186G   1% /data

root@node1:~# /usr/share/mdadm/mkconf > /etc/mdadm/mdadm.conf

root@node1:~# cat /etc/mdadm/mdadm.conf | grep -v ^\#
HOMEHOST <system>
MAILADDR root
ARRAY /dev/md/0  metadata=1.2 UUID=1e921d7f:4b06d544:42f0e25f:a252e4e1 name=ip-172-31-47-75:0

root@node1:~# update-initramfs -c -k all
update-initramfs: Generating /boot/initrd.img-6.5.0-1022
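With /data mounted, k3s still needs to be pointed at it. A sketch of the relevant setting (the default data dir is /var/lib/rancher/k3s; --data-dir relocates both the server datastore and the containerd root, i.e. the nodefs and imagefs content managed by k3s, and matches the /data/k3s paths used in the etcdctl examples further down):

```yaml
# /etc/rancher/k3s/config.yaml (read by k3s on startup)
data-dir: /data/k3s
```

After `systemctl restart k3s`, the containerd state should then live under /data/k3s/agent/containerd.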

Consider Etcd Backup & Restore Procedure for k3s

consider etcd backup & restore procedure (kubespray does this automatically each time you run it against your K8s cluster)

The way k3s is backed up and restored depends on the type of datastore being used. Below are the procedures for backing up and restoring k3s with the SQLite and embedded etcd datastores.

Backup and Restore with SQLite
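When k3s runs with the default SQLite datastore (single server, no --cluster-init), there is no snapshot command; a cold copy of the datastore directory is the usual approach. A minimal sketch, assuming the default data dir /var/lib/rancher/k3s and that k3s has been stopped (systemctl stop k3s) before copying:

```shell
# Sketch: cold backup of the k3s SQLite datastore.
# Copies server/db (the SQLite state) and server/token
# (needed so a restored server keeps the same join token).
backup_k3s_sqlite() {
  data_dir="$1"  # e.g. /var/lib/rancher/k3s
  out_tar="$2"   # e.g. /root/k3s-sqlite-backup.tgz
  tar czf "$out_tar" -C "$data_dir/server" db token
}
```

Restoring is the reverse: stop k3s, unpack the tarball back into `<data-dir>/server`, and start k3s again.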

Backup and Restore with Embedded etcd Datastore

K3s offers a robust mechanism for backing up and restoring the embedded etcd datastore.

For embedded etcd, we can take on-demand backups with the k3s etcd-snapshot save command. Note there is no etcd-snapshot restore subcommand; restoring is done by starting the server with the --cluster-reset flags.

To restore from a snapshot, follow these steps:

  1. Stop the k3s server:

    systemctl stop k3s
  2. Reset the cluster to the snapshot's state:

    k3s server --cluster-reset --cluster-reset-restore-path=/path/to/backup/snapshot-<timestamp>

    This restores the datastore and exits once the reset is complete.
  3. Start the k3s server:

    systemctl start k3s
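Beyond manual snapshots, k3s can also snapshot the embedded etcd on a schedule via its server options, settable in the k3s config file (the values below are illustrative):

```yaml
# /etc/rancher/k3s/config.yaml
etcd-snapshot-schedule-cron: "0 */6 * * *"   # snapshot every 6 hours
etcd-snapshot-retention: 10                  # keep the last 10 snapshots
etcd-snapshot-dir: /data/k3s/server/db/snapshots
```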

etcd performance

consider etcd performance - AFAIK, k3s uses sqlite3 DB for the etcd; so there should be some quick perf test for it such as etcdctl check perf we have here

root@node1:~# export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS="https://127.0.0.1:2379"
export ETCDCTL_CACERT="/data/k3s/server/tls/etcd/server-ca.crt"
export ETCDCTL_CERT="/data/k3s/server/tls/etcd/server-client.crt"
export ETCDCTL_KEY="/data/k3s/server/tls/etcd/server-client.key"

root@node1:~# etcdctl -w table member list
+------------------+---------+-------------------------+--------------------------+--------------------------+
|        ID        | STATUS  |          NAME           |        PEER ADDRS        |       CLIENT ADDRS       |
+------------------+---------+-------------------------+--------------------------+--------------------------+
| 34c66c9fb119f95a | started | ip-172-31-39-9-c9a36ec6 | https://172.31.39.9:2380 | https://172.31.39.9:2379 |
+------------------+---------+-------------------------+--------------------------+--------------------------+

root@ip-172-31-39-9:~# etcdctl endpoint health --cluster -w table
+--------------------------+--------+------------+-------+
|         ENDPOINT         | HEALTH |    TOOK    | ERROR |
+--------------------------+--------+------------+-------+
| https://172.31.39.9:2379 |   true | 1.858019ms |       |
+--------------------------+--------+------------+-------+

root@node1:~# etcdctl endpoint status --cluster -w table
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://172.31.39.9:2379 | 34c66c9fb119f95a |  3.5.13 |  4.4 MB |      true |         2 |      17248 |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+

root@node1:~# etcdctl check perf
PASS: Throughput is 151 writes/s
PASS: Slowest request took 0.021458s
PASS: Stddev is 0.001235s
PASS

custom K8s configs for the nodefs & imagefs thresholds

custom K8s configs for the nodefs & imagefs thresholds (ref)

To customize disk usage thresholds for nodefs and imagefs, we can modify the kubelet configuration. The kubelet has parameters that allow us to specify eviction thresholds based on filesystem usage.

Example Configuration

Here’s an example of how to configure custom thresholds in the kubelet configuration file:

  1. Edit the Kubelet Configuration File:

    Note that k3s runs an embedded kubelet and does not read /var/lib/kubelet/config.yaml on its own; the file must be passed to k3s via --kubelet-arg=config=/var/lib/kubelet/config.yaml (the same mechanism used for the log-rotation settings later in this thread). Open the file in your preferred text editor and add the custom thresholds:

    sudo vi /var/lib/kubelet/config.yaml

    Add the configuration as shown:

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    evictionHard:
     nodefs.available: "10%"
     imagefs.available: "15%"
     nodefs.inodesFree: "5%"
     imagefs.inodesFree: "10%"
  2. Restart the k3s service:

    After modifying the configuration file, restart the k3s service to apply the changes:

    sudo systemctl restart k3s
  3. Monitor Node Conditions:

    Use kubectl to monitor the node conditions and ensure that the eviction thresholds are being respected:

    root@node1:~# kubectl describe node
    Name:               ip-172-31-47-75
    Roles:              control-plane,etcd,master
    Labels:             akash.network=true
    ....
    Conditions:
     Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
     ----                 ------  -----------------                 ------------------                ------                       -------
     NetworkUnavailable   False   Tue, 16 Jul 2024 16:56:14 +0000   Tue, 16 Jul 2024 16:56:14 +0000   CalicoIsUp                   Calico is running on this node
     EtcdIsVoter          True    Wed, 17 Jul 2024 03:35:23 +0000   Tue, 16 Jul 2024 16:55:19 +0000   MemberNotLearner             Node is a voting member of the etcd cluster
     MemoryPressure       False   Wed, 17 Jul 2024 03:35:58 +0000   Tue, 16 Jul 2024 16:55:04 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
     DiskPressure         False   Wed, 17 Jul 2024 03:35:58 +0000   Tue, 16 Jul 2024 16:55:04 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
     PIDPressure          False   Wed, 17 Jul 2024 03:35:58 +0000   Tue, 16 Jul 2024 16:55:04 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
     Ready                True    Wed, 17 Jul 2024 03:35:58 +0000   Tue, 16 Jul 2024 21:39:21 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
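Alternatively, the same thresholds can be passed without a separate kubelet config file, using the kubelet-arg passthrough in the k3s config file (a sketch; the eviction-hard value uses the standard kubelet flag syntax):

```yaml
# /etc/rancher/k3s/config.yaml
kubelet-arg:
  - "eviction-hard=nodefs.available<10%,imagefs.available<15%,nodefs.inodesFree<5%,imagefs.inodesFree<10%"
```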

custom K8s configs

custom K8s configs for the max. number of container log files that can be present for a container kubelet_logfiles_max_nr, as well as the max. size of the container log file before it is rotated kubelet_logfiles_max_size (ref)

We can manage custom Kubernetes configurations for the maximum number of container log files and the maximum size of a container log file before it is rotated by configuring the kubelet parameters. These settings help control the disk usage on nodes by limiting the number of log files and their sizes.

Customizing Kubelet Configuration in k3s

To set kubelet_logfiles_max_nr (maximum number of log files) and kubelet_logfiles_max_size (maximum size of log files), we follow these steps:

  1. Create a Kubelet Configuration File:

    Create a dedicated configuration file for the kubelet if it doesn't already exist. (Use a separate file such as kubelet.config: /etc/rancher/k3s/config.yaml is parsed by k3s as its own configuration file and cannot hold a KubeletConfiguration.)

    sudo mkdir -p /etc/rancher/k3s
    sudo touch /etc/rancher/k3s/kubelet.config
  2. Edit the Kubelet Configuration File:

    Add the following configuration to set the maximum number of log files and the maximum size of log files. (Note the kubelet field names: containerLogMaxFiles and containerLogMaxSize.)

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    containerLogMaxFiles: 5     # kubelet_logfiles_max_nr
    containerLogMaxSize: "10Mi" # kubelet_logfiles_max_size

    This configuration sets the maximum number of log files per container to 5 and the maximum size of each log file to 10MiB.

  3. Configure k3s to Use the Custom Kubelet Configuration:

    Modify the k3s service file to point to the custom kubelet configuration file. This file is typically located at /etc/systemd/system/k3s.service or /etc/systemd/system/k3s-agent.service for k3s agents.

    Edit the service file to include the custom kubelet configuration.

    sudo vi /etc/systemd/system/k3s.service

    Add the following to the ExecStart line to use the custom kubelet configuration:

    ExecStart=/usr/local/bin/k3s server --kubelet-arg=config=/etc/rancher/k3s/kubelet.config

    For k3s agents, it would look like:

    ExecStart=/usr/local/bin/k3s agent --kubelet-arg=config=/etc/rancher/k3s/kubelet.config
  4. Reload and Restart the k3s Service:

    Reload the systemd configuration and restart the k3s service to apply the changes.

    sudo systemctl daemon-reload
    sudo systemctl restart k3s
  5. Verify the Configuration:

    After restarting the k3s service, verify that the kubelet is using the new configuration.

    root@node1:~# kubectl describe node ip-172-31-47-75
    Name:               ip-172-31-47-75
    Roles:              control-plane,etcd,master
    Labels:             akash.network=true
                       beta.kubernetes.io/arch=amd64
    ...
    Annotations:        alpha.kubernetes.io/provided-node-ip: 172.31.47.75
                       k3s.io/node-args:
                       ["server","--apiVersion","kubelet.config.k8s.io/v1beta1","--kind","KubeletConfiguration","--maxContainerLogFiles","5","--containerLogMaxSize","10Mi"]
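The settings in step 2 bound the worst-case on-disk log footprint per container; a quick sanity check of that bound:

```shell
# Worst-case rotated-log footprint per container under the settings above:
# containerLogMaxFiles x containerLogMaxSize.
files=5      # containerLogMaxFiles
size_mib=10  # containerLogMaxSize, in MiB
total=$((files * size_mib))
echo "up to ${total} MiB of log files per container"
```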
                       projectcalico.org/IPv4Address: 172.31.47.75/20
andy108369 commented 4 months ago

Great job @devalpatel67 @jigar-arc10 and @chainzero !

andy108369 commented 4 months ago

FWIW, k3s upgrades seem to be straightforward: https://docs.k3s.io/upgrades/manual