akash-network / support

Akash Support and Issue Tracking
Apache License 2.0

k3s production use considerations (and validation) #217

Closed andy108369 closed 4 months ago

andy108369 commented 7 months ago

@chainzero created the k3s method of provider installation, described here https://akashengineers.xyz/provider-build-scripts

Before getting this to the Production use the following points must be considered, addressed/verified to be supported with the k3s K8s cluster deployment method:

Additionally/Ideally

jigar-arc10 commented 6 months ago

Here is what we found so far from our testing.

We will continue testing further and will report new findings.

chainzero commented 6 months ago

@jigar-arc10 - thank you for the additional testing.

Thoughts on some of the points raised above:

Current Akash Provider documentation and install process assumes install is being run as root as stated here:

https://akash.network/docs/providers/build-a-cloud-provider/kubernetes-cluster-for-akash-providers/kubernetes-cluster-for-akash-providers/#step-2---install-ansible

As this is part of pre-existing methodologies - do not view this as an issue - but please let us know if you feel otherwise and/or if it will provoke issues in Praetor use.

Current Akash Provider > Helm install based instructions recommend/assume Ubuntu use as stated here:

https://akash.network/docs/providers/build-a-cloud-provider/kubernetes-cluster-for-akash-providers/kubernetes-cluster-for-akash-providers/#kubernetes-cluster-softwarehardware-requirements-and-recommendations

Based on this being part of the pre-existing standard - do not believe this is an issue but please let us know if you feel otherwise and/or if this may cause issues for Praetor users.

Will look into this issue further. Initial testing of scaling down procedure only tested the ability to scale down K3s nodes. Have not yet tested scaling down with Akash provider and related operators installed. Will test those scenarios ASAP.

jigar-arc10 commented 6 months ago

@chainzero - Thanks for the response.

As this is part of pre-existing methodologies - do not view this as an issue - but please let us know if you feel otherwise and/or if it will provoke issues in Praetor use.

After deep consideration, we agree that root user access should be required as it also helps with GPU driver installation steps.

Based on this being part of the pre-existing standard - do not believe this is an issue but please let us know if you feel otherwise and/or if this may cause issues for Praetor users.

It's a non-issue.

Will look into this issue further. Initial testing of scaling down procedure only tested the ability to scale down K3s nodes. Have not yet tested scaling down with Akash provider and related operators installed. Will test those scenarios ASAP.

After many iterations of testing regarding node removal with updated scripts, the issue about operator-inventory-hardware is gone, and the node was successfully removed.

devalpatel67 commented 4 months ago

Here are the considerations to keep in mind when using k3s instead of k8s.

CNI plugins/calico

CNI plugins/calico: consider an installation scenario where one would want to specify K8s internal networking as well, primarily for performance's sake (for internal K8s services/apps communication, including Rook-Ceph persistent storage, which can be really heavy on traffic; if it is not routed over the internal network, this will lead to significant performance lag, and to a large bill if the provider's traffic is metered)

In the k3s setup, we use the Calico CNI plugin (installed in place of the Flannel CNI that k3s ships by default) to ensure high performance for internal networking. This configuration is essential for optimizing communication between Kubernetes services and applications, especially for high-traffic services like Rook-Ceph, to prevent significant performance lag and avoid metered external traffic costs.

  1. We verify that Calico is installed and running in our k3s cluster.

    root@node1:~# kubectl get pods -n kube-system -l k8s-app=calico-node
    NAME                READY   STATUS    RESTARTS   AGE
    calico-node-plt4k   1/1     Running   0          4h57m
  2. To define an IP pool for internal networking and ensure efficient internal communication, we use the following configuration:

    root@node1:~# kubectl get ippool
    NAME                  AGE
    default-ipv4-ippool   9h
    
    root@node1:~# kubectl describe ippool default-ipv4-ippool
    Name:         default-ipv4-ippool
    Namespace:
    Labels:       <none>
    Annotations:  projectcalico.org/metadata: {"uid":"cf9f2f1f-c77e-463e-8574-d9b6ea72d055","creationTimestamp":"2024-07-16T16:56:14Z"}
    API Version:  crd.projectcalico.org/v1
    Kind:         IPPool
    Metadata:
     Creation Timestamp:  2024-07-16T16:56:14Z
     Generation:          1
     Resource Version:    712
     UID:                 b3def60d-9f8b-46d8-9ff8-42c1de61412a
    Spec:
     Allowed Uses:
       Workload
       Tunnel
     Block Size:     26
     Cidr:           192.168.0.0/16
     Ipip Mode:      Always
     Nat Outgoing:   true
     Node Selector:  all()
     Vxlan Mode:     Never
    Events:           <none>
  3. Define Network Policies (if needed):

    We create network policies to manage traffic flow and ensure internal communication is optimized for performance.

    kubectl apply -f - <<EOF
    apiVersion: projectcalico.org/v3
    kind: NetworkPolicy
    metadata:
     name: allow-rook-ceph
     namespace: rook-ceph
    spec:
     selector: all()
     ingress:
     - action: Allow
       source:
         namespaceSelector: has(role)
         selector: app == 'rook-ceph'
     egress:
     - action: Allow
       destination:
         namespaceSelector: has(role)
         selector: app == 'rook-ceph'
    EOF
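One tuning knob worth considering: the IPPool above uses `Ipip Mode: Always`, which encapsulates all pod-to-pod traffic. When all nodes share the same L2 subnet, `CrossSubnet` mode skips encapsulation for same-subnet traffic and reduces overhead. A sketch of the adjusted pool (same pool as above; verify the field names against your installed Calico version before applying):

```yaml
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 192.168.0.0/16
  blockSize: 26
  ipipMode: CrossSubnet   # IP-in-IP encapsulation only when crossing subnets
  vxlanMode: Never
  natOutgoing: true
  nodeSelector: all()
```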

customize nodefs & imagefs locations

customize nodefs & imagefs locations: similarly to how it's described here

To manage storage effectively, we can customize the locations for nodefs and imagefs in k3s. This involves setting custom data directories and configuring containerd, the container runtime used by k3s.

Suppose we have created a RAID0 array over 2 NVMe drives using the following commands:

root@node1:~# lsblk
NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
loop0          7:0    0 63.9M  1 loop /snap/core20/2318
loop1          7:1    0 25.2M  1 loop /snap/amazon-ssm-agent/7993
loop2          7:2    0   87M  1 loop /snap/lxd/28373
loop3          7:3    0 55.7M  1 loop /snap/core18/2829
loop4          7:4    0 38.8M  1 loop /snap/snapd/21759
nvme0n1      259:0    0   80G  0 disk
├─nvme0n1p1  259:1    0 79.9G  0 part /
├─nvme0n1p14 259:2    0    4M  0 part
└─nvme0n1p15 259:3    0  106M  0 part /boot/efi
nvme1n1      259:4    0  100G  0 disk
nvme2n1      259:5    0  100G  0 disk

root@node1:~# mdadm --create /dev/md0 --level=raid0 --metadata=1.2 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
mdadm: array /dev/md0 started.

root@node1:~# cat /proc/mdstat
Personalities : [raid0]
md0 : active raid0 nvme2n1[1] nvme1n1[0]
      209582080 blocks super 1.2 512k chunks
unused devices: <none>

root@node1:~# mkfs.ext4 /dev/md0
mke2fs 1.46.5 (30-Dec-2021)
Creating filesystem with 52395520 4k blocks and 13099008 inodes
Filesystem UUID: b1ea6725-0d38-42d2-a9c8-3071d8c7c5de
Superblock backups stored on blocks:
  32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
  4096000, 7962624, 11239424, 20480000, 23887872
Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done

root@node1:~# cp -p /etc/fstab /etc/fstab.1

root@node1:~# cat >> /etc/fstab << EOF
UUID="$(blkid /dev/md0 -s UUID -o value)"  /data        ext4   defaults,discard  0 0
EOF

root@node1:~# diff -Nur /etc/fstab.1 /etc/fstab
--- /etc/fstab.1    2024-07-01 15:42:56.210521795 +0000
+++ /etc/fstab  2024-07-17 04:07:18.985153190 +0000
@@ -1,2 +1,3 @@
LABEL=cloudimg-rootfs   /    ext4   discard,errors=remount-ro   0 1
LABEL=UEFI  /boot/efi   vfat    umask=0077  0 1
+UUID="28b606d9-6e43-4a0b-be60-c7cda95b71e4"  /data        ext4   defaults,discard  0 0

root@node1:~# mkdir /data
root@node1:~# mount /data

root@node1:~# df -Ph /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        196G   28K  186G   1% /data

root@node1:~# /usr/share/mdadm/mkconf > /etc/mdadm/mdadm.conf

root@node1:~# cat /etc/mdadm/mdadm.conf | grep -v ^\#
HOMEHOST <system>
MAILADDR root
ARRAY /dev/md/0  metadata=1.2 UUID=1e921d7f:4b06d544:42f0e25f:a252e4e1 name=ip-172-31-47-75:0

root@node1:~# update-initramfs -c -k all
update-initramfs: Generating /boot/initrd.img-6.5.0-1022
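With /data mounted, k3s still needs to be pointed at it. A sketch of the relevant setting (the default data dir is /var/lib/rancher/k3s; --data-dir relocates both the server datastore and the containerd root, i.e. the nodefs and imagefs content managed by k3s, and matches the /data/k3s paths used in the etcdctl examples further down):

```yaml
# /etc/rancher/k3s/config.yaml (read by k3s on startup)
data-dir: /data/k3s
```

After `systemctl restart k3s`, the containerd state should then live under /data/k3s/agent/containerd.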

Consider Etcd Backup & Restore Procedure for k3s

consider etcd backup & restore procedure (kubespray does this automatically each time you run it against your K8s cluster)

The way k3s is backed up and restored depends on the type of datastore being used. Below are the procedures for backing up and restoring k3s with the SQLite and embedded etcd datastores.

Backup and Restore with SQLite
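When k3s runs with the default SQLite datastore (single server, no --cluster-init), there is no snapshot command; a cold copy of the datastore directory is the usual approach. A minimal sketch, assuming the default data dir /var/lib/rancher/k3s and that k3s has been stopped (systemctl stop k3s) before copying:

```shell
# Sketch: cold backup of the k3s SQLite datastore.
# Copies server/db (the SQLite state) and server/token
# (needed so a restored server keeps the same join token).
backup_k3s_sqlite() {
  data_dir="$1"  # e.g. /var/lib/rancher/k3s
  out_tar="$2"   # e.g. /root/k3s-sqlite-backup.tgz
  tar czf "$out_tar" -C "$data_dir/server" db token
}
```

Restoring is the reverse: stop k3s, unpack the tarball back into `<data-dir>/server`, and start k3s again.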

Backup and Restore with Embedded etcd Datastore

K3s offers a robust mechanism for backing up and restoring the embedded etcd datastore.

For embedded etcd, we can take on-demand backups with the k3s etcd-snapshot save command. Note there is no etcd-snapshot restore subcommand; restoring is done by starting the server with the --cluster-reset flags.

To restore from a snapshot, follow these steps:

  1. Stop the k3s server:

    systemctl stop k3s
  2. Reset the cluster to the snapshot's state:

    k3s server --cluster-reset --cluster-reset-restore-path=/path/to/backup/snapshot-<timestamp>

    This restores the datastore and exits once the reset is complete.
  3. Start the k3s server:

    systemctl start k3s
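Beyond manual snapshots, k3s can also snapshot the embedded etcd on a schedule via its server options, settable in the k3s config file (the values below are illustrative):

```yaml
# /etc/rancher/k3s/config.yaml
etcd-snapshot-schedule-cron: "0 */6 * * *"   # snapshot every 6 hours
etcd-snapshot-retention: 10                  # keep the last 10 snapshots
etcd-snapshot-dir: /data/k3s/server/db/snapshots
```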

etcd performance

consider etcd performance - AFAIK, k3s uses sqlite3 DB for the etcd; so there should be some quick perf test for it such as etcdctl check perf we have here

root@node1:~# export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS="https://127.0.0.1:2379"
export ETCDCTL_CACERT="/data/k3s/server/tls/etcd/server-ca.crt"
export ETCDCTL_CERT="/data/k3s/server/tls/etcd/server-client.crt"
export ETCDCTL_KEY="/data/k3s/server/tls/etcd/server-client.key"

root@node1:~# etcdctl -w table member list
+------------------+---------+-------------------------+--------------------------+--------------------------+
|        ID        | STATUS  |          NAME           |        PEER ADDRS        |       CLIENT ADDRS       |
+------------------+---------+-------------------------+--------------------------+--------------------------+
| 34c66c9fb119f95a | started | ip-172-31-39-9-c9a36ec6 | https://172.31.39.9:2380 | https://172.31.39.9:2379 |
+------------------+---------+-------------------------+--------------------------+--------------------------+

root@ip-172-31-39-9:~# etcdctl endpoint health --cluster -w table
+--------------------------+--------+------------+-------+
|         ENDPOINT         | HEALTH |    TOOK    | ERROR |
+--------------------------+--------+------------+-------+
| https://172.31.39.9:2379 |   true | 1.858019ms |       |
+--------------------------+--------+------------+-------+

root@node1:~# etcdctl endpoint status --cluster -w table
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://172.31.39.9:2379 | 34c66c9fb119f95a |  3.5.13 |  4.4 MB |      true |         2 |      17248 |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+

root@node1:~# etcdctl check perf
PASS: Throughput is 151 writes/s
PASS: Slowest request took 0.021458s
PASS: Stddev is 0.001235s
PASS

custom K8s configs for the nodefs & imagefs thresholds

custom K8s configs for the nodefs & imagefs thresholds (ref)

To customize disk usage thresholds for nodefs and imagefs, we can modify the kubelet configuration. The kubelet has parameters that allow us to specify eviction thresholds based on filesystem usage.

Example Configuration

Here’s an example of how to configure custom thresholds in the kubelet configuration file:

  1. Edit the Kubelet Configuration File:

    Note that k3s runs an embedded kubelet and does not read /var/lib/kubelet/config.yaml on its own; the file must be passed to k3s via --kubelet-arg=config=/var/lib/kubelet/config.yaml (the same mechanism used for the log-rotation settings later in this thread). Open the file in your preferred text editor and add the custom thresholds:

    sudo vi /var/lib/kubelet/config.yaml

    Add the configuration as shown:

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    evictionHard:
     nodefs.available: "10%"
     imagefs.available: "15%"
     nodefs.inodesFree: "5%"
     imagefs.inodesFree: "10%"
  2. Restart the k3s service:

    After modifying the configuration file, restart the k3s service to apply the changes:

    sudo systemctl restart k3s
  3. Monitor Node Conditions:

    Use kubectl to monitor the node conditions and ensure that the eviction thresholds are being respected:

    root@node1:~# kubectl describe node
    Name:               ip-172-31-47-75
    Roles:              control-plane,etcd,master
    Labels:             akash.network=true
    ....
    Conditions:
     Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
     ----                 ------  -----------------                 ------------------                ------                       -------
     NetworkUnavailable   False   Tue, 16 Jul 2024 16:56:14 +0000   Tue, 16 Jul 2024 16:56:14 +0000   CalicoIsUp                   Calico is running on this node
     EtcdIsVoter          True    Wed, 17 Jul 2024 03:35:23 +0000   Tue, 16 Jul 2024 16:55:19 +0000   MemberNotLearner             Node is a voting member of the etcd cluster
     MemoryPressure       False   Wed, 17 Jul 2024 03:35:58 +0000   Tue, 16 Jul 2024 16:55:04 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
     DiskPressure         False   Wed, 17 Jul 2024 03:35:58 +0000   Tue, 16 Jul 2024 16:55:04 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
     PIDPressure          False   Wed, 17 Jul 2024 03:35:58 +0000   Tue, 16 Jul 2024 16:55:04 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
     Ready                True    Wed, 17 Jul 2024 03:35:58 +0000   Tue, 16 Jul 2024 21:39:21 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
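Alternatively, the same thresholds can be passed without a separate kubelet config file, using the kubelet-arg passthrough in the k3s config file (a sketch; the eviction-hard value uses the standard kubelet flag syntax):

```yaml
# /etc/rancher/k3s/config.yaml
kubelet-arg:
  - "eviction-hard=nodefs.available<10%,imagefs.available<15%,nodefs.inodesFree<5%,imagefs.inodesFree<10%"
```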

custom K8s configs

custom K8s configs for the max. number of container log files that can be present for a container kubelet_logfiles_max_nr, as well as the max. size of the container log file before it is rotated kubelet_logfiles_max_size (ref)

We can manage custom Kubernetes configurations for the maximum number of container log files and the maximum size of a container log file before it is rotated by configuring the kubelet parameters. These settings help control the disk usage on nodes by limiting the number of log files and their sizes.

Customizing Kubelet Configuration in k3s

To set kubelet_logfiles_max_nr (maximum number of log files) and kubelet_logfiles_max_size (maximum size of log files), we follow these steps:

  1. Create a Kubelet Configuration File:

    Create a dedicated configuration file for the kubelet if it doesn't already exist. (Use a separate file such as kubelet.config: /etc/rancher/k3s/config.yaml is parsed by k3s as its own configuration file and cannot hold a KubeletConfiguration.)

    sudo mkdir -p /etc/rancher/k3s
    sudo touch /etc/rancher/k3s/kubelet.config
  2. Edit the Kubelet Configuration File:

    Add the following configuration to set the maximum number of log files and the maximum size of log files. (Note the kubelet field names: containerLogMaxFiles and containerLogMaxSize.)

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    containerLogMaxFiles: 5     # kubelet_logfiles_max_nr
    containerLogMaxSize: "10Mi" # kubelet_logfiles_max_size

    This configuration sets the maximum number of log files per container to 5 and the maximum size of each log file to 10MiB.

  3. Configure k3s to Use the Custom Kubelet Configuration:

    Modify the k3s service file to point to the custom kubelet configuration file. This file is typically located at /etc/systemd/system/k3s.service or /etc/systemd/system/k3s-agent.service for k3s agents.

    Edit the service file to include the custom kubelet configuration.

    sudo vi /etc/systemd/system/k3s.service

    Add the following to the ExecStart line to use the custom kubelet configuration:

    ExecStart=/usr/local/bin/k3s server --kubelet-arg=config=/etc/rancher/k3s/kubelet.config

    For k3s agents, it would look like:

    ExecStart=/usr/local/bin/k3s agent --kubelet-arg=config=/etc/rancher/k3s/kubelet.config
  4. Reload and Restart the k3s Service:

    Reload the systemd configuration and restart the k3s service to apply the changes.

    sudo systemctl daemon-reload
    sudo systemctl restart k3s
  5. Verify the Configuration:

    After restarting the k3s service, verify that the kubelet is using the new configuration.

    root@node1:~# kubectl describe node ip-172-31-47-75
    Name:               ip-172-31-47-75
    Roles:              control-plane,etcd,master
    Labels:             akash.network=true
                       beta.kubernetes.io/arch=amd64
    ...
    Annotations:        alpha.kubernetes.io/provided-node-ip: 172.31.47.75
                       k3s.io/node-args:
                       ["server","--apiVersion","kubelet.config.k8s.io/v1beta1","--kind","KubeletConfiguration","--maxContainerLogFiles","5","--containerLogMaxSize","10Mi"]
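The settings in step 2 bound the worst-case on-disk log footprint per container; a quick sanity check of that bound:

```shell
# Worst-case rotated-log footprint per container under the settings above:
# containerLogMaxFiles x containerLogMaxSize.
files=5      # containerLogMaxFiles
size_mib=10  # containerLogMaxSize, in MiB
total=$((files * size_mib))
echo "up to ${total} MiB of log files per container"
```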
                       projectcalico.org/IPv4Address: 172.31.47.75/20
andy108369 commented 4 months ago

Great job @devalpatel67 @jigar-arc10 and @chainzero !

andy108369 commented 4 months ago

FWIW, k3s upgrades seem to be straightforward: https://docs.k3s.io/upgrades/manual