changtimwu commented 5 years ago

Ceph better works on a 10Gb LAN. We recommend a network bandwidth of at least 10 GbE or more, which is used exclusively for Ceph. A meshed network setup 2 is also an option if there are no 10 GbE switches available. The volume of traffic, especially during recovery, will interfere with other services on the same network and may even break the Proxmox VE cluster stack. Further, estimate your bandwidth needs. While one HDD might not saturate a 1 Gb link, multiple HDD OSDs per node can, and modern NVMe SSDs will even saturate 10 Gbps of bandwidth quickly. Deploying a network capable of even more bandwith will ensure that it isn’t your bottleneck and won’t be anytime soon, 25, 40 or even 100 GBps are possible.

changtimwu commented 5 years ago

cephfs is running on top of RADOS.

Ceph provides also a filesystem running on top of the same object storage as RADOS block devices do. A Metadata Server (MDS) is used to map the RADOS backed objects to files and directories, allowing to provide a POSIX-compliant replicated filesystem. This allows one to have a clustered highly available shared filesystem in an easy way if ceph is already used. Its Metadata Servers guarantee that files get balanced out over the whole Ceph cluster, this way even high load will not overload a single host, which can be an issue with traditional shared filesystem approaches, like NFS, for example

changtimwu commented 5 years ago

OSD construction

Bluestore vs Filestore

Bluestore is rawdevice(partition). Filestore is file, which a filesystem is in the middle.

https://docs.ceph.com/docs/master/rados/configuration/storage-devices/

Bluestore device

creating OSD on a partition is not recommended by PVE. https://forum.proxmox.com/threads/how-can-i-create-osd-on-partition.37667/

changtimwu commented 5 years ago

root@pv1:/var/lib/vz/images# rados -p myshpool bench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pv1_52253
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        22         6   23.9999        24    0.882137    0.528877
    2      16        44        28   55.9977        88     0.59794    0.782572
    3      16        62        46   61.3299        72    0.807239    0.891528
    4      16        80        64    63.996        72    0.642039    0.883054
    5      16        96        80   63.9957        64    0.784858    0.890982
    6      16       117       101   67.3287        84    0.439958    0.894166
    7      16       133       117   66.8524        64    0.900278    0.886985
    8      16       150       134   66.9952        68    0.810622    0.888864
    9      16       167       151   67.1063        68    0.516461    0.901453
   10      16       186       170    67.995        76     1.06456    0.902008
Total time run:         10.6124
Total writes made:      187
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     70.4838
Stddev Bandwidth:       17.3845
Max bandwidth (MB/sec): 88
Min bandwidth (MB/sec): 24
Average IOPS:           17
Stddev IOPS:            4.34613
Max IOPS:               22
Min IOPS:               6
Average Latency(s):     0.90435
Stddev Latency(s):      0.279642
Max latency(s):         1.63962
Min latency(s):         0.222066

changtimwu commented 5 years ago

root@pv1:/var/lib/vz/images# rados -p myshpool bench 10 seq               
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        85        69   275.953       276    0.124554    0.148886
    2      16       133       117   233.968       192    0.011733    0.216671
    3      16       187       171   227.973       216    0.885628    0.245782
Total time run:       3.50539
Total reads made:     187
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   213.385
Average IOPS:         53
Stddev IOPS:          10.8167
Max IOPS:             69
Min IOPS:             48
Average Latency(s):   0.296811
Max latency(s):       1.33063
Min latency(s):       0.0106387

changtimwu commented 5 years ago

changtimwu commented 5 years ago

We setup another 10G NIC to have a separate cluster network.

auto enp1s0f1
iface enp1s0f1 inet static
        address 10.0.0.172
        netmask 255.255.255.0

then

ifup enp1s0f1

on any node, edit and write corosync.conf in the following way

cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf

see config applying

systemctl status corosync
journalctl -b -u corosync

check status

root@pv1:/var/lib/vz/template/iso# pvecm status
Quorum information
------------------
Date:             Tue Sep 24 05:41:00 2019
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1/92
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.0.171 (local)
0x00000002          1 10.0.0.173
0x00000003          1 10.0.0.172

changtimwu commented 5 years ago

https://docs.ceph.com/docs/master/architecture/?highlight=crush#ceph-protocol

Crush

__Storage cluster clients and each Ceph OSD Daemon _use the CRUSH algorithm to efficiently compute information about data location, instead of having to depend on a central lookup table._ Ceph’s high-level features include providing a native interface to the Ceph Storage Cluster via librados, and a number of service interfaces built on top of librados.

that's why Ceph doesn't require explicit master.

RBD Direct pass-through to VM

In virtual machine scenarios, people typically deploy a Ceph Block Device with the rbd network storage driver in QEMU/KVM, where the host machine uses librbd to provide a block device service to the guest. Many cloud computing stacks use libvirt to integrate with hypervisors. You can use thin-provisioned Ceph Block Devices with QEMU and libvirt to support OpenStack and CloudStack among other solutions.

changtimwu commented 5 years ago

check if a drive is occupied by LVM

ls  /sys/block/sdb/holders/

or if pveceph createosd always complain sdx is in use and found no holders of it just use fdisk to remove all its existed disks

changtimwu commented 5 years ago

typical migration log

2019-09-30 08:30:53 starting migration of VM 100 to node 'pv1' (172.17.34.171)
2019-09-30 08:30:53 copying disk images
2019-09-30 08:30:53 starting VM 100 on remote node 'pv1'
2019-09-30 08:30:54 start remote tunnel
2019-09-30 08:30:55 ssh tunnel ver 1
2019-09-30 08:30:55 starting online/live migration on unix:/run/qemu-server/100.migrate
2019-09-30 08:30:55 migrate_set_speed: 8589934592
2019-09-30 08:30:55 migrate_set_downtime: 0.1
2019-09-30 08:30:55 set migration_caps
2019-09-30 08:30:55 set cachesize: 67108864
2019-09-30 08:30:55 start migrate command to unix:/run/qemu-server/100.migrate
2019-09-30 08:30:56 migration status: active (transferred 119476085, remaining 427683840), total 554508288)
2019-09-30 08:30:56 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2019-09-30 08:30:57 migration status: active (transferred 237195078, remaining 308461568), total 554508288)
2019-09-30 08:30:57 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2019-09-30 08:30:58 migration status: active (transferred 354827470, remaining 185790464), total 554508288)
2019-09-30 08:30:58 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2019-09-30 08:30:59 migration status: active (transferred 472378098, remaining 63057920), total 554508288)
2019-09-30 08:30:59 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2019-09-30 08:30:59 migration status: active (transferred 484432525, remaining 50589696), total 554508288)
2019-09-30 08:30:59 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2019-09-30 08:30:59 migration status: active (transferred 496438289, remaining 37904384), total 554508288)
2019-09-30 08:30:59 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2019-09-30 08:30:59 migration status: active (transferred 508409781, remaining 25907200), total 554508288)
2019-09-30 08:30:59 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2019-09-30 08:30:59 migration speed: 128.00 MB/s - downtime 84 ms
2019-09-30 08:30:59 migration status: completed
2019-09-30 08:31:02 migration finished successfully (duration 00:00:09)
TASK OK

changtimwu commented 5 years ago

provide object storage https://pve.proxmox.com/wiki/User:Grin/Ceph_Object_Gateway

changtimwu commented 5 years ago

typical CT creation log

/dev/rbd0
mke2fs 1.44.5 (15-Dec-2018)
Discarding device blocks:    4096/83886085771264/8388608               done                            
Creating filesystem with 8388608 4k blocks and 2097152 inodes
Filesystem UUID: cda44579-b69b-420b-8cf0-a2405e0f8b15
Superblock backups stored on blocks: 
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
    4096000, 7962624

Allocating group tables:   0/256       done                            
Writing inode tables:   0/256       done                            
Creating journal (65536 blocks): done
Multiple mount protection is enabled with update interval 5 seconds.
Writing superblocks and filesystem accounting information:   0/256       done

extracting archive '/mnt/pve/cephfs/template/cache/ubuntu-19.04-standard_19.04-1_amd64.tar.gz'
Total bytes read: 661258240 (631MiB, 174MiB/s)
Detected container architecture: amd64
Creating SSH host key 'ssh_host_rsa_key' - this may take some time ...
done: SHA256:Yexh/i4nVz5SaXVK3IZmO6opi72At0KGTo0syxshuRs root@uc3
Creating SSH host key 'ssh_host_ed25519_key' - this may take some time ...
done: SHA256:zs08wUtRCgmo92CnVRnXCTwnoLJhHBqcvTVCKkwNsKY root@uc3
Creating SSH host key 'ssh_host_ecdsa_key' - this may take some time ...
done: SHA256:X7Cjp4TxWIg0LbOaTpgBZHI2onCjNB0InigrnMtdDzE root@uc3
Creating SSH host key 'ssh_host_dsa_key' - this may take some time ...
done: SHA256:ehVjePWRrzCjJAPW7YgJzbw2KyMmtZySpQYrTJV8GwQ root@uc3
TASK OK

changtimwu commented 5 years ago

Debian Buster is too new for Kubernetes

https://discuss.kubernetes.io/t/kubernetes-compatible-with-debian-10-buster/7853

https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/

update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
update-alternatives --set arptables /usr/sbin/arptables-legacy
update-alternatives --set ebtables /usr/sbin/ebtables-legacy

however, would this affect PVE6 ? What if PVE 6 has been written for new iptables?

https://askubuntu.com/questions/445487/what-debian-version-are-the-different-ubuntu-versions-based-on/445496#445496 xential is ubuntu 16. Debian Buster is more Ubuntu 18.04(Bionic Beaver), 18.10(Cosmic Cuttlefish), 19.04(Disco Dingo) based on.

https://phoenixnap.com/kb/install-kubernetes-on-ubuntu it seems kubernetes-xenial is compatible with ubuntu 18.04.

changtimwu commented 5 years ago

root@pv1:~# kubeadm init  --pod-network-cidr=10.0.0.0/24                                                               [31/1921]
[init] Using Kubernetes version: v1.16.0
[preflight] Running pre-flight checks
        [WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Pl
ease follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables contents
are not set to 1           
        [ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

changtimwu commented 5 years ago

kubernet

root@pv1:~# cat /etc/sysctl.d/pve.conf
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-filter-vlan-tagged = 0
fs.aio-max-nr = 1048576

this explains why PVE choose default 0 for all bridge nf settings.

I don't know if we should ignore swap is enabled. more discussion

changtimwu commented 5 years ago

when a node and an OSD is down

changtimwu commented 5 years ago

interesting! It checks default route. That is an important clue why Ceph doesn't work on internal LAN.

I1002 06:15:09.908906  252836 initconfiguration.go:102] detected and using CRI socket: /var/run/dockershim.sock
I1002 06:15:09.909580  252836 interface.go:384] Looking for default routes with IPv4 addresses                                  I1002 06:15:09.909588  252836 interface.go:389] Default route transits interface "vmbr0"   
I1002 06:15:09.909726  252836 interface.go:196] Interface vmbr0 is up                   
I1002 06:15:09.909770  252836 interface.go:244] Interface "vmbr0" has 2 addresses :[172.17.34.171/23 fe80::265e:beff:fe27:f145/$4].                                                                                        
I1002 06:15:09.909784  252836 interface.go:211] Checking addr  172.17.34.171/23.          
I1002 06:15:09.909792  252836 interface.go:218] IP found 172.17.34.171                                                          I1002 06:15:09.909798  252836 interface.go:250] Found valid IPv4 address 172.17.34.171 for interface "vmbr0".
I1002 06:15:09.909804  252836 interface.go:395] Found active IP 172.17.34.171

changtimwu commented 5 years ago

when it works kubeadm_initlog.txt

if you want to stop whole master node

systemctl stop kubepods.slice

or even clean up previous setup of master node

 kubeadm reset  --v=5  --ignore-preflight-errors=Swap

changtimwu commented 5 years ago

# kubectl describe nodes pv1
  Ready            False   Mon, 07 Oct 2019 06:09:26 +0800   Mon, 07 Oct 2019 06:07:24 +0800   KubeletNotReady              runt
ime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config u
ninitialized

changtimwu commented 5 years ago

complete guide for dual network interface clusters

on master node

specify master's IP in cluster network and access network

kubeadm init  --pod-network-cidr=10.0.0.0/24  --apiserver-advertise-address=10.0.0.171  --apiserver-cert-extra-sans=172.17.34.171 --v=5  --ignore-preflight-errors=Swap

example successful message

Your Kubernetes control-plane has initialized successfully!                                                                     

To start using your cluster, you need to run the following as a regular user:                                                   

  mkdir -p $HOME/.kube                                                                
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config                                                                      
  sudo chown $(id -u):$(id -g) $HOME/.kube/config                                                                          

You should now deploy a pod network to the cluster.                                      
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:                                                     
  https://kubernetes.io/docs/concepts/cluster-administration/addons/                                                            

Then you can join any number of worker nodes by running the following on each as root:                                          

kubeadm join 10.0.0.171:6443 --token 1afogn.tsn7vx4ojynf7x7t \                                               
    --discovery-token-ca-cert-hash sha256:f504df1ffbd493e6fa270a6686889a80063b81498d03c9beba07f73a98a58673

setup control plane

master as control node. Just follow the success message.

mkdir -p $HOME/.kube 
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config

a dedicated host as control node. in this case pv1 is the master.
```
scp pv1:/etc/kubernetes/admin.conf .kube/config 
```
edit its content. Replace destined IP address with server IP address
```
vi ~/.kube/config
```

install any CNI this step is essential or all your worker nodes would be not ready

timwu@qpve:~$ kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"         
serviceaccount/weave-net created
clusterrole.rbac.authorization.k8s.io/weave-net created
clusterrolebinding.rbac.authorization.k8s.io/weave-net created
role.rbac.authorization.k8s.io/weave-net created
rolebinding.rbac.authorization.k8s.io/weave-net created
daemonset.apps/weave-net created

check master node status

it would take a while to get ready

timwu@qpve:~/pcluster/pvcl$ kubectl get nodes
NAME   STATUS     ROLES    AGE   VERSION
pv1    NotReady   master   70s   v1.16.0
timwu@qpve:~/pcluster/pvcl$ kubectl get nodes
NAME   STATUS   ROLES    AGE   VERSION
pv1    Ready    master   71s   v1.16.0

install worker nodes

do the join. just follow the succeed messages on master's kubeadm init plus ignore-preflight-erros=Swap

kubeadm join 10.0.0.171:6443 --token dsq13i.2eivqoqdo5v5o76x  --ignore-preflight-errors=Swap   --discovery-token-ca-cert-hash sha256:7437ed95ee51d0b271b7fa84455532675124bbd9fb4fc1a2e652d0f139d52aeb

fix internal network

if you have a second adapter not on the default route and to be used for an internal network. On every node(including master and worker), do the following

edit /etc/systemd/system/kubelet.service.d/10-kubeadm.conf add --node-ip like this

Environment="KUBELET_EXTRA_ARGS=--fail-swap-on=false --node-ip=10.0.0.174"

restart kubelet to make node-ip take effect

systemctl daemon-reload; systemctl restart kubelet

check cluster status

it takes a while for all work nodes getting ready

timwu@qpve:~/pcluster/pvcl$ kubectl get nodes
NAME   STATUS   ROLES    AGE     VERSION
pv1    Ready    master   4m39s   v1.16.0
pv2    Ready    <none>   59s     v1.16.0
pv3    Ready    <none>   63s     v1.16.0
pv4    Ready    <none>   75s     v1.16.1

the internal network should be right

timwu@qpve:~/pcluster/pvcl$ kubectl get nodes -o wide
NAME   STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION   CONTAINER-RUNTIME
pv1    Ready    master   54m   v1.16.0   10.0.0.171    <none>        Debian GNU/Linux 10 (buster)   5.0.15-1-pve     docker://18.9.1
pv2    Ready    <none>   50m   v1.16.0   10.0.0.172    <none>        Debian GNU/Linux 10 (buster)   5.0.15-1-pve     docker://18.9.1
pv3    Ready    <none>   50m   v1.16.0   10.0.0.173    <none>        Debian GNU/Linux 10 (buster)   5.0.15-1-pve     docker://18.9.1
pv4    Ready    <none>   51m   v1.16.1   10.0.0.174    <none>        Debian GNU/Linux 10 (buster)   5.0.15-1-pve     docker://18.9.1

changtimwu commented 5 years ago

add node to cluster by CLI is much easier ex. add 10.0.0.4 to the cluster

 pvecm add 10.0.0.173

changtimwu commented 5 years ago

installing helm on k8s 1.16, you might encounter the following issues https://github.com/helm/helm/issues/6374 https://github.com/helm/helm/issues/5100

I found solution https://github.com/helm/helm/issues/6374#issuecomment-533186177

changtimwu commented 5 years ago

I also encounter above a problem while installing knative. an universal solution

wget the yaml file
replace api version with what listed here

changtimwu commented 5 years ago

run RouterOS CHR on pve

create an VM with little disk and 256MB ram.

qemu-img convert -f raw -O qcow2 chr-6.44.5.img vm-103-disk-1.qcow2 
qm importdisk 103    vm-103-disk-1.qcow2  local-lvm

update: converting to qcow2 is unecessary. you can direct import img file.

the on webui, attach the new imported and detach/remove the previous disk.

reference https://blog.csdn.net/wdhqwe520/article/details/92787925

changtimwu commented 5 years ago

CT has its CLI tool called pct. It's yet another perl implemented PVE tool.

#!/usr/bin/perl -T

use strict;
use warnings;

use PVE::CLI::pct;

PVE::CLI::pct->run_cli_handler();

to see how it wrap LXC /usr/share/perl5/PVE/CLI/pct.pm

changtimwu commented 5 years ago

use kubectl describe pod to get which node a pod scheduled to

  Type    Reason     Age        From               Message
  ----    ------     ----       ----               -------
  Normal  Scheduled  <unknown>  default-scheduler  Successfully assigned default/hello-node-7676b5fb8d-k9rc6 to pv3

changtimwu commented 5 years ago

to get PEM format of CA cert

kubectl config view --raw -o json | jq -r '.clusters[0].cluster."certificate-authority-data"' | tr -d '"' | base64 --decode

reference https://kubernetes.io/docs/tasks/administer-cluster/access-cluster-api/

changtimwu commented 4 years ago

join a node to existing cluster

kubeadm token create --print-join-command

https://stackoverflow.com/questions/32322038/adding-node-to-existing-cluster-in-kubernetes

0xcaff commented 8 months ago

Hey, I'm curious why was I mentioned here? This appears to be an error or spam but I'm getting tagged constantly (~10/month) on these issues randomly from random repos and its very annoying.

changtimwu / changtimwu.github.com

PVE notes #101

OSD construction

Bluestore vs Filestore

Bluestore device

Crush

RBD Direct pass-through to VM

complete guide for dual network interface clusters

on master node

setup control plane

check master node status

install worker nodes

fix internal network

check cluster status