cephfs is running on top of RADOS.
Ceph also provides a filesystem, running on top of the same object storage as RADOS block devices do. A Metadata Server (MDS) is used to map the RADOS-backed objects to files and directories, providing a POSIX-compliant, replicated filesystem. This makes it easy to have a clustered, highly available, shared filesystem if Ceph is already in use. Its Metadata Servers guarantee that files get balanced out over the whole Ceph cluster; this way even high load will not overload a single host, which can be an issue with traditional shared-filesystem approaches like NFS.
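For reference, a rough sketch of how the CephFS pieces can be created on PVE with pveceph (check pveceph help for the exact options of your version; the pool names in the comment are the usual defaults):
pveceph mds create                     # start a Metadata Server on this node
pveceph fs create --add-storage        # creates cephfs_data / cephfs_metadata pools and registers the "cephfs" storage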
BlueStore writes to a raw device (partition) directly; FileStore writes files, with a filesystem in the middle.
Creating an OSD on a partition is not recommended by PVE. https://forum.proxmox.com/threads/how-can-i-create-osd-on-partition.37667/
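PVE prefers whole disks for OSDs; the usual form is something like this (the device name here is just an example):
pveceph createosd /dev/sdb        # newer PVE versions spell this "pveceph osd create /dev/sdb"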
root@pv1:/var/lib/vz/images# rados -p myshpool bench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pv1_52253
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 22 6 23.9999 24 0.882137 0.528877
2 16 44 28 55.9977 88 0.59794 0.782572
3 16 62 46 61.3299 72 0.807239 0.891528
4 16 80 64 63.996 72 0.642039 0.883054
5 16 96 80 63.9957 64 0.784858 0.890982
6 16 117 101 67.3287 84 0.439958 0.894166
7 16 133 117 66.8524 64 0.900278 0.886985
8 16 150 134 66.9952 68 0.810622 0.888864
9 16 167 151 67.1063 68 0.516461 0.901453
10 16 186 170 67.995 76 1.06456 0.902008
Total time run: 10.6124
Total writes made: 187
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 70.4838
Stddev Bandwidth: 17.3845
Max bandwidth (MB/sec): 88
Min bandwidth (MB/sec): 24
Average IOPS: 17
Stddev IOPS: 4.34613
Max IOPS: 22
Min IOPS: 6
Average Latency(s): 0.90435
Stddev Latency(s): 0.279642
Max latency(s): 1.63962
Min latency(s): 0.222066
root@pv1:/var/lib/vz/images# rados -p myshpool bench 10 seq
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 85 69 275.953 276 0.124554 0.148886
2 16 133 117 233.968 192 0.011733 0.216671
3 16 187 171 227.973 216 0.885628 0.245782
Total time run: 3.50539
Total reads made: 187
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 213.385
Average IOPS: 53
Stddev IOPS: 10.8167
Max IOPS: 69
Min IOPS: 48
Average Latency(s): 0.296811
Max latency(s): 1.33063
Min latency(s): 0.0106387
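Since the write bench was run with --no-cleanup (so the seq read test has something to read), the benchmark objects stay in the pool; afterwards they can be removed with:
rados -p myshpool cleanup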
We set up another 10G NIC to have a separate cluster network.
auto enp1s0f1
iface enp1s0f1 inet static
address 10.0.0.172
netmask 255.255.255.0
then
ifup enp1s0f1
On any one node, edit corosync.conf in the following way: copy it to a .new file, make the changes there (see the sketch below), then move it back into place:
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
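For reference, a minimal sketch of the edit made to corosync.conf.new (structure per corosync 3 on PVE 6; here the 10.0.0.x addresses go into ring0_addr, which matches the pvecm status below, but they could equally be added as a second link via ring1_addr; config_version is whatever the current value is, plus one):
totem {
  ...
  config_version: <current value + 1>   # must be bumped or the new file is ignored
}
nodelist {
  node {
    name: pv1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.171
  }
  # pv2 / pv3 edited the same way with 10.0.0.173 / 10.0.0.172
}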
Watch the config being applied:
systemctl status corosync
journalctl -b -u corosync
check status
root@pv1:/var/lib/vz/template/iso# pvecm status
Quorum information
------------------
Date: Tue Sep 24 05:41:00 2019
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1/92
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.0.0.171 (local)
0x00000002 1 10.0.0.173
0x00000003 1 10.0.0.172
https://docs.ceph.com/docs/master/architecture/?highlight=crush#ceph-protocol
Storage cluster clients and each Ceph OSD Daemon use the CRUSH algorithm to efficiently compute information about data location, instead of having to depend on a central lookup table. Ceph's high-level features include providing a native interface to the Ceph Storage Cluster via librados, and a number of service interfaces built on top of librados.
That's why Ceph doesn't require an explicit master.
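You can watch this client-side placement computation directly; for example (the object name here is made up), ceph osd map asks where an object would land without consulting any central lookup table:
ceph osd map myshpool some-object   # prints the PG and the set of OSDs CRUSH computes for this object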
In virtual machine scenarios, people typically deploy a Ceph Block Device with the rbd network storage driver in QEMU/KVM, where the host machine uses librbd to provide a block device service to the guest. Many cloud computing stacks use libvirt to integrate with hypervisors. You can use thin-provisioned Ceph Block Devices with QEMU and libvirt to support OpenStack and CloudStack among other solutions.
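On PVE this shows up as an RBD entry in /etc/pve/storage.cfg; roughly like this for a PVE-managed pool (field names from memory, the GUI writes this for you when you add the pool as storage):
rbd: myshpool
        content images,rootdir
        krbd 0
        pool myshpool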
Check whether a drive is held by LVM:
ls /sys/block/sdb/holders/
If pveceph createosd keeps complaining that sdX is in use even though the drive has no holders, just use fdisk to delete all of its existing partitions.
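If fdisk alone doesn't help (leftover LVM or Ceph signatures), these standard tools can also clear the disk; a generic sketch, sdX is a placeholder:
ceph-volume lvm zap /dev/sdX --destroy
# or
wipefs -a /dev/sdX
dd if=/dev/zero of=/dev/sdX bs=1M count=200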
typical migration log
2019-09-30 08:30:53 starting migration of VM 100 to node 'pv1' (172.17.34.171)
2019-09-30 08:30:53 copying disk images
2019-09-30 08:30:53 starting VM 100 on remote node 'pv1'
2019-09-30 08:30:54 start remote tunnel
2019-09-30 08:30:55 ssh tunnel ver 1
2019-09-30 08:30:55 starting online/live migration on unix:/run/qemu-server/100.migrate
2019-09-30 08:30:55 migrate_set_speed: 8589934592
2019-09-30 08:30:55 migrate_set_downtime: 0.1
2019-09-30 08:30:55 set migration_caps
2019-09-30 08:30:55 set cachesize: 67108864
2019-09-30 08:30:55 start migrate command to unix:/run/qemu-server/100.migrate
2019-09-30 08:30:56 migration status: active (transferred 119476085, remaining 427683840), total 554508288)
2019-09-30 08:30:56 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2019-09-30 08:30:57 migration status: active (transferred 237195078, remaining 308461568), total 554508288)
2019-09-30 08:30:57 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2019-09-30 08:30:58 migration status: active (transferred 354827470, remaining 185790464), total 554508288)
2019-09-30 08:30:58 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2019-09-30 08:30:59 migration status: active (transferred 472378098, remaining 63057920), total 554508288)
2019-09-30 08:30:59 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2019-09-30 08:30:59 migration status: active (transferred 484432525, remaining 50589696), total 554508288)
2019-09-30 08:30:59 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2019-09-30 08:30:59 migration status: active (transferred 496438289, remaining 37904384), total 554508288)
2019-09-30 08:30:59 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2019-09-30 08:30:59 migration status: active (transferred 508409781, remaining 25907200), total 554508288)
2019-09-30 08:30:59 migration xbzrle cachesize: 67108864 transferred 0 pages 0 cachemiss 0 overflow 0
2019-09-30 08:30:59 migration speed: 128.00 MB/s - downtime 84 ms
2019-09-30 08:30:59 migration status: completed
2019-09-30 08:31:02 migration finished successfully (duration 00:00:09)
TASK OK
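For reference, a log like the one above comes from an online migration started from the GUI, or from roughly this on the CLI (VM ID and target taken from the log above):
qm migrate 100 pv1 --online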
To also provide object storage (Ceph Object Gateway): https://pve.proxmox.com/wiki/User:Grin/Ceph_Object_Gateway
typical CT creation log
/dev/rbd0
mke2fs 1.44.5 (15-Dec-2018)
Discarding device blocks: done
Creating filesystem with 8388608 4k blocks and 2097152 inodes
Filesystem UUID: cda44579-b69b-420b-8cf0-a2405e0f8b15
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624
Allocating group tables: 0/256 done
Writing inode tables: 0/256 done
Creating journal (65536 blocks): done
Multiple mount protection is enabled with update interval 5 seconds.
Writing superblocks and filesystem accounting information: 0/256 done
extracting archive '/mnt/pve/cephfs/template/cache/ubuntu-19.04-standard_19.04-1_amd64.tar.gz'
Total bytes read: 661258240 (631MiB, 174MiB/s)
Detected container architecture: amd64
Creating SSH host key 'ssh_host_rsa_key' - this may take some time ...
done: SHA256:Yexh/i4nVz5SaXVK3IZmO6opi72At0KGTo0syxshuRs root@uc3
Creating SSH host key 'ssh_host_ed25519_key' - this may take some time ...
done: SHA256:zs08wUtRCgmo92CnVRnXCTwnoLJhHBqcvTVCKkwNsKY root@uc3
Creating SSH host key 'ssh_host_ecdsa_key' - this may take some time ...
done: SHA256:X7Cjp4TxWIg0LbOaTpgBZHI2onCjNB0InigrnMtdDzE root@uc3
Creating SSH host key 'ssh_host_dsa_key' - this may take some time ...
done: SHA256:ehVjePWRrzCjJAPW7YgJzbw2KyMmtZySpQYrTJV8GwQ root@uc3
TASK OK
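For reference, a CT creation like the one logged above can also be done from the CLI with pct; a rough sketch (the CT ID and option values here are illustrative, not taken from the log):
pct create 110 /mnt/pve/cephfs/template/cache/ubuntu-19.04-standard_19.04-1_amd64.tar.gz \
    --hostname uc3 --rootfs myshpool:32 \
    --net0 name=eth0,bridge=vmbr0,ip=dhcp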
Debian Buster is too new for Kubernetes
update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
update-alternatives --set arptables /usr/sbin/arptables-legacy
update-alternatives --set ebtables /usr/sbin/ebtables-legacy
However, would this affect PVE 6? What if PVE 6 was written for the new nftables-based iptables?
https://askubuntu.com/questions/445487/what-debian-version-are-the-different-ubuntu-versions-based-on/445496#445496
xenial is Ubuntu 16.04. Debian Buster corresponds more closely to Ubuntu 18.04 (Bionic Beaver), 18.10 (Cosmic Cuttlefish), and 19.04 (Disco Dingo).
https://phoenixnap.com/kb/install-kubernetes-on-ubuntu
It seems the kubernetes-xenial repo is compatible with Ubuntu 18.04.
root@pv1:~# kubeadm init --pod-network-cidr=10.0.0.0/24
[init] Using Kubernetes version: v1.16.0
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables contents are not set to 1
[ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
root@pv1:~# cat /etc/sysctl.d/pve.conf
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-filter-vlan-tagged = 0
fs.aio-max-nr = 1048576
So PVE defaults all the bridge-nf settings to 0, which explains the preflight error above.
I don't know whether we should just ignore the fact that swap is enabled; this needs more discussion.
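A minimal way to satisfy those two preflight checks (assuming it is acceptable to flip the bridge-nf sysctls to 1; whether that interferes with the PVE firewall is exactly the open question above):
cat > /etc/sysctl.d/99-kubernetes.conf <<'EOF'
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF
sysctl --system
swapoff -a   # and comment the swap entry out of /etc/fstab to keep it off after reboot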
when a node and an OSD are down
Interesting! It checks the default route. That is an important clue to why Ceph doesn't work on the internal LAN.
I1002 06:15:09.908906 252836 initconfiguration.go:102] detected and using CRI socket: /var/run/dockershim.sock
I1002 06:15:09.909580  252836 interface.go:384] Looking for default routes with IPv4 addresses
I1002 06:15:09.909588  252836 interface.go:389] Default route transits interface "vmbr0"
I1002 06:15:09.909726 252836 interface.go:196] Interface vmbr0 is up
I1002 06:15:09.909770  252836 interface.go:244] Interface "vmbr0" has 2 addresses :[172.17.34.171/23 fe80::265e:beff:fe27:f145/64].
I1002 06:15:09.909784 252836 interface.go:211] Checking addr 172.17.34.171/23.
I1002 06:15:09.909792  252836 interface.go:218] IP found 172.17.34.171
I1002 06:15:09.909798  252836 interface.go:250] Found valid IPv4 address 172.17.34.171 for interface "vmbr0".
I1002 06:15:09.909804 252836 interface.go:395] Found active IP 172.17.34.171
when it works: kubeadm_initlog.txt
If you want to stop the whole master node:
systemctl stop kubepods.slice
Or even clean up a previous master-node setup:
kubeadm reset --v=5 --ignore-preflight-errors=Swap
# kubectl describe nodes pv1
Ready   False   Mon, 07 Oct 2019 06:09:26 +0800   Mon, 07 Oct 2019 06:07:24 +0800   KubeletNotReady   runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Specify the master's IP on both the cluster network and the access network:
kubeadm init --pod-network-cidr=10.0.0.0/24 --apiserver-advertise-address=10.0.0.171 --apiserver-cert-extra-sans=172.17.34.171 --v=5 --ignore-preflight-errors=Swap
Example success message:
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join 10.0.0.171:6443 --token 1afogn.tsn7vx4ojynf7x7t \
--discovery-token-ca-cert-hash sha256:f504df1ffbd493e6fa270a6686889a80063b81498d03c9beba07f73a98a58673
mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
pv1 is the master. To control the cluster from a workstation, copy its kubeconfig and then edit the server IP address:
scp pv1:/etc/kubernetes/admin.conf .kube/config
vi ~/.kube/config
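The edit is presumably just the server: line, pointing it at the externally reachable address that was added via --apiserver-cert-extra-sans; a sketch using the addresses from the kubeadm init above:
# ~/.kube/config
clusters:
- cluster:
    server: https://172.17.34.171:6443   # was https://10.0.0.171:6443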
timwu@qpve:~$ kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
serviceaccount/weave-net created
clusterrole.rbac.authorization.k8s.io/weave-net created
clusterrolebinding.rbac.authorization.k8s.io/weave-net created
role.rbac.authorization.k8s.io/weave-net created
rolebinding.rbac.authorization.k8s.io/weave-net created
daemonset.apps/weave-net created
It takes a while for the node to become Ready:
timwu@qpve:~/pcluster/pvcl$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
pv1 NotReady master 70s v1.16.0
timwu@qpve:~/pcluster/pvcl$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
pv1 Ready master 71s v1.16.0
Do the join. Just follow the success message from the master's kubeadm init, plus --ignore-preflight-errors=Swap:
kubeadm join 10.0.0.171:6443 --token dsq13i.2eivqoqdo5v5o76x --ignore-preflight-errors=Swap --discovery-token-ca-cert-hash sha256:7437ed95ee51d0b271b7fa84455532675124bbd9fb4fc1a2e652d0f139d52aeb
If you have a second adapter that is not on the default route and is to be used for the internal network, do the following on every node (master and workers): edit /etc/systemd/system/kubelet.service.d/10-kubeadm.conf and add --node-ip, like this:
Environment="KUBELET_EXTRA_ARGS=--fail-swap-on=false --node-ip=10.0.0.174"
Restart kubelet to make --node-ip take effect:
systemctl daemon-reload; systemctl restart kubelet
It takes a while for all worker nodes to become Ready:
timwu@qpve:~/pcluster/pvcl$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
pv1 Ready master 4m39s v1.16.0
pv2 Ready <none> 59s v1.16.0
pv3 Ready <none> 63s v1.16.0
pv4 Ready <none> 75s v1.16.1
The internal network addresses should now be right:
timwu@qpve:~/pcluster/pvcl$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
pv1 Ready master 54m v1.16.0 10.0.0.171 <none> Debian GNU/Linux 10 (buster) 5.0.15-1-pve docker://18.9.1
pv2 Ready <none> 50m v1.16.0 10.0.0.172 <none> Debian GNU/Linux 10 (buster) 5.0.15-1-pve docker://18.9.1
pv3 Ready <none> 50m v1.16.0 10.0.0.173 <none> Debian GNU/Linux 10 (buster) 5.0.15-1-pve docker://18.9.1
pv4 Ready <none> 51m v1.16.1 10.0.0.174 <none> Debian GNU/Linux 10 (buster) 5.0.15-1-pve docker://18.9.1
Adding a node to the PVE cluster via the CLI is much easier. E.g., to add 10.0.0.4 to the cluster, run the following on 10.0.0.4 (pointing at an existing member):
pvecm add 10.0.0.173
When installing Helm on k8s 1.16, you might encounter the following issues: https://github.com/helm/helm/issues/6374 https://github.com/helm/helm/issues/5100
I found a solution: https://github.com/helm/helm/issues/6374#issuecomment-533186177
I also encountered the same problem while installing Knative; the linked workaround is a universal solution.
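Roughly, that workaround patches the apiVersion and selector of the Tiller deployment that helm init emits before applying it (reproduced from memory, so verify the exact sed against the linked comment):
helm init --service-account tiller --output yaml \
  | sed 's@apiVersion: extensions/v1beta1@apiVersion: apps/v1@' \
  | sed 's@  replicas: 1@  replicas: 1\n  selector: {"matchLabels": {"app": "helm", "name": "tiller"}}@' \
  | kubectl apply -f -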
run RouterOS CHR on PVE
Create a VM with a small disk and 256 MB of RAM.
qemu-img convert -f raw -O qcow2 chr-6.44.5.img vm-103-disk-1.qcow2
qm importdisk 103 vm-103-disk-1.qcow2 local-lvm
Update: converting to qcow2 is unnecessary; you can import the .img file directly.
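A direct import would presumably look like this (qm importdisk detects the raw format itself):
qm importdisk 103 chr-6.44.5.img local-lvm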
Then, in the web UI, attach the newly imported disk and detach/remove the previous disk.
reference https://blog.csdn.net/wdhqwe520/article/details/92787925
CTs have their own CLI tool called pct. It's yet another Perl-implemented PVE tool:
#!/usr/bin/perl -T
use strict;
use warnings;
use PVE::CLI::pct;
PVE::CLI::pct->run_cli_handler();
To see how it wraps LXC, look at /usr/share/perl5/PVE/CLI/pct.pm
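Typical pct usage (the CT ID here is just an example):
pct list                  # list containers on this node
pct enter 110             # get a shell inside CT 110
pct exec 110 -- uname -a  # run a single command inside the CT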
Use kubectl describe pod to find out which node a pod was scheduled to:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/hello-node-7676b5fb8d-k9rc6 to pv3
To get the CA cert in PEM format:
kubectl config view --raw -o json | jq -r '.clusters[0].cluster."certificate-authority-data"' | tr -d '"' | base64 --decode
reference https://kubernetes.io/docs/tasks/administer-cluster/access-cluster-api/
To join a node to an existing cluster, generate a fresh join command:
kubeadm token create --print-join-command
https://stackoverflow.com/questions/32322038/adding-node-to-existing-cluster-in-kubernetes
Hey, I'm curious why I was mentioned here? This appears to be an error or spam, but I'm getting tagged constantly (~10/month) on these issues randomly from random repos and it's very annoying.
Ceph works better on a 10 Gb LAN. We recommend a network bandwidth of at least 10 GbE or more, used exclusively for Ceph. A meshed network setup is also an option if there are no 10 GbE switches available. The volume of traffic, especially during recovery, will interfere with other services on the same network and may even break the Proxmox VE cluster stack. Further, estimate your bandwidth needs: while one HDD might not saturate a 1 Gb link, multiple HDD OSDs per node can, and modern NVMe SSDs will quickly saturate 10 Gbps of bandwidth. Deploying a network capable of even more bandwidth will ensure that it isn't your bottleneck and won't be anytime soon; 25, 40 or even 100 Gbps are possible.