用 kubeadm 在 CentOS 7 上部署 K8s 集群

本文记录在 Mac 上通过 kubeadm 在 VirtualBox 中的 CentOS 7 部署 K8s 集群的过程。

环境：

本地系统：macOS 10.15.7
VirtualBox 版本：6.1.16 r140961
CentOS 镜像版本：CentOS-7-x86_64-DVD-2009.iso

1. 安装 CentOS 7

新建虚拟机：

内存分配 2G；
文件位置：Users/***/vm/kube1/kube1.vdi。
硬盘分配 20G。

虚拟机设置：

CPU 数量：2
网卡模式：选择桥接网卡，混杂模式选全部允许，这样虚拟机可以访问外网，与宿主机也可以互相通信。

安装 CentOS 7：

设置 CentOS 镜像：选择虚拟机 - 设置 - 存储 - 控制器：IDE 盘片 - 分配光驱 - 选择下载的镜像 - OK
在 VB 界面点击启动：
- 设置时区：Asia/Shanghai
- SOFTWARE SELECTION：minimal install
- INSTALLATION DESTINATION：Automatic partitioning
- NETWORK & HOST NAME：打开 OnBoot，打开后外面显示为 Wired (enp0s3)，Host name：kube0.vm（如果没有主机名设置的地方，可以在系统安装完成后用命令 hostnamectl 设置。
- 安装过程中可以设置 root 用户密码或者添加新用户。

1.1 配置网卡为静态 IP

查看网卡信息：

> ip addr
enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000

那么在系统中网卡设备名为：ifcfg-enp0s3

修改配置文件：

> vi /etc/sysconfig/network-scripts/ifcfg-enp0s3

TYPE=Ethernet                # 网卡类型：为以太网
PROXY_METHOD=none            # 代理方式：关闭状态
BROWSER_ONLY=no                # 只是浏览器：否
BOOTPROTO=dhcp                # 网卡的引导协议：DHCP[中文名称: 动态主机配置协议]
DEFROUTE=yes                # 默认路由：是, 不明白的可以百度关键词 `默认路由` 
IPV4_FAILURE_FATAL=no        # 是不开启IPV4致命错误检测：否
IPV6INIT=yes                # IPV6是否自动初始化: 是[不会有任何影响, 现在还没用到IPV6]
IPV6_AUTOCONF=yes            # IPV6是否自动配置：是[不会有任何影响, 现在还没用到IPV6]
IPV6_DEFROUTE=yes            # IPV6是否可以为默认路由：是[不会有任何影响, 现在还没用到IPV6]
IPV6_FAILURE_FATAL=no        # 是不开启IPV6致命错误检测：否
IPV6_ADDR_GEN_MODE=stable-privacy            # IPV6地址生成模型：stable-privacy [这只一种生成IPV6的策略]
NAME=ens33                    # 网卡物理设备名称
UUID=f47bde51-fa78-4f79-b68f-d5dd90cfc698    # 通用唯一识别码, 每一个网卡都会有, 不能重复, 否两台linux只有一台网卡可用
DEVICE=ens33                    # 网卡设备名称, 必须和 `NAME` 值一样
ONBOOT=no                        # 是否开机启动， 要想网卡开机就启动或通过 `systemctl restart network`控制网卡,必须设置为 `yes`

需要修改 BOOTPROTO 和 ONBOOT，并添加三行内容：

TYPE="Ethernet"
PROXY_METHOD="none"
BROWSER_ONLY="no"
BOOTPROTO="static" # 设置网卡引导协议为静态
DEFROUTE="yes"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_FAILURE_FATAL="no"
IPV6_ADDR_GEN_MODE="stable-privacy"
NAME="enp0s3"
UUID="cde5c905-2d8d-4ca6-965e-30a9a9d5036b"
DEVICE="enp0s3"
ONBOOT="yes" # 设置网卡启动方式为 开机启动 并且可以通过 systemctl 控制网卡
IPADDR=10.0.0.6 # IP 地址
NETMASK=255.255.255.0 # 子网掩码
GATEWAY=10.0.0.1 # 网关地址

重启网卡：

systemctl restart network

再查看 ip addr 会发现修改成功：

enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:17:46:85 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.6/24 brd 10.0.0.255 scope global noprefixroute enp0s3

2. 安装 Docker

安装过程见 [00.CentOS 7 安装 Docker CE]()

开启

sudo yum install -y yum-utils

sudo yum-config-manager \
    --add-repo \
    https://download.docker.com/linux/centos/docker-ce.repo

安装最新版本：

sudo yum install docker-ce docker-ce-cli containerd.io

安装指定版本：

yum list docker-ce --showduplicates | sort -r

sudo yum install docker-ce-<VERSION_STRING> docker-ce-cli-<VERSION_STRING> containerd.io

启动 Docker：

sudo systemctl start docker

3. 安装 Kubernetes 各组件

准备工作：

禁用 SELinux：
- 临时关闭：命令行执行 setenforce 0
- 永久关闭：修改 /etc/selinux/config 文件，将 SELINUX=enforcing 改为 SELINUX=disabled
关闭防火墙：systemctl disable firewalld && systemctl stop firewalld
关闭交换分区：swapoff -a && sed -i '/ swap / s/^/#/' /etc/fstab
更改 iptables 设置（后面两台虚拟机也要更改）：echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables

在 /etc/docker/daemon.json 中加入：

{
  "registry-mirrors": ["https://registry.docker-cn.co"],
  "graph": "/mnt/docker-data",
  "storage-driver": "overlay"
}

安装 kubectl、kubeadm、kubelet：
```
yum -y install kubectl kubeadm kubelet
```
启动 Docker 和 kubelet（此时可能无法成功启动，在 kubeadm init 后会自动启动）：
```
systemctl enable docker && systemctl start docker
systemctl enable kubelet && systemctl start kubelet
```

4. 复制多台虚拟机

clone 两台虚拟机，并命名为 kube1.vm、kube2.vm：

hostnamectl set-hostname kube1.vm
hostnamectl set-hostname kube2.vm

将三台虚拟机的 host 写入到宿主机和每一台虚拟机：

vi /etc/hosts  

# 追加以下内容（ip 自行替换）
10.0.0.6 kube0.vm
10.0.0.7 kube1.vm
10.0.0.8 kube2.vm

5. 初始化 Master 节点

在 kube0.vm 上执行 kubeadm init。

由于你懂的的原因，国内可能需要指定库：

kubeadm init --image-repository registry.aliyuncs.com/google_containers

成功后在 kube0.vm 执行以下命令，以便于使用 kubectl：

# 非 root 用户
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# root 用户
export KUBECONFIG=/etc/kubernetes/admin.conf

6. 安装网络插件 flannel

执行：

kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

如果报错：

The connection to the server raw.githubusercontent.com was refused - did you specify the right host or port?

可以把 yaml 文件拷贝到本地：

kubectl apply -f ./kube-flannel.yml

# 执行结果
podsecuritypolicy.policy/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds created

此时查看 node 状态应该是 Ready 了：

kubectl get nodes

NAME       STATUS   ROLES                  AGE   VERSION
kube0.vm   Ready    control-plane,master   79m   v1.20.1

7. 把多个 Node 加入集群

kube1.vm 和 kube2.vm 的准备工作：

修改 hosts 文件。
修改主机名。
切换到 root 用户。

使用 kube init 成功执行后输出的信息执行以下命令：

# 使用方法
kubeadm join --token <token> <control-plane-host>:<control-plane-port> --discovery-token-ca-cert-hash sha256:<hash>

kubeadm join 10.0.0.6:6443 --token 2i9xrk.97hiq25bjim8cn00 \
    --discovery-token-ca-cert-hash sha256:7e7a21a363bcc8510b239f9269cc2c0d806941f6e476b731ed7b3cd00405004c

过程中遇到的问题都解决后，可以执行 kubectl get all -A 来查看集群是否安装成功。

8. 安装过程中的 trouble shooting

8.1 Docker 启动失败

启动 Docker 时失败，返回 Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.

按指示执行 systemctl status docker.service 中主要内容为：Main PID: 1661 (code=start-limited, status=1/FAILURE)

再执行 journalctl -u docker.service，报错内容显示是 JSON 文件解析错误，检查 /etc/docker/daemon.json 中的内容，确实是少了各逗号，修改 JSON 格式后，正确的格式：

{
    "registry-mirrors": ["https://registry.docker-cn.co"],
    "graph": "/mnt/docker-data",
    "storage-driver": "overlay"
}

再执行 systemctl enable docker && systemctl start docker，启动成功。

8.2 kubelet 启动失败

启动失败后，查看状态：

systemctl status kubelet.service

kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: activating (auto-restart) (Result: exit-code) since Mon 2021-01-04 15:58:22 CST; 9s ago
     Docs: https://kubernetes.io/docs/
  Process: 12965 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
 Main PID: 12965 (code=exited, status=255)

Jan 04 15:58:22 kube0.vm systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
Jan 04 15:58:22 kube0.vm systemd[1]: Unit kubelet.service entered failed state.
Jan 04 15:58:22 kube0.vm systemd[1]: kubelet.service failed.

搜索得知需要先执行 kubeadm init

8.3 kubeadm init 无法拉取镜像

执行 kubeadm init 后，报错：

error execution phase preflight: [preflight] Some fatal errors occurred:
    [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-apiserver:v1.20.1: output: Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
, error: exit status 1
    [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-controller-manager:v1.20.1: output: Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
, error: exit status 1
    [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-scheduler:v1.20.1: output: Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
, error: exit status 1
    [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-proxy:v1.20.1: output: Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
, error: exit status 1
    [ERROR ImagePull]: failed to pull image k8s.gcr.io/pause:3.2: output: Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
, error: exit status 1
    [ERROR ImagePull]: failed to pull image k8s.gcr.io/etcd:3.4.13-0: output: Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
, error: exit status 1
    [ERROR ImagePull]: failed to pull image k8s.gcr.io/coredns:1.7.0: output: Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
, error: exit status 1

这是无法访问 k8s.gcr.io 导致的，可以通过配置 proxy 或设置 --image-repository 来解决。

配置代理的方式未成功，这里用第二种方式：

[root@kube0 ~]# kubeadm init --image-repository registry.aliyuncs.com/google_containers

[init] Using Kubernetes version: v1.20.1
[preflight] Running pre-flight checks
    [WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kube0.vm kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.0.0.6]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [kube0.vm localhost] and IPs [10.0.0.6 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [kube0.vm localhost] and IPs [10.0.0.6 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[apiclient] All control plane components are healthy after 15.011343 seconds
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config-1.20" in namespace kube-system with the configuration for the kubelets in the cluster
[upload-certs] Skipping phase. Please see --upload-certs
[mark-control-plane] Marking the node kube0.vm as control-plane by adding the labels "node-role.kubernetes.io/master=''" and "node-role.kubernetes.io/control-plane='' (deprecated)"
[mark-control-plane] Marking the node kube0.vm as control-plane by adding the taints [node-role.kubernetes.io/master:NoSchedule]
[bootstrap-token] Using token: 2i9xrk.97hiq25bjim8cn00
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 10.0.0.6:6443 --token 2i9xrk.97hiq25bjim8cn00 \
    --discovery-token-ca-cert-hash sha256:7e7a21a363bcc8510b239f9269cc2c0d806941f6e476b731ed7b3cd00405004c

8.4 安装 flannel 后报错

执行 kubectl get all -A 时发现 Pod kube-flannel-ds-ktm4q 状态是 CrashLoopBackOff。

查看该 Pod 日志：

kubectl logs kube-flannel-ds-ktm4q -n kube-system

# 错误日志
ERROR: logging before flag.Parse: I0104 16:51:09.442652       1 main.go:519] Determining IP address of default interface
ERROR: logging before flag.Parse: I0104 16:51:09.443306       1 main.go:532] Using interface with name enp0s3 and address 10.0.0.6
ERROR: logging before flag.Parse: I0104 16:51:09.443322       1 main.go:549] Defaulting external address to interface address (10.0.0.6)
W0104 16:51:09.443343       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
ERROR: logging before flag.Parse: I0104 16:51:09.635774       1 kube.go:116] Waiting 10m0s for node controller to sync
ERROR: logging before flag.Parse: I0104 16:51:09.636082       1 kube.go:299] Starting kube subnet manager
ERROR: logging before flag.Parse: I0104 16:51:10.636292       1 kube.go:123] Node controller sync successful
ERROR: logging before flag.Parse: I0104 16:51:10.636327       1 main.go:253] Created subnet manager: Kubernetes Subnet Manager - kube0.vm
ERROR: logging before flag.Parse: I0104 16:51:10.636333       1 main.go:256] Installing signal handlers
ERROR: logging before flag.Parse: I0104 16:51:10.636398       1 main.go:391] Found network config - Backend type: vxlan
ERROR: logging before flag.Parse: I0104 16:51:10.636476       1 vxlan.go:123] VXLAN config: VNI=1 Port=0 GBP=false Learning=false DirectRouting=false
ERROR: logging before flag.Parse: E0104 16:51:10.636817       1 main.go:292] Error registering network: failed to acquire lease: node "kube0.vm" pod cidr not assigned

解决方法：编辑 /etc/kubernetes/manifests/kube-controller-manager.yaml，在 spec -> containers -> -command 下加入两行:

- --allocate-node-cidrs=true
- --cluster-cidr=10.244.0.0/16

然后删除 kube-controller-manager ，它会自动重启，配置就生效了。

8.5 Node 加入集群时报错

执行以下命令加入集群时报错：

kubeadm join 10.0.0.6:6443 --token 2i9xrk.97hiq25bjim8cn00 \
    --discovery-token-ca-cert-hash sha256:7e7a21a363bcc8510b239f9269cc2c0d806941f6e476b731ed7b3cd00405004c

# 错误信息
error execution phase preflight: [preflight] Some fatal errors occurred:
    [ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables contents are not set to 1

这是忘记执行这条命令了：

echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables

8.6 Node 加入集群后，使用 kubectl 命令时报错

Node 在加入集群后，执行 kubectl 命令时会报错：

The connection to the server localhost:8080 was refused - did you specify the right host or port?

之前在 Master 节点时也有这个报错，随后以下命令后就可以使用了：

export KUBECONFIG=/etc/kubernetes/admin.conf

但是在 Worker 节点上执行此命令后，还是报错：

/etc/kubernetes/admin.conf: No such file or directory

这是因为 Worker 节点上没有配置文件，记得取消前面设置的环境变量：

unset KUBECONFIG

然后执行：

mkdir -p $HOME/.kube/
scp root@matet:/etc/kubernetes/admin.conf   $HOME/.kube/config

此时再执行 kubectl 命令就不会报错了。

8.7 集群中的工作节点没有 Role

执行命令：

kubectl get nodes

NAME       STATUS   ROLES                  AGE    VERSION
kube0.vm   Ready    control-plane,master   157m   v1.20.1
kube1.vm   Ready    <none>                 47m    v1.20.1
kube2.vm   Ready    <none>                 42m    v1.20.1

工作节点的 ROLES 为 none，不知道是默认这样还是有什么原因？

ROLES 只是节点的 label，和节点亲和性有关，比如 Pod 不会调度到 master 节点。参考：Assign Pods to Nodes using Node Affinity

添加 ROLE：

kubectl label node <node name> node-role.kubernetes.io/<role name>=<key - (any name)>

删除 ROLE：

kubectl label node <node name> node-role.kubernetes.io/<role name>-

9. 参考

Creating a cluster with kubeadm

LLLeon / Blog