cnrancher / autok3s

Run K3s Everywhere
https://www.suse.com
Apache License 2.0
741 stars 76 forks

[BUG] Error writing node IP when join node, The status has been constantly in Upgrading #648

Closed. xuzheng0017 closed this 7 months ago

xuzheng0017 commented 8 months ago

Describe the bug: I entered the wrong node IP when joining a node, and the cluster status has been stuck in Upgrading ever since.

Screenshots image

Additional context

time="2023-12-05T11:59:34+08:00" level=info msg="the 4/5 time tring to ssh to 74.48.115.18:22 with user root"
https://mirrors.sonic.net/epel/7/x86_64/repodata/d526a7fd5dbf31d263829b2d144a41ca6126a8ead6d8a75fe0da87b1f250efb1-primary.sqlite.bz2: [Errno 14] HTTPS Error 404 - Not Found
Trying other mirror.
To address this issue please refer to the below wiki article
https://wiki.centos.org/yum-errors
If above article doesn't help to resolve this issue please use https://bugs.centos.org/.
http://mirror.tornadovps.com/pub/epel/7/x86_64/repodata/d526a7fd5dbf31d263829b2d144a41ca6126a8ead6d8a75fe0da87b1f250efb1-primary.sqlite.bz2: [Errno 14] HTTP Error 404 - Not Found
Trying other mirror.
time="2023-12-05T12:00:04+08:00" level=info msg="the 5/5 time tring to ssh to 74.48.115.18:22 with user root"
Package yum-utils-1.1.31-54.el7_8.noarch already installed and latest version
Nothing to do
Loaded plugins: fastestmirror
Command line error: no such option: --refresh
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: mirror.web-ster.com
* epel: lolhost.mm.fcix.net
* extras: mirrors.oit.uci.edu
* updates: mirror.sfo12.us.leaseweb.net
Package yum-utils-1.1.31-54.el7_8.noarch already installed and latest version
Nothing to do
Loaded plugins: fastestmirror
Package yum-utils-1.1.31-54.el7_8.noarch already installed and latest version
Nothing to do
Command line error: no such option: --refresh
Loaded plugins: fastestmirror
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: ix-denver.mm.fcix.net
* epel: mirrors.ocf.berkeley.edu
* extras: mirrors.oit.uci.edu
* updates: mirror.sfo12.us.leaseweb.net
Command line error: no such option: --refresh
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: mirror.web-ster.com
* epel: mirrors.ocf.berkeley.edu
* extras: mirrors.oit.uci.edu
* updates: ix-denver.mm.fcix.net
Package yum-utils-1.1.31-54.el7_8.noarch already installed and latest version
Nothing to do
Loaded plugins: fastestmirror

JacieChao commented 8 months ago

Thanks for your feedback.

Are all cluster nodes running CentOS 7.9, or only the newly added worker node? It looks like the new worker node could not fetch packages from the RPM mirror. Could you please provide the parameters you used to join the node and the full log from the join?

xuzheng0017 commented 8 months ago

vps-regtech.log This is the complete log for this cluster. All cluster nodes run CentOS 7.9. When I added a new batch of nodes, I entered one IP address incorrectly. The cluster is currently stuck in the Upgrading state.

JacieChao commented 8 months ago

Is the join action stuck at the last line of the log you provided? It looks like AutoK3s can't reach the node 74.48.115.18 over the SSH tunnel. Is this the incorrect node IP?

xuzheng0017 commented 8 months ago

Q1: Yes.
Q2: 74.48.115.18 is wrong; I mistyped it. emmmmmm...

JacieChao commented 8 months ago

Sometimes the native provider doesn't catch the join error correctly. When that happens, the cluster's status stays in Upgrading indefinitely. I will look into whether there's a workaround.

xuzheng0017 commented 8 months ago

Okay, I'll rebuild the cluster. Thank you for your answer. Best wishes to you.

JacieChao commented 8 months ago

@xuzheng0017 There's no need to rebuild the cluster. The K3s cluster itself is not affected by the cluster status shown in AutoK3s.

xuzheng0017 commented 8 months ago

Okay, but I want to join other nodes, and the page doesn't offer any options to do so.

JacieChao commented 8 months ago

The workaround below may help you:

JacieChao commented 8 months ago

The bug is related to the error being caught incorrectly in a deferred function. This will be fixed in the next version.

xuzheng0017 commented 8 months ago

I have encountered another problem: When I deleted a node in kube-explorer and then returned to the cluster page, the number of nodes did not decrease. I added the node again with the command:

81d5d17a77de:/home/shell # autok3s join -p native --name vps-cargogo --ip xx.xx.xx.xx --ssh-user root --ssh-key-path /root/.autok3s/vps-cargogo/id_rsa --worker-ips xx.xx.xx.xx
time="2023-12-06T14:53:03+08:00" level=info msg="[native] begin to join nodes for vps-cargogo..."
time="2023-12-06T14:53:03+08:00" level=info msg="[native] executing join k3s node logic"
time="2023-12-06T14:53:03+08:00" level=info msg="[native] successfully executed join k3s node logic"
time="2023-12-06T14:53:03+08:00" level=info msg="[native] successfully executed join logic"

xuzheng0017 commented 8 months ago

So the only way to rejoin is to run commands on the node itself?

JacieChao commented 8 months ago

Yes. AutoK3s can't pick up your change because the node was removed manually, so the AutoK3s database was never updated. From AutoK3s's side the node is still in the cluster, which is why you can't rejoin it through AutoK3s. For now, the workaround is to add the node back manually with the K3s CLI.
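
The drift described above can be sketched in Go. The sketch assumes that a join step skips any IP already present in the tool's stored node list, which this thread implies but does not show. `pendingJoins` and `staleRecords` are hypothetical helpers, not AutoK3s functions, and the addresses are documentation-range placeholders.

```go
package main

import "fmt"

// toSet builds a lookup set from a list of node IPs.
func toSet(ips []string) map[string]bool {
	set := make(map[string]bool, len(ips))
	for _, ip := range ips {
		set[ip] = true
	}
	return set
}

// pendingJoins returns the requested IPs that are not in the recorded
// (tool-side) node list. A node that was deleted directly from the cluster is
// still recorded, so it is filtered out and the join has nothing to do.
func pendingJoins(recorded, requested []string) []string {
	known := toSet(recorded)
	var pending []string
	for _, ip := range requested {
		if !known[ip] {
			pending = append(pending, ip)
		}
	}
	return pending
}

// staleRecords returns recorded IPs that no longer exist in the live cluster,
// i.e. the entries a synchronization step would need to clean up.
func staleRecords(recorded, live []string) []string {
	alive := toSet(live)
	var stale []string
	for _, ip := range recorded {
		if !alive[ip] {
			stale = append(stale, ip)
		}
	}
	return stale
}

func main() {
	recorded := []string{"192.0.2.10", "192.0.2.11"} // what the tool believes
	live := []string{"192.0.2.10"}                   // node .11 was deleted in the cluster

	fmt.Println("to join:", pendingJoins(recorded, []string{"192.0.2.11"}))
	fmt.Println("stale:  ", staleRecords(recorded, live))
}
```

Under that assumption, asking to join 192.0.2.11 again yields an empty work list. That is consistent with the join command above reporting success within the same second without touching the node.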

JacieChao commented 7 months ago

Tested with v0.9.2-rc1. AutoK3s now reports the correct cluster status when joining nodes fails. Closing as completed.