canonical / microk8s

MicroK8s is a small, fast, single-package Kubernetes for datacenters and the edge.
https://microk8s.io
Apache License 2.0

VPN reset causes all pods to become not ready -> Unknown -> Ready #2096

Open Purneau opened 3 years ago

Purneau commented 3 years ago

We are running MicroK8s, but our applications need a VPN to connect to each other securely. For this, in some situations we have an Ubuntu VM with an active OpenVPN connection, and MicroK8s is installed on this VM.

However, if something changes in this VPN (like a reconnect), the pods in our cluster first become not ready, and after a few minutes their status changes to Unknown. After approximately 20-30 minutes the issue slowly resolves itself and all pods become Ready again (and the application is accessible again).

Is this a known and/or expected issue?

I tried to run microk8s inspect during this issue, but it takes a very long time to run. The result is attached: inspection-report-20210316_113756.tar.zip

Thanks for your help

AndrzejOlender commented 3 years ago

I have the same problem. OpenVPN connection/disconnection and problems with Unknown.

ktsakalozos commented 3 years ago

Hi @Purneau @AndrzejOlender, this behavior is partially expected. A new interface comes up and MicroK8s needs to reconfigure itself to take the change into account. The 20-30 minutes, however, is too long! I need to spend more time on that.

One way to stop this reconfiguration is to edit /var/snap/microk8s/current/args/kube-apiserver and set either the --advertise-address or the --bind-address according to [1] and then restart microk8s one last time with microk8s.stop; microk8s.start.

[1] https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/

Purneau commented 3 years ago

Hi @ktsakalozos, please forgive my ignorance, but based on the tarball: what would be the IP that I would need to set as --advertise-address or the --bind-address to minimise this behaviour?

ktsakalozos commented 3 years ago

@Purneau the most straightforward option is to specify the interface on which the API server should be available. You can use --bind-address for that. You can see the available interfaces with ip a.
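As a rough sketch of the edit described above, the helper below makes sure an args file contains exactly one line for a given flag. The function name set_arg is hypothetical (it is not part of MicroK8s), and the sketch assumes the args file holds one `--flag=value` entry per line.

```shell
#!/usr/bin/env bash
# set_arg FILE FLAG VALUE
# Hypothetical helper (not shipped with MicroK8s): ensure FILE contains
# exactly one "FLAG=VALUE" line, assuming one flag per line in FILE.
set_arg() {
    local file="$1" flag="$2" value="$3"
    local tmp
    tmp="$(mktemp)"
    # Keep every line that does not already set this flag.
    # grep exits non-zero when it prints nothing, so tolerate that.
    grep -v -- "^${flag}[= ]" "$file" > "$tmp" || true
    # Append the new setting.
    printf '%s=%s\n' "$flag" "$value" >> "$tmp"
    # Write back in place so ownership and permissions of FILE are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

It could be used as root along the lines of `set_arg /var/snap/microk8s/current/args/kube-apiserver --bind-address 192.0.2.10`, followed by microk8s.stop; microk8s.start. The address 192.0.2.10 is only a placeholder; pick one from the output of ip a, and check that binding to a single address suits your setup before relying on it.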

devZer0 commented 3 years ago

i think that behaviour can be expected, imho it's how the code in microk8s currently works.

by chance i had a look at that part today and found this ticket while searching this tracker for apiservice-kicker (because i think it does not handle things efficiently: for example, it constantly rebuilds a csr.conf to check whether the ip addresses changed, which could be done more simply and without hitting the disk at short intervals. why aren't the ip addresses themselves compared, rather than something generated from them? i think that could be done without disk writes at all. disk writes are precious, especially in times of flash storage, which degrades...)

it looks like apiservice-kicker restarts services when it detects changed certs, and the certs change when an ip address changes on the server. imho, that triggers whenever an additional ip address appears or disappears on the system (hostname -I should show this), which would apply to a temporary vpn connection...

have a look

/snap/microk8s/2074/apiservice-kicker

        csr_modified="$(produce_certs)"
        if [[ "$csr_modified" -eq "1" ]];
        then
            echo "CSR change detected. Reconfiguring the kube-apiserver"
            rm -rf .srl
            snapctl restart microk8s.daemon-etcd
            snapctl restart microk8s.daemon-containerd
            snapctl restart microk8s.daemon-apiserver
            snapctl restart microk8s.daemon-proxy
            snapctl restart microk8s.daemon-kubelet
            restart_attempt=$[$restart_attempt+1]

/snap/microk8s/current/actions/common/utils.sh           

get_ips() {
    local IP_ADDR="$($SNAP/bin/hostname -I)"
    if [[ -z "$IP_ADDR" ]]
    then
        echo "none"
    else
        if $SNAP/sbin/ifconfig cni0 &> /dev/null
        then
          CNI_IP="$($SNAP/sbin/ip -o -4 addr list cni0 | $SNAP/usr/bin/gawk '{print $4}' | $SNAP/usr/bin/cut -d/ -f1 | head -1)"
          local ips="";
          for ip in $IP_ADDR
          do
            [ "$ip" != "$CNI_IP" ] && ips+="${ips:+ }$ip";
          done
          IP_ADDR="$ips"
        fi
        echo "${IP_ADDR}"
    fi
}

render_csr_conf() {
    # Render csr.conf.template to csr.conf.rendered

    local IP_ADDRESSES="$(get_ips)"

    cp ${SNAP_DATA}/certs/csr.conf.template ${SNAP_DATA}/certs/csr.conf.rendered
    if ! [ "$IP_ADDRESSES" == "127.0.0.1" ] && ! [ "$IP_ADDRESSES" == "none" ]
    then
        local ips='' sep=''
        local -i i=3
        for IP_ADDR in $(echo "$IP_ADDRESSES"); do
            ips+="${sep}IP.$((i++)) = ${IP_ADDR}"
            sep='\n'
        done
        "$SNAP/bin/sed" -i "s/#MOREIPS/${ips}/g" ${SNAP_DATA}/certs/csr.conf.rendered
    else
        "$SNAP/bin/sed" -i 's/#MOREIPS//g' ${SNAP_DATA}/certs/csr.conf.rendered
    fi
}
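The idea of comparing the addresses directly, without rendering a file first, could look roughly like the sketch below. The function names normalize_ips and ips_changed are hypothetical and do not exist in MicroK8s; this illustrates the suggestion and is not how the kicker is implemented.

```shell
#!/usr/bin/env bash
# normalize_ips "ip1 ip2 ...": print the addresses one per line,
# sorted and de-duplicated, so that ordering does not matter.
normalize_ips() {
    # $1 is intentionally unquoted: split the list on whitespace.
    printf '%s\n' $1 | sort -u
}

# ips_changed OLD NEW: succeed (exit 0) if the two address sets differ,
# fail (exit 1) if they are the same. Nothing is written to disk.
ips_changed() {
    [ "$(normalize_ips "$1")" != "$(normalize_ips "$2")" ]
}

# Example polling loop (commented out so that sourcing this file is safe):
# prev="$(hostname -I)"
# while sleep 5; do
#     cur="$(hostname -I)"
#     if ips_changed "$prev" "$cur"; then
#         echo "IP address set changed"
#         prev="$cur"
#     fi
# done
```

With this approach, the certificate would only need to be regenerated after ips_changed reports a difference.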

man hostname 

 -I, --all-ip-addresses
              Display all network addresses of the host. This option enumerates all configured addresses on all network interfaces.

devZer0 commented 3 years ago

also see https://github.com/ubuntu/microk8s/issues/1943

kevin-david commented 3 years ago

I also have this problem but the pods never come back - they get stuck in Unknown. in my case, my internet connection seems to be dying overnight leading to pod "failure"

i'm using DHCP with the host machine, and it's getting the same IPv4 address but I assume the v6 address is changing.

this network reset also causes pod restart to fail for me - image pull just hangs - unless i stop/start microk8s afterwards. I reported this behavior over here: https://github.com/ubuntu/microk8s/issues/1113#issuecomment-771790841. maybe this has something to do with why they get stuck in this state.

here is another inspection report from this after the pods got stuck in "unknown" status: inspection-report-20210407_123827.tar.gz

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

devZer0 commented 1 year ago

recent activity because stale bots suck

neoaggelos commented 1 year ago

Hi @devZer0

I'm going to attempt to give some context around this issue, how it relates to MicroK8s and why it should not be an issue any longer in newer MicroK8s versions (1.22+). Happy to discuss this further if this is still an issue for you or anyone else.

First of all, at the time this issue was created, MicroK8s was mostly meant as a developer tool, meaning that the MicroK8s team integrated as many ease-of-use hooks as possible to make sure it was friction-free for basic usage. One of them was ensuring that the kube-apiserver certs included all the IP addresses of the host machine. Developer machines do not have a static IP address, and they may even move between environments (e.g. home <--> office). Therefore, MicroK8s included a quick check (hostname -I) to detect changes and automatically refresh the certificates as needed.

One side-effect of this was that the API server had to be restarted for the new certificates to take effect. In MicroK8s 1.21 and earlier, this had a domino effect where it would take down kubelet and the container runtime, and this would kill all cluster workloads. This is the behavior described in the original issue, which shows up as Pods going Ready -> Unknown -> ...

This has not been the case for quite some time now. Starting from MicroK8s 1.22 onwards, kube-apiserver restarts do not kill the cluster workloads, so the issue described above would not occur at all. Further, there have been two additions to help mitigate this for deployments where it may be problematic:

To re-iterate, this should no longer restart workloads in MicroK8s 1.22 (released August 2021) or newer. Are you still affected by this issue? If so, please let's keep this discussion going, we're keen on seeing what we can improve on MicroK8s for this.

Also of note, there are quite a few duplicate GitHub issues for this specific problem, so it's only logical that some are missed and not updated. Apologies for this; I can assure you it is in the best interest of the team to ensure that we improve on this going forward.

devZer0 commented 1 year ago

thank you for the in-depth explanation. very valuable/appreciated !

the problem is, i have seen too many issue trackers where issues get closed by stale bots instead of being looked at or resolved.

this is frustrating for users reporting issues and for all contributors who add information to the issues. they get scared away by this.

that's the reason why i started adding anti-stale-bot posts.

i know that developers need to keep focus, but i think issue reports should never get closed because of inactivity.

neoaggelos commented 1 year ago

I agree with you in general. However, given the number of duplicate issues (for example, a recent one being https://github.com/canonical/microk8s/issues/3575), it would make sense to keep the issues that are "actionable", or the ones where the original poster is coming back with more information and giving feedback so that the issue is resolved.

To be honest, I very much agree with your sentiment: The measure of success for a project is not the number of closed issues, but rather the engagement with the community and the resolution of ongoing problems users are having.

D5Sammy commented 1 year ago

For the record: I am using a version much more recent than v1.22 and the issue is still happening. Everything on my MicroK8s goes Unknown when restarting a WireGuard interface:

sudo systemctl restart wg-quick@wg

microk8s version
MicroK8s v1.28.0 revision 5788

journalctl -n 1000 -u snap.microk8s.daemon-apiserver-kicker:

Sep 07 12:47:51 rmbm1 microk8s.daemon-apiserver-kicker[279491]: Signature ok
Sep 07 12:47:51 rmbm1 microk8s.daemon-apiserver-kicker[279491]: subject=C = GB, ST = Canonical, L = Canonical, O = Canonical, OU = Canonical, CN = 127.0.0.1
Sep 07 12:47:51 rmbm1 microk8s.daemon-apiserver-kicker[279491]: Getting CA Private Key
Sep 07 12:47:51 rmbm1 microk8s.daemon-apiserver-kicker[279507]: Signature ok
Sep 07 12:47:51 rmbm1 microk8s.daemon-apiserver-kicker[279507]: subject=CN = front-proxy-client
Sep 07 12:47:51 rmbm1 microk8s.daemon-apiserver-kicker[279507]: Getting CA Private Key
Sep 07 12:47:51 rmbm1 microk8s.daemon-apiserver-kicker[1199479]: cert change detected. Restarting the cluster-agent
Sep 07 12:47:51 rmbm1 microk8s.daemon-apiserver-kicker[1199479]: cert change detected. Reconfiguring the kube-apiserver
Sep 07 12:47:51 rmbm1 sudo[279599]: root : PWD=/var/snap/microk8s/5788 ; USER=root ; ENV=LD_LIBRARY_PATH=/var/lib/snapd/lib/gl:/var/lib/snapd/lib/gl32:/var/lib/snapd/void:/snap/microk8s/5788/lib:/snap/microk8s/5788/usr/>
Sep 07 12:47:51 rmbm1 sudo[279599]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Sep 07 12:47:51 rmbm1 sudo[279599]: pam_unix(sudo:session): session closed for user root
Sep 07 12:47:51 rmbm1 sudo[279605]: root : PWD=/var/snap/microk8s/5788 ; USER=root ; ENV=LD_LIBRARY_PATH=/var/lib/snapd/lib/gl:/var/lib/snapd/lib/gl32:/var/lib/snapd/void:/snap/microk8s/5788/lib:/snap/microk8s/5788/usr/>
Sep 07 12:47:51 rmbm1 sudo[279605]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Sep 07 12:47:51 rmbm1 sudo[279605]: pam_unix(sudo:session): session closed for user root
Sep 07 12:47:58 rmbm1 sudo[280525]: root : PWD=/var/snap/microk8s/5788 ; USER=root ; ENV=LD_LIBRARY_PATH=/snap/microk8s/5788/lib:/snap/microk8s/5788/usr/lib:/snap/microk8s/5788/lib/x86_64-linux-gnu:/snap/microk8s/5788/u>
Sep 07 12:47:58 rmbm1 sudo[280525]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Sep 07 12:47:59 rmbm1 sudo[280525]: pam_unix(sudo:session): session closed for user root
Sep 07 12:48:05 rmbm1 microk8s.daemon-apiserver-kicker[285520]: Signature ok
Sep 07 12:48:05 rmbm1 microk8s.daemon-apiserver-kicker[285520]: subject=C = GB, ST = Canonical, L = Canonical, O = Canonical, OU = Canonical, CN = 127.0.0.1
Sep 07 12:48:05 rmbm1 microk8s.daemon-apiserver-kicker[285520]: Getting CA Private Key
Sep 07 12:48:05 rmbm1 microk8s.daemon-apiserver-kicker[285545]: Signature ok
Sep 07 12:48:05 rmbm1 microk8s.daemon-apiserver-kicker[285545]: subject=CN = front-proxy-client
Sep 07 12:48:05 rmbm1 microk8s.daemon-apiserver-kicker[285545]: Getting CA Private Key
Sep 07 12:48:05 rmbm1 microk8s.daemon-apiserver-kicker[1199479]: cert change detected. Restarting the cluster-agent
Sep 07 12:48:05 rmbm1 microk8s.daemon-apiserver-kicker[1199479]: cert change detected. Reconfiguring the kube-apiserver
Sep 07 12:48:06 rmbm1 sudo[286473]: root : PWD=/var/snap/microk8s/5788 ; USER=root ; ENV=LD_LIBRARY_PATH=/var/lib/snapd/lib/gl:/var/lib/snapd/lib/gl32:/var/lib/snapd/void:/snap/microk8s/5788/lib:/snap/microk8s/5788/usr/>
Sep 07 12:48:06 rmbm1 sudo[286473]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Sep 07 12:48:06 rmbm1 sudo[286473]: pam_unix(sudo:session): session closed for user root
Sep 07 12:48:06 rmbm1 sudo[286479]: root : PWD=/var/snap/microk8s/5788 ; USER=root ; ENV=LD_LIBRARY_PATH=/var/lib/snapd/lib/gl:/var/lib/snapd/lib/gl32:/var/lib/snapd/void:/snap/microk8s/5788/lib:/snap/microk8s/5788/usr/>
Sep 07 12:48:06 rmbm1 sudo[286479]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Sep 07 12:48:06 rmbm1 sudo[286479]: pam_unix(sudo:session): session closed for user root
Sep 07 12:48:09 rmbm1 sudo[287318]: root : PWD=/var/snap/microk8s/5788 ; USER=root ; ENV=LD_LIBRARY_PATH=/snap/microk8s/5788/lib:/snap/microk8s/5788/usr/lib:/snap/microk8s/5788/lib/x86_64-linux-gnu:/snap/microk8s/5788/u>
Sep 07 12:48:09 rmbm1 sudo[287318]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Sep 07 12:48:09 rmbm1 sudo[287318]: pam_unix(sudo:session): session closed for user root

This would be resolved by:

sudo touch /var/snap/microk8s/current/var/lock/no-cert-reissue
sudo microk8s stop
sudo microk8s start
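To check whether the kicker is still reacting to interface changes, one option is to count how often the "Reconfiguring the kube-apiserver" message appears in the journal. The function name count_kicker_restarts below is hypothetical, and the sketch assumes the log message text matches the output shown in this comment.

```shell
#!/usr/bin/env bash
# count_kicker_restarts: read journal lines on stdin and print how many
# times the kicker reported reconfiguring the kube-apiserver.
# The message text is taken from the log output quoted in this thread.
count_kicker_restarts() {
    # -F: match the text literally; -c: print the number of matching lines.
    # grep exits non-zero when the count is 0, so tolerate that.
    grep -cF 'cert change detected. Reconfiguring the kube-apiserver' || true
}
```

For example: `journalctl -n 1000 -u snap.microk8s.daemon-apiserver-kicker | count_kicker_restarts`. If the lock file takes effect, this count should stop growing when the WireGuard interface is restarted.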

neoaggelos commented 12 months ago

Hi @D5Sammy

Do you also experience any workloads restarting when that happens? Can you see if the following fixes your problem:

sudo touch /var/snap/microk8s/current/var/lock/no-cert-reissue

stale[bot] commented 4 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.