coreos / tectonic-forum

Bare Metal Upgrade is hung #296

Open joerawr opened 6 years ago

joerawr commented 6 years ago

Issue Report Template

Tectonic Version

Tectonic 1.8.4-tectonic.4 ➝ 1.8.9-tectonic.1

Environment

What hardware/cloud provider/hypervisor is being used with Tectonic?

Bare Metal

Expected Behavior

Upgrade should complete

Actual Behavior

Upgrade loops at Update node-agent

[Screenshot: tectonicloopingupgrade]

Reproduction Steps

  1. Press the upgrade button
  2. ...

Other Information

Let me know what other logs I can collect to help the upgrade complete.

Feature Request

Environment

What hardware/cloud provider/hypervisor is being used with Tectonic?

Bare-metal Dell servers for the workers and VMware VMs for the masters.

Desired Feature

Other Information

IvanCherepov commented 6 years ago

Could you provide failed node-agent pod logs?

joerawr commented 6 years ago

Sure. I'm not sure which one(s) have failed, so here are all five.

tectonic-system  node-agent-dj2l2  1/1  Running  4  57d
tectonic-system  node-agent-fplfv  1/1  Running  2  57d
tectonic-system  node-agent-mlf6t  1/1  Running  2  57d
tectonic-system  node-agent-q4q9s  1/1  Running  4  57d
tectonic-system  node-agent-szlmp  1/1  Running  4  57d
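
For anyone unsure which pods count as "failed", the listing above can be checked mechanically. The following is a minimal sketch, assuming the rows are in the six-column namespace / name / ready / status / restarts / age layout shown here; `parse_pods` and `not_healthy` are hypothetical helper names, not part of any Tectonic or Kubernetes tooling.

```python
# Sketch only: parses pod rows in the six-column layout pasted in this thread.
# The column meanings (namespace, name, ready, status, restarts, age) are an
# assumption based on the usual `kubectl get pods` output.

def parse_pods(listing):
    """Return one dict per pod row; skips blank lines and a header row."""
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 6 or fields[0] == "NAMESPACE":
            continue
        namespace, name, ready, status, restarts, age = fields
        ready_now, ready_want = (int(n) for n in ready.split("/"))
        rows.append({
            "namespace": namespace,
            "name": name,
            "ready": ready_now,
            "wanted": ready_want,
            "status": status,
            "restarts": int(restarts),
            "age": age,
        })
    return rows


def not_healthy(rows):
    """Names of pods that are not Running or not fully ready."""
    return [r["name"] for r in rows
            if r["status"] != "Running" or r["ready"] != r["wanted"]]


listing = """\
tectonic-system node-agent-dj2l2 1/1 Running 4 57d
tectonic-system node-agent-fplfv 1/1 Running 2 57d
tectonic-system node-agent-mlf6t 1/1 Running 2 57d
tectonic-system node-agent-q4q9s 1/1 Running 4 57d
tectonic-system node-agent-szlmp 1/1 Running 4 57d
"""

pods = parse_pods(listing)
print("not healthy:", not_healthy(pods))
print("restarts:", {p["name"]: p["restarts"] for p in pods})
```

By this check none of the five pods is in a failed state; they differ only in restart count, which is why logs from all of them follow.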

# kubectl logs node-agent-dj2l2 -n tectonic-system
I0509 02:49:41.182278       1 main.go:78] starting node agent, watching node: worker3.preprod.com
I0509 03:00:19.770215       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
I0509 03:00:31.045639       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
I0509 03:00:32.132094       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
I0509 03:00:41.189267       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
I0509 03:00:42.254500       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
I0509 03:00:52.132958       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
W0509 04:00:25.107193       1 reflector.go:326] github.com/coreos-inc/kube-version-operator/cmd/node-agent/main.go:103: watch of *v1.Node ended with: very short watch: github.com/coreos-inc/kube-version-operator/cmd/node-agent/main.go:103: Unexpected watch close - watch lasted less than a second and no items received
E0509 04:00:26.108665       1 reflector.go:199] github.com/coreos-inc/kube-version-operator/cmd/node-agent/main.go:103: Failed to list *v1.Node: Get https://10.3.0.1:443/api/v1/nodes?fieldSelector=metadata.name%3Dworker3.preprod.com: dial tcp 10.3.0.1:443: getsockopt: connection refused
E0509 04:00:27.110001       1 reflector.go:199] github.com/coreos-inc/kube-version-operator/cmd/node-agent/main.go:103: Failed to list *v1.Node: Get https://10.3.0.1:443/api/v1/nodes?fieldSelector=metadata.name%3Dworker3.preprod.com: dial tcp 10.3.0.1:443: getsockopt: connection refused

# kubectl logs  -n tectonic-system node-agent-fplfv
I0509 02:41:07.031962       1 main.go:78] starting node agent, watching node: vmaster0.preprod.com
I0509 03:00:20.772975       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
I0509 03:00:30.446394       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
I0509 03:00:33.649097       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
I0509 03:00:37.043777       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
I0509 03:00:43.669826       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
I0509 03:00:53.686511       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
I0509 03:01:07.044255       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
E0509 04:00:25.145760       1 reflector.go:307] github.com/coreos-inc/kube-version-operator/cmd/node-agent/main.go:103: Failed to watch *v1.Node: Get https://10.3.0.1:443/api/v1/nodes?fieldSelector=metadata.name%3Dvmaster0.preprod.com&watch=true: dial tcp 10.3.0.1:443: getsockopt: connection refused
E0509 04:00:26.147005       1 reflector.go:199] github.com/coreos-inc/kube-version-operator/cmd/node-agent/main.go:103: Failed to list *v1.Node: Get https://10.3.0.1:443/api/v1/nodes?fieldSelector=metadata.name%3Dvmaster0.preprod.com: dial tcp 10.3.0.1:443: getsockopt: connection refused
E0509 04:00:27.148166       1 reflector.go:199] github.com/coreos-inc/kube-version-operator/cmd/node-agent/main.go:103: Failed to list *v1.Node: Get https://10.3.0.1:443/api/v1/nodes?fieldSelector=metadata.name%3Dvmaster0.preprod.com: dial tcp 10.3.0.1:443: getsockopt: connection refused

# kubectl logs  -n tectonic-system node-agent-mlf6t
I0509 04:02:50.066717       1 main.go:78] starting node agent, watching node: vmaster1.preprod.com

# kubectl logs  -n tectonic-system node-agent-q4q9s
I0509 02:43:48.652562       1 main.go:78] starting node agent, watching node: worker0.preprod.com
I0509 03:00:17.571822       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
I0509 03:00:28.227165       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
I0509 03:00:29.115090       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
I0509 03:00:31.413663       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0

# kubectl logs  -n tectonic-system node-agent-szlmp
I0509 02:50:01.962067       1 main.go:78] starting node agent, watching node: worker1.preprod.com
I0509 03:00:18.595144       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
I0509 03:00:27.569860       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
I0509 03:00:31.966333       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
W0509 04:00:25.047181       1 reflector.go:326] github.com/coreos-inc/kube-version-operator/cmd/node-agent/main.go:103: watch of *v1.Node ended with: very short watch: github.com/coreos-inc/kube-version-operator/cmd/node-agent/main.go:103: Unexpected watch close - watch lasted less than a second and no items received
E0509 04:00:26.047631       1 reflector.go:199] github.com/coreos-inc/kube-version-operator/cmd/node-agent/main.go:103: Failed to list *v1.Node: Get https://10.3.0.1:443/api/v1/nodes?fieldSelector=metadata.name%3Dworker1.preprod.com: dial tcp 10.3.0.1:443: getsockopt: connection refused
E0509 04:00:27.048119       1 reflector.go:199] github.com/coreos-inc/kube-version-operator/cmd/node-agent/main.go:103: Failed to list *v1.Node: Get https://10.3.0.1:443/api/v1/nodes?fieldSelector=metadata.name%3Dworker1.preprod.com: dial tcp 10.3.0.1:443: getsockopt: connection refused

IvanCherepov commented 6 years ago

I didn't find any errors in the node-agent logs. Can you please provide the output of kubectl describe appversion kubernetes -n tectonic-system?
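
The node-agent logs earlier in this thread use the glog line format, where the first character of each line is the severity (I, W, E or F). A minimal sketch for pulling out only the warning and error entries is shown here; `problem_lines` is a hypothetical helper name, and the sample lines are copied from the worker3 log above.

```python
import re

# glog header: severity letter, mmdd, hh:mm:ss.uuuuuu, thread id, file:line, "] message"
GLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<tid>\d+) (?P<src>[^\s\]]+:\d+)\] (?P<msg>.*)$"
)


def problem_lines(log_text, severities="WEF"):
    """Return (severity, time, source, message) for lines at the given severities."""
    found = []
    for line in log_text.splitlines():
        m = GLOG_RE.match(line)
        if m and m.group("sev") in severities:
            found.append((m.group("sev"), m.group("time"),
                          m.group("src"), m.group("msg")))
    return found


sample = """\
I0509 02:49:41.182278       1 main.go:78] starting node agent, watching node: worker3.preprod.com
I0509 03:00:19.770215       1 agent.go:175] pulling image: quay.io/coreos/hyperkube:v1.8.7_coreos.0
E0509 04:00:26.108665       1 reflector.go:199] github.com/coreos-inc/kube-version-operator/cmd/node-agent/main.go:103: Failed to list *v1.Node: Get https://10.3.0.1:443/api/v1/nodes?fieldSelector=metadata.name%3Dworker3.preprod.com: dial tcp 10.3.0.1:443: getsockopt: connection refused
"""

for sev, time, src, msg in problem_lines(sample):
    print(sev, time, src, msg[-60:])
```

Run against the full logs, a filter like this would surface the W and E entries around 04:00, all of which report the API server at 10.3.0.1:443 being unreachable. Whether those entries are related to the hang is not established in this thread.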

joliveirinha commented 5 years ago

I have exactly the same problem, but when upgrading from Tectonic 1.9.6-tectonic.1 ➝ 1.9.6-tectonic.2.

First, it got stuck on prewarming images, and I followed this article: https://support.coreos.com/hc/en-us/articles/360006498434-Tectonic-update-stuck-on-pre-warm-cache

Now it is stuck on "Update node-agent".

The output of "kubectl describe appversion kubernetes -n tectonic-system" is the following:

$ kubectl describe appversion kubernetes -n tectonic-system
Name:         kubernetes
Namespace:    tectonic-system
Labels:       managed-by-channel-operator=true
Annotations:  <none>
API Version:  tco.coreos.com/v1
Kind:         AppVersion
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-07-26T13:48:44Z
  Generation:          0
  Resource Version:    53571333
  Self Link:           /apis/tco.coreos.com/v1/namespaces/tectonic-system/appversions/kubernetes
  UID:                 9d0184c8-90da-11e8-ba7e-1a493cb07586
Spec:
  Desired Version:  1.9.6+tectonic.2
  Paused:           false
Status:
  Current Version:  1.9.6+tectonic.1
  Paused:           false
  Target Version:   1.9.6+tectonic.2
  Task Statuses:
    Name:    Update pod-checkpointer
    Reason:
    State:   Completed
    Type:
    Name:    Update node-agent
    Reason:
    State:   Running
    Type:
    Name:    Update kube-apiserver
    Reason:
    State:   NotStarted
    Type:
    Name:    Update kube-scheduler
    Reason:
    State:   NotStarted
    Type:
    Name:    Update kube-controller-manager
    Reason:
    State:   NotStarted
    Type:
    Name:    Update kube-proxy
    Reason:
    State:   NotStarted
    Type:
    Name:    Update Node Updater
    Reason:
    State:   NotStarted
    Type:
    Name:    Update tectonic-identity
    Reason:
    State:   NotStarted
    Type:
    Name:    Update kube-dns
    Reason:
    State:   NotStarted
    Type:
    Name:    Update tectonic-console
    Reason:
    State:   NotStarted
    Type:
    Name:    Update tectonic-identity-api
    Reason:
    State:   NotStarted
    Type:
    Name:    Update tectonic-stats-emitter
    Reason:
    State:   NotStarted
    Type:
    Name:    Update tectonic-ingress-controller
    Reason:
    State:   NotStarted
    Type:
    Name:    Update kube-flannel
    Reason:
    State:   NotStarted
    Type:
    Name:    prewarm-container-image-cache
    Reason:
    State:   Completed
    Type:
Events:      <none>
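
To see at a glance where the upgrade stopped, the Task Statuses section of the output above can be reduced to name/state pairs. This is a minimal sketch that parses the plain-text `kubectl describe` output as pasted here; `task_states` is a hypothetical helper name, and the sample is an abbreviated copy of the output above.

```python
def task_states(describe_text):
    """Return (task name, state) pairs from the 'Task Statuses:' section."""
    tasks = []
    in_tasks = False
    name = None
    for raw in describe_text.splitlines():
        line = raw.strip()
        if line == "Task Statuses:":
            in_tasks = True
            continue
        if not in_tasks:
            continue  # ignore metadata such as the top-level "Name: kubernetes"
        if line.startswith("Events:"):
            break  # end of the status section
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "Name":
            name = value
        elif key == "State" and name is not None:
            tasks.append((name, value))
            name = None
    return tasks


sample = """\
Name:         kubernetes
Status:
  Current Version:  1.9.6+tectonic.1
  Target Version:   1.9.6+tectonic.2
  Task Statuses:
    Name:    Update pod-checkpointer
    Reason:
    State:   Completed
    Type:
    Name:    Update node-agent
    Reason:
    State:   Running
    Type:
    Name:    Update kube-apiserver
    Reason:
    State:   NotStarted
    Type:
Events:      <none>
"""

states = task_states(sample)
unfinished = [(n, s) for n, s in states if s != "Completed"]
print("first unfinished task:", unfinished[0] if unfinished else None)
```

On the full output above, the only task in the Running state is "Update node-agent", and every later task is still NotStarted, which matches the behavior described in this report.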