coreos / tectonic-forum

Upgrade from 1.7.3-tectonic.3 to 1.7.5-tectonic.1 fails #223

Open nusx opened 6 years ago

nusx commented 6 years ago

Tectonic Version

1.7.5-tectonic.1

Environment

Hardware/cloud provider/hypervisor being used with Tectonic: KVM

Expected Behavior

The automatic upgrade from Tectonic 1.7.3-tectonic.3 to 1.7.5-tectonic.1 completes successfully.

Actual Behavior

The automatic upgrade process runs in two steps: to an intermediate release, and from there to 1.7.5-tectonic.1. The first step succeeds; the upgrade to 1.7.5 fails and leaves the Kubernetes cluster unusable.

Reproduction Steps

  1. Install Tectonic 1.7.3-tectonic.3 with the corresponding darwin installer, selecting the options described in the Environment section.
  2. Log in to the Tectonic Console and configure LDAP authentication.
  3. When the Tectonic Console offers the upgrade to 1.7.5-tectonic.1, click the "Start Upgrade" button.

Other Information

Previous version updates of the cluster were always successful. We used to run the cluster on the pre-production channel and attempted the 1.7.5 upgrade when it was offered there about a month ago. After that resulted in a broken cluster, we re-installed with 1.7.3 and switched to the production channel. Yesterday, Nov. 12, 2017, we re-attempted the upgrade from the production channel, with the same result.

kbrwn commented 6 years ago

@nusx To debug further, please provide the output of the following commands:

kubectl -n tectonic-system logs tectonic-channel-operator-<pod-id>
kubectl -n tectonic-system get appversion tectonic-cluster -o yaml
kubectl -n tectonic-system get crd channeloperatorconfigs.tco.coreos.com -o yaml

Thanks!

nusx commented 6 years ago

@kbrwn This is the output for the given commands on our current 1.7.3-tectonic.3 cluster. tectonic-issue-223.zip

bgroupe commented 6 years ago

@kbrwn Hello, we are also experiencing this issue. Our environment and steps to reproduce are nearly the same; however, we are upgrading from 1.7.3-tectonic.4.

Here is the output from the above commands: tectonic-installer-issue.tar.gz

Our install process remains blocked at Update Appversion Components > Update Appversion Kubernetes. We also noticed in the channel operator logs that this step failed many times:

W1129 01:42:42.206781       1 updater.go:213] Failed to get the AppVersion for "kubernetes": Get https://10.x.x.x:443/apis/tco.coreos.com/v1/namespaces/tectonic-system/appversions/kubernetes: dial tcp 10.x.x.x:443: getsockopt: connection refused, will retry
E1129 01:42:42.504505       1 main.go:103] Failed to determine whether the update check is triggered: Get https://10.x.x.x:443/apis/tco.coreos.com/v1/namespaces/tectonic-system/channeloperatorconfigs/default: dial tcp 10.x.x.x:443: getsockopt: connection refused

Please advise.

kbrwn commented 6 years ago

@bgroupe @nusx The issue is due to the kube-apiserver not responding to requests. This could be due to the apiserver being down or another type of network partition. Could you try deleting the tectonic-channel-operator pod and attempting the update again?

kubectl -n tectonic-system get pods | grep tectonic-channel
tectonic-channel-operator-1030110693-nx4jp               1/1       Running   0          3m

kubectl -n tectonic-system delete pod tectonic-channel-operator-1030110693-g2l6b
pod "tectonic-channel-operator-1030110693-g2l6b" deleted

Please provide the same files again after doing so:

kubectl -n tectonic-system logs tectonic-channel-operator-<pod-id>
kubectl -n tectonic-system get appversion tectonic-cluster -o yaml
kubectl -n tectonic-system get crd channeloperatorconfigs.tco.coreos.com -o yaml

nusx commented 6 years ago

@kbrwn Our cluster has been reinstalled with 1.7.9-tectonic.1. @bgroupe Our upgrade was actually failing from the same version (1.7.3-tectonic.4), but because we started the upgrade process from 1.7.3-tectonic.3, the cluster attempted to upgrade in two steps. The first step, 1.7.3-tectonic.3 -> 1.7.3-tectonic.4, appeared successful (according to the status reported in the UI); then, during the upgrade from 1.7.3-tectonic.4, the cluster became unavailable as reported.

bgroupe commented 6 years ago

@nusx I see, thank you. @kbrwn I deleted the channel operator pod and restarted the upgrade. However, the installer now fails at a much earlier step, Update Tectonic Operators > Update deployment kube-version-operator.

Here is the output from the above commands after failure: tectonic-upgrade-issue-2.tar.gz

knweiss commented 6 years ago

I have the same problem upgrading from 1.7.9-tectonic.2 to 1.7.9-tectonic.3:

While still running 1.7.9-tectonic.2, I had modified the tectonic-channel-operator as described in #235, i.e. I added the proxy env variables to the tectonic-channel-operator deployment. Afterwards, web access of the channel operator worked fine in 1.7.9-tectonic.2.
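
Roughly, the change was equivalent to the following sketch (the proxy values here are placeholders, not my real settings; those are in #235):

# Edit the deployment and add the proxy variables under
# spec.template.spec.containers[0].env (placeholder values):
kubectl -n tectonic-system edit deployment tectonic-channel-operator
#
#   env:
#   - name: HTTP_PROXY
#     value: http://proxy.example.com:3128
#   - name: HTTPS_PROXY
#     value: http://proxy.example.com:3128
#   - name: NO_PROXY
#     value: 127.0.0.1,localhost,<pod CIDR>,<service CIDR>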

However, during the Tectonic update, the tectonic-channel-operator pod now rolls out a new 1.7.9-tectonic.3 tectonic-channel-operator deployment (again without the required proxy env vars):

$ kubectl -n tectonic-system logs  -f tectonic-channel-operator-765513156-8djw5                                                                                                                                                        
[...]
I1220 12:13:47.936285       1 leaderelection.go:184] successfully acquired lease tectonic-system/tectonic-channel-operator
I1220 12:13:53.041044       1 types.go:133] No handler for version "1.7.9-tectonic.2", skip
I1220 12:13:56.053904       1 main.go:508] Tectonic Channel Operator starts watching updates from core update
I1220 12:14:10.460589       1 main.go:406] Updating to target TectonicVersion "1.7.9-tectonic.3"
I1220 12:14:20.086288       1 updater.go:159] Updating deployment "tectonic-channel-operator"
I1220 12:14:20.579526       1 main.go:516] Received signal: terminated
I1220 12:14:20.579564       1 main.go:523] Tectonic Channel Operator exiting...

Unfortunately, I still have not found out how to influence the new tectonic-channel-operator deployment of 1.7.9-tectonic.3 that is started at timestamp 12:14:20.086288 in the example above. The new deployment always appears without the required env vars, creates a new ReplicaSet, and thus a tectonic-channel-operator pod without the required web proxy env vars...

If I try to edit the tectonic-channel-operator deployment, as before, I always seem to be editing the old .2 version and not the new .3 target version.
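
For what it's worth, a sketch of how the live version can be checked (assuming the operator is a single-container deployment whose image tag tracks the Tectonic version):

# Image the live deployment currently points at
kubectl -n tectonic-system get deployment tectonic-channel-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Rollout revisions the channel operator has pushed so far
kubectl -n tectonic-system rollout history deployment/tectonic-channel-operator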

kubectl -n tectonic-system get appversion tectonic-cluster -o yaml indicates that I'm in the middle of the transition phase:

[...]
status:
  currentVersion: 1.7.9-tectonic.2
  paused: false
  targetVersion: 1.7.9-tectonic.3
  taskStatuses:
  - name: Update deployment tectonic-channel-operator
    reason: ""
    state: Running
    type: operator
[...]

Does anyone have a hint how to get the web proxy env vars into the new tectonic-channel-operator deployment?

If only the web proxy env variables were available in the tectonic-config Config Map (the way e.g. CLUSTERID is)!

(This missing web proxy support could be a show stopper for us.)
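
For reference, what the tectonic-config Config Map mentioned above actually exposes can be checked with the following (I'm assuming it lives in the tectonic-system namespace):

# List the keys the ConfigMap currently carries
kubectl -n tectonic-system get configmap tectonic-config -o yaml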

knweiss commented 6 years ago

FWIW: This is still a show stopper for me. Any hint/idea would be appreciated.

seasurfpete commented 6 years ago

Hi @knweiss, I am at exactly the same point as you, going from 1.7.9-tectonic.1 to 1.7.9-tectonic.4: the tectonic-channel-operator keeps updating. What I do seem to have is a no_proxy issue. I had added proxy values to get auto-update discovery working and all the pods talking back to the k8s API.

With no env vars I get a gateway timeout talking to the k8s API at 172.22.0.1:443; adding a no_proxy entry for 172.22.0.1 gets me into the loop of an updating deployment that starts back up without the envs and so fails at the same timeout.

Did you ever find a way of getting into the new deployment?

I'm wondering if editing the 'appversion tectonic-cluster' and setting channel-operator to done (I don't know the actual completed flag) would make it skip to the next section; we'd need to know the correct image version for the container, though.
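
Purely as a sketch, this is what I'd look at before trying that; I don't know whether the operator would simply reconcile a manual edit away:

# Inspect the task statuses (name/reason/state) the operator reports
kubectl -n tectonic-system get appversion tectonic-cluster -o yaml

# Hand-editing those fields is untested and may just get overwritten
kubectl -n tectonic-system edit appversion tectonic-cluster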

knweiss commented 6 years ago

@seasurfpete No, it is still not working. ATM our test project is on hold because of this (also because of the CoreOS takeover situation).

(FWIW: You can see my proxy settings in #235. I've added, e.g., the pod and service IP ranges to no_proxy.)

seasurfpete commented 6 years ago

So @knweiss, I actually got this working on Monday!! I successfully upgraded from 1.7.9-tectonic.1 to 1.7.9-tectonic.4. I was looking at something else with a colleague and noticed that the actual Docker container for the channel-operator was carrying http_proxy vars that were different from my system proxy vars, but none were specified on the deployment. It was still having issues talking to the k8s API on 172.22.0.1 (Gateway Timeout, i.e. hitting the proxy).

I did some looking around and found a ConfigMap called http-proxy-env-config that seemed to be referenced somewhere, though I can't find out where, as it wasn't in the deployment that I saw. I added 172.22.0.1 to no_proxy in there and the upgrade just happened! It worked.
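
For anyone else hitting this, a rough sketch of the change (I'm assuming the ConfigMap sits in tectonic-system and that the key is literally no_proxy; check the existing data first):

# See which keys the ConfigMap actually carries
kubectl -n tectonic-system get configmap http-proxy-env-config -o yaml

# Then append 172.22.0.1 to the proxy-exclusion value, e.g. via an edit
kubectl -n tectonic-system edit configmap http-proxy-env-config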

I then got carried away, changed my release channel to 1.8 Production and attempted a 1.8.4-tectonic.3 upgrade. This managed to upgrade the system to 1.8.4-tectonic.0 AND rebooted all the nodes, and it came back up. It is now failing doing the same thing going up to 1.8.4-tectonic.3, with the channel controller in a loop. This time I get some TLS errors getting https://tectonic.update.core-os.net/_ah/api/update/v1/public/packages: dial tcp 34.192.32.119:443, but it seems to continue, then updates the deployment, which now restarts without any proxy envs at all and can't reach the above address.

mrjoshuak commented 6 years ago

I have a similar if not identical problem (upgrading/switching channels: 1.7.14-tectonic.1 ➝ 1.8.9-tectonic.1). My problem seems, at its root, to be a failure of the upgrade process to handle the Third Party Resource (TPR) to Custom Resource Definition (CRD) migration.

As I've dug into figuring out how to solve this problem, so we can deploy an essential upgrade to our cluster to enable CronJobs, I've discovered that there are countless possibly related, possibly unrelated issues across a number of different repos. The investigation quickly explodes into an intractable research problem, as links and theories expand from each thread quadratically (or more), and this is even before I can start learning how to fix it and then implementing that fix.

So can any of you (@nusx, @kbrwn, @bgroupe, @knweiss, @seasurfpete, @esselfour) confirm or eliminate TPR to CRD migration as the root issue?
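
As a starting point, a sketch of how the migration state could be checked on an affected cluster (thirdpartyresources is only served by pre-1.8 API servers, so that command failing is itself informative):

# Old-style Third Party Resources (pre-1.8 apiservers only)
kubectl get thirdpartyresources

# New-style CRDs the channel operator is expected to use after migration
kubectl get crd | grep coreos.com
kubectl get crd channeloperatorconfigs.tco.coreos.com -o yaml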

Thank you very much in advance for your insights.

*As an aside, I'm quite concerned that this situation is not handled correctly by the Tectonic upgrade process to begin with, and it is even more concerning that, when an upgrade fails, Tectonic doesn't quickly give up and return to a clean, non-upgraded state, as we'd expect, for example, from a failed CoreOS upgrade.