coreos / tectonic-installer

Install a Kubernetes cluster the CoreOS Tectonic Way: HA, self-hosted, RBAC, etcd Operator, and more
Apache License 2.0
601 stars 266 forks

azure: TPR 404 on prometheus #86

Closed philips closed 7 years ago

philips commented 7 years ago

Workaround was:

sudo bash /opt/tectonic/tectonic.sh kubeconfig /opt/tectonic/tectonic

Error logs:

Mar 20 22:34:35 tectonic-master-000002 bash[2041]: [  331.108351] bootkube[5]:         Pod Status: kube-controller-manager        Running
Mar 20 22:34:35 tectonic-master-000002 bash[2041]: [  331.108610] bootkube[5]: All self-hosted control plane components successfully started
Mar 20 22:34:35 tectonic-master-000002 bash[3102]: Waiting for Kubernetes API...
Mar 20 22:34:40 tectonic-master-000002 bash[3102]: Waiting for Kubernetes components...
Mar 20 22:34:55 tectonic-master-000002 bash[3102]: Creating Tectonic Namespace
Mar 20 22:34:55 tectonic-master-000002 bash[3102]: Creating Initial Roles
Mar 20 22:34:55 tectonic-master-000002 bash[3102]: Creating Tectonic ConfigMaps
Mar 20 22:34:55 tectonic-master-000002 bash[3102]: Creating Tectonic Secrets
Mar 20 22:34:56 tectonic-master-000002 bash[3102]: Creating Tectonic Identity
Mar 20 22:34:56 tectonic-master-000002 bash[3102]: Creating Tectonic Console
Mar 20 22:34:56 tectonic-master-000002 bash[3102]: Creating Tectonic Monitoring
Mar 20 22:34:56 tectonic-master-000002 bash[3102]: Waiting for third-party resource definitions...
Mar 20 22:35:16 tectonic-master-000002 bash[3102]: Failed to create monitoring/prometheus-k8s.json (got 404):
Mar 20 22:35:16 tectonic-master-000002 bash[3102]: {"apiVersion":"monitoring.coreos.com/v1alpha1","kind":"Prometheus","metadata":{"name":"k8s","namespace":"tectonic-system","selfLink":"/apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses/k8s","uid":"7dc23af4-0dbd-11e7-b41c-000d3a1648ea","resourceVersion":"1216","creationTimestamp":"2017-03-20T22:35:16Z","labels":{"prometheus":"k8s"}},"spec":{"replicas":1,"resources":{"limits":{"cpu":"400m","memory":"2000Mi"},"requests":{"cpu":"200m","memory":"1500Mi"}},"serviceAccountName":"prometheus-k8s","version":"v1.5.2"}}
Mar 20 22:35:16 tectonic-master-000002 systemd[1]: tectonic.service: Main process exited, code=exited, status=1/FAILURE
Mar 20 22:35:16 tectonic-master-000002 systemd[1]: Failed to start Bootstrap a Tectonic cluster.
Mar 20 22:35:16 tectonic-master-000002 systemd[1]: tectonic.service: Unit entered failed state.
Mar 20 22:35:16 tectonic-master-000002 systemd[1]: tectonic.service: Failed with result 'exit-code'.
Quentin-M commented 7 years ago

For some reason, the TPR waiting loop exited early (or the API server claimed the TPR was available but then flaked). We could add some retries or increase the delay a bit.

s-urbaniak commented 7 years ago

FWIW, I observed this too while playing with the Azure platform.

philips commented 7 years ago

@Quentin-M A slow link between Quay and Azure?

Quentin-M commented 7 years ago

@philips No. After the TPR gets created, the script waits until the API server recognizes its existence (this is asynchronous) by looking for a 200 OK response rather than a 404 Not Found on the TPR's GET endpoint. Only then will it try to create the monitoring/prometheus-k8s.json resource against it. It appears that the API server acknowledges the existence of the TPR but then sometimes still fails. This is the behavior we have on upstream Tectonic too.

Just to confirm: you are running a single instance of the API server, right? Because the kubeconfig uses an LB endpoint for the API server, if you are running multiple API servers you might get confirmation from one of them while the others are still behind.

As a temporary workaround, if this is a flake on the API server side, we could try adding a small delay at the end of the wait_for_tpr function.
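That workaround could look roughly like this. A hypothetical sketch, not the installer's actual code: `check_tpr` stands in for the real curl probes, and the delay values are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: poll until the TPR looks ready, then sleep an
# extra settle delay so that every API server behind the load balancer
# has a chance to catch up before we create resources against the TPR.

wait_for_tpr() {
  local settle="${1:-10}"   # extra grace period after the first success
  local poll="${2:-5}"      # seconds between probes
  while true; do
    # check_tpr stands in for the real probes (TPR name lookup plus an
    # HTTP 200 on the TPR's GET endpoint).
    if check_tpr; then
      break
    fi
    sleep "$poll"
  done
  # Workaround: one API server answering 200 does not mean the others
  # are ready too, so give them a moment before creating resources.
  sleep "$settle"
}
```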

philips commented 7 years ago

This is a multi-API-server setup. We must have --etcd-quorum-read=true.

Quentin-M commented 7 years ago

That's why. The assumption was that we only had one API server. This issue therefore ALSO affects upstream Tectonic.

philips commented 7 years ago

@Quentin-M gah, ok. Can you fix it here?

s-urbaniak commented 7 years ago

Indeed, when using etcd3, k8s defaults to the no-quorum-read etcd client; see https://github.com/kubernetes/kubernetes/blob/v1.5.4/pkg/storage/storagebackend/factory/etcd3.go#L58-L61.
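For reference, that flag goes on the kube-apiserver command line; an illustrative fragment only (other flags and the real etcd endpoints elided):

```
# Illustrative kube-apiserver flags, not a complete invocation.
kube-apiserver \
  --etcd-servers=https://127.0.0.1:2379 \
  --etcd-quorum-read=true
```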

Added --etcd-quorum-read=true, but still getting the following error consistently:

Mar 22 09:54:14 tectonic-master-000000 bash[15455]: ++ curl -sNL --cacert /tmp/tmp.KfhHd7nvQb --cert /tmp/tmp.DcuZjAOyIG --key /tmp/tmp.L5W4pB2PDN -o /dev/null --write-out '%{http_code}\n' -H 'Content-Type: application/yaml' '-dapiVersion: v1
Mar 22 09:54:14 tectonic-master-000000 bash[15455]: kind: Namespace
Mar 22 09:54:14 tectonic-master-000000 bash[15455]: metadata:
Mar 22 09:54:14 tectonic-master-000000 bash[15455]:   name: tectonic-system' https://sur-k8s.azure.ifup.org:443/api/v1/namespaces
Mar 22 09:54:14 tectonic-master-000000 bash[15455]: + STATUS=000
Mar 22 09:54:14 tectonic-master-000000 bash[15455]: + rm -f /tmp/tmp.KfhHd7nvQb /tmp/tmp.DcuZjAOyIG /tmp/tmp.L5W4pB2PDN
Mar 22 09:54:14 tectonic-master-000000 systemd[1]: tectonic.service: Main process exited, code=exited, status=7/NOTRUNNING
Mar 22 09:54:14 tectonic-master-000000 systemd[1]: Failed to start Bootstrap a Tectonic cluster.

I will try to add --retry and --retry-delay to the create_resource curl.

s-urbaniak commented 7 years ago

adding --retry and --retry-delay helped:

null_resource.tectonic: Still creating... (3m50s elapsed)
null_resource.tectonic: Creation complete

Apply complete! Resources: 102 added, 0 changed, 0 destroyed.
Quentin-M commented 7 years ago

@sozercan hit the issue today as of master. Re-opening.

Mar 31 22:22:11 tectonic-master-000001 bash[7907]: Creating Tectonic Identity
Mar 31 22:22:11 tectonic-master-000001 bash[7907]: Creating Tectonic Console
Mar 31 22:22:11 tectonic-master-000001 bash[7907]: Creating Tectonic Monitoring
Mar 31 22:22:12 tectonic-master-000001 bash[7907]: Waiting for third-party resource definitions...
Mar 31 22:22:22 tectonic-master-000001 bash[7907]: Failed to create monitoring/prometheus-k8s.json (got 404):
Mar 31 22:22:22 tectonic-master-000001 bash[7907]: {"apiVersion":"monitoring.coreos.com/v1alpha1","kind":"Prometheus","metadata":{"name":"k8s","namespace":"tectonic-system","selfLink":"/apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses/k8s","uid":"8309ff63-1660-11e7-a781-000d3a1679e3","resourceVersion":"1003","creationTimestamp":"2017-03-31T22:22:22Z","labels":{"prometheus":"k8s"}},"spec":{"replicas":1,"resources":{"limits":{"cpu":"400m","memory":"2000Mi"},"requests":{"cpu":"200m","memory":"1500Mi"}},"serviceAccountName":"prometheus-k8s","version":"v1.5.2"}}
Mar 31 22:22:22 tectonic-master-000001 systemd[1]: tectonic.service: Main process exited, code=exited, status=1/FAILURE
Mar 31 22:22:22 tectonic-master-000001 systemd[1]: Failed to start Bootstrap a Tectonic cluster.
Mar 31 22:22:22 tectonic-master-000001 systemd[1]: tectonic.service: Unit entered failed state.
Mar 31 22:22:22 tectonic-master-000001 systemd[1]: tectonic.service: Failed with result 'exit-code'.
s-urbaniak commented 7 years ago

I just hit this too.

To be more precise: the 404 no longer happens while waiting for the TPR resource definition to become ready (which is fixed), but while we create a resource of that TPR type afterwards.

A more detailed trace:

Apr 04 09:25:37 sur-master-0 bash[6898]: + echo 'Waiting for third-party resource definitions...'
Apr 04 09:25:37 sur-master-0 bash[6898]: Waiting for third-party resource definitions...
Apr 04 09:25:37 sur-master-0 bash[6898]: + true
Apr 04 09:25:37 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 https://sur-k8s.dev.coreos.systems:443/apis/extensions/v1beta1/thirdpartyresources
Apr 04 09:25:37 sur-master-0 bash[6898]: ++ jq -r '.items[].metadata | select(.name | contains("prometheus.monitoring.coreos.com")) | .name'
Apr 04 09:25:38 sur-master-0 bash[6898]: + local got_name=
Apr 04 09:25:38 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 -o /dev/null --write-out '%{http_code}' https://sur-k8s.dev.coreos.systems:443/apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses
Apr 04 09:25:38 sur-master-0 bash[6898]: + local status=404
Apr 04 09:25:38 sur-master-0 bash[6898]: + '[' '' == prometheus.monitoring.coreos.com ']'
Apr 04 09:25:38 sur-master-0 bash[6898]: + sleep 5
Apr 04 09:25:43 sur-master-0 bash[6898]: + true
Apr 04 09:25:43 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 https://sur-k8s.dev.coreos.systems:443/apis/extensions/v1beta1/thirdpartyresources
Apr 04 09:25:43 sur-master-0 bash[6898]: ++ jq -r '.items[].metadata | select(.name | contains("prometheus.monitoring.coreos.com")) | .name'
Apr 04 09:25:43 sur-master-0 bash[6898]: + local got_name=
Apr 04 09:25:43 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 -o /dev/null --write-out '%{http_code}' https://sur-k8s.dev.coreos.systems:443/apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses
Apr 04 09:25:43 sur-master-0 bash[6898]: + local status=404
Apr 04 09:25:43 sur-master-0 bash[6898]: + '[' '' == prometheus.monitoring.coreos.com ']'
Apr 04 09:25:43 sur-master-0 bash[6898]: + sleep 5
Apr 04 09:25:48 sur-master-0 bash[6898]: + true
Apr 04 09:25:48 sur-master-0 bash[6898]: ++ jq -r '.items[].metadata | select(.name | contains("prometheus.monitoring.coreos.com")) | .name'
Apr 04 09:25:48 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 https://sur-k8s.dev.coreos.systems:443/apis/extensions/v1beta1/thirdpartyresources
Apr 04 09:25:49 sur-master-0 bash[6898]: + local got_name=
Apr 04 09:25:49 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 -o /dev/null --write-out '%{http_code}' https://sur-k8s.dev.coreos.systems:443/apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses
Apr 04 09:25:49 sur-master-0 bash[6898]: + local status=404
Apr 04 09:25:49 sur-master-0 bash[6898]: + '[' '' == prometheus.monitoring.coreos.com ']'
Apr 04 09:25:49 sur-master-0 bash[6898]: + sleep 5
Apr 04 09:25:54 sur-master-0 bash[6898]: + true
Apr 04 09:25:54 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 https://sur-k8s.dev.coreos.systems:443/apis/extensions/v1beta1/thirdpartyresources
Apr 04 09:25:54 sur-master-0 bash[6898]: ++ jq -r '.items[].metadata | select(.name | contains("prometheus.monitoring.coreos.com")) | .name'
Apr 04 09:25:54 sur-master-0 bash[6898]: + local got_name=prometheus.monitoring.coreos.com
Apr 04 09:25:54 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 -o /dev/null --write-out '%{http_code}' https://sur-k8s.dev.coreos.systems:443/apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses
Apr 04 09:25:54 sur-master-0 bash[6898]: + local status=200
Apr 04 09:25:54 sur-master-0 bash[6898]: + '[' prometheus.monitoring.coreos.com == prometheus.monitoring.coreos.com ']'
Apr 04 09:25:54 sur-master-0 bash[6898]: + '[' 200 == 200 ']'
Apr 04 09:25:54 sur-master-0 bash[6898]: + break
Apr 04 09:25:54 sur-master-0 bash[6898]: + create_resource json monitoring/prometheus-k8s.json apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses
Apr 04 09:25:54 sur-master-0 bash[6898]: +++ cat tectonic/monitoring/prometheus-k8s.json
Apr 04 09:25:54 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 -o /dev/null --write-out '%{http_code}\n' -H 'Content-Type: application/json' '-d{
Apr 04 09:25:54 sur-master-0 bash[6898]:   "apiVersion": "monitoring.coreos.com/v1alpha1",
Apr 04 09:25:54 sur-master-0 bash[6898]:   "kind": "Prometheus",
Apr 04 09:25:54 sur-master-0 bash[6898]:   "metadata": {
Apr 04 09:25:54 sur-master-0 bash[6898]:     "name": "k8s",
Apr 04 09:25:54 sur-master-0 bash[6898]:     "namespace": "tectonic-system",
Apr 04 09:25:54 sur-master-0 bash[6898]:     "labels": {
Apr 04 09:25:54 sur-master-0 bash[6898]:       "prometheus": "k8s"
Apr 04 09:25:54 sur-master-0 bash[6898]:     }
Apr 04 09:25:54 sur-master-0 bash[6898]:   },
Apr 04 09:25:54 sur-master-0 bash[6898]:   "spec": {
Apr 04 09:25:54 sur-master-0 bash[6898]:     "replicas": 1,
Apr 04 09:25:54 sur-master-0 bash[6898]:     "version": "v1.5.2",
Apr 04 09:25:54 sur-master-0 bash[6898]:     "serviceAccountName": "prometheus-k8s",
Apr 04 09:25:54 sur-master-0 bash[6898]:     "resources": {
Apr 04 09:25:54 sur-master-0 bash[6898]:       "limits": {
Apr 04 09:25:54 sur-master-0 bash[6898]:         "cpu": "400m",
Apr 04 09:25:54 sur-master-0 bash[6898]:         "memory": "2000Mi"
Apr 04 09:25:54 sur-master-0 bash[6898]:       },
Apr 04 09:25:54 sur-master-0 bash[6898]:       "requests": {
Apr 04 09:25:54 sur-master-0 bash[6898]:         "cpu": "200m",
Apr 04 09:25:54 sur-master-0 bash[6898]:         "memory": "1500Mi"
Apr 04 09:25:54 sur-master-0 bash[6898]:       }
Apr 04 09:25:54 sur-master-0 bash[6898]:     }
Apr 04 09:25:54 sur-master-0 bash[6898]:   }
Apr 04 09:25:54 sur-master-0 bash[6898]: }' https://sur-k8s.dev.coreos.systems:443/apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses
Apr 04 09:25:54 sur-master-0 bash[6898]: + STATUS=404
Apr 04 09:25:54 sur-master-0 bash[6898]: + '[' 404 '!=' 200 ']'
Apr 04 09:25:54 sur-master-0 bash[6898]: + '[' 404 '!=' 201 ']'
Apr 04 09:25:54 sur-master-0 bash[6898]: + '[' 404 '!=' 409 ']'
Apr 04 09:25:54 sur-master-0 bash[6898]: + echo -e 'Failed to create monitoring/prometheus-k8s.json (got 404): '
Apr 04 09:25:54 sur-master-0 bash[6898]: Failed to create monitoring/prometheus-k8s.json (got 404):
Apr 04 09:25:54 sur-master-0 bash[6898]: ++ cat tectonic/monitoring/prometheus-k8s.json
Apr 04 09:25:54 sur-master-0 bash[6898]: + curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 -H 'Content-Type: application/json' '-d{
Apr 04 09:25:54 sur-master-0 bash[6898]:   "apiVersion": "monitoring.coreos.com/v1alpha1",
Apr 04 09:25:54 sur-master-0 bash[6898]:   "kind": "Prometheus",
Apr 04 09:25:54 sur-master-0 bash[6898]:   "metadata": {
Apr 04 09:25:54 sur-master-0 bash[6898]:     "name": "k8s",
Apr 04 09:25:54 sur-master-0 bash[6898]:     "namespace": "tectonic-system",
Apr 04 09:25:54 sur-master-0 bash[6898]:     "labels": {
Apr 04 09:25:54 sur-master-0 bash[6898]:       "prometheus": "k8s"
Apr 04 09:25:54 sur-master-0 bash[6898]:     }
Apr 04 09:25:54 sur-master-0 bash[6898]:   },
Apr 04 09:25:54 sur-master-0 bash[6898]:   "spec": {
Apr 04 09:25:54 sur-master-0 bash[6898]:     "replicas": 1,
Apr 04 09:25:54 sur-master-0 bash[6898]:     "version": "v1.5.2",
Apr 04 09:25:54 sur-master-0 bash[6898]:     "serviceAccountName": "prometheus-k8s",
Apr 04 09:25:54 sur-master-0 bash[6898]:     "resources": {
Apr 04 09:25:54 sur-master-0 bash[6898]:       "limits": {
Apr 04 09:25:54 sur-master-0 bash[6898]:         "cpu": "400m",
Apr 04 09:25:54 sur-master-0 bash[6898]:         "memory": "2000Mi"
Apr 04 09:25:54 sur-master-0 bash[6898]:       },
Apr 04 09:25:54 sur-master-0 bash[6898]:       "requests": {
Apr 04 09:25:54 sur-master-0 bash[6898]:         "cpu": "200m",
Apr 04 09:25:54 sur-master-0 bash[6898]:         "memory": "1500Mi"
Apr 04 09:25:54 sur-master-0 bash[6898]:       }
Apr 04 09:25:54 sur-master-0 bash[6898]:     }
Apr 04 09:25:54 sur-master-0 bash[6898]:   }
Apr 04 09:25:54 sur-master-0 bash[6898]: }' https://sur-k8s.dev.coreos.systems:443/apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses
Apr 04 09:25:54 sur-master-0 bash[6898]: {
Apr 04 09:25:54 sur-master-0 bash[6898]:   "kind": "Status",
Apr 04 09:25:54 sur-master-0 bash[6898]:   "apiVersion": "v1",
Apr 04 09:25:54 sur-master-0 bash[6898]:   "metadata": {},
Apr 04 09:25:54 sur-master-0 bash[6898]:   "status": "Failure",
Apr 04 09:25:54 sur-master-0 bash[6898]:   "message": "the server could not find the requested resource",
Apr 04 09:25:54 sur-master-0 bash[6898]:   "reason": "NotFound",
Apr 04 09:25:54 sur-master-0 bash[6898]:   "details": {},
Apr 04 09:25:54 sur-master-0 bash[6898]:   "code": 404
Apr 04 09:25:54 sur-master-0 bash[6898]: }+ exit 1
Apr 04 09:25:54 sur-master-0 bash[6898]: + rm -f /tmp/tmp.srQHwtBFCU /tmp/tmp.4BwFIypJLQ /tmp/tmp.NQe0ILJy2U
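One way to paper over this would be to also retry the creation itself while the API returns 404, on the assumption that the TPR endpoint eventually propagates to every API server behind the LB. A hedged sketch, not the actual tectonic.sh code: `do_post` is a hypothetical stand-in for the curl POST and is assumed to print the HTTP status code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: retry resource creation while the API server
# answers 404 (TPR not yet visible on this backend), with a bounded
# number of attempts. do_post stands in for the real curl POST and
# must print the HTTP status code it received.

create_with_retry() {
  local max_attempts="${1:-10}" delay="${2:-5}"
  local attempt status
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    status="$(do_post)"
    case "$status" in
      200|201|409) return 0 ;;   # created, or it already exists
      404)         sleep "$delay" ;;  # TPR not propagated yet; retry
      *)           return 1 ;;   # any other status is a hard failure
    esac
  done
  return 1
}
```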
s-urbaniak commented 7 years ago

As discussed with @alexsomesan, the current plan is to move tectonic.sh to kubectl-based logic to avoid all of the above problems.

Quentin-M commented 7 years ago

Sounds straightforward. Does kubectl have retry/wait logic in case a resource type is missing?

alexsomesan commented 7 years ago

It doesn't seem to have any, but just using kubectl will clean up the scripts a bit. Implementing retries around kubectl seems a bit more manageable afterwards.
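Such a retry layer could be as small as a generic wrapper that re-runs a command until its exit code is zero. An illustrative sketch only: the name `with_retries` and the attempt/delay values are invented here, not part of the installer.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run any command until it succeeds, up to a
# bounded number of attempts, with a fixed delay between tries.

with_retries() {
  local max="$1" delay="$2"; shift 2
  local i
  for ((i = 1; i <= max; i++)); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Example usage (would require a live cluster):
#   with_retries 10 5 kubectl apply -f tectonic/monitoring/prometheus-k8s.json
```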

s-urbaniak commented 7 years ago

xref https://github.com/kubernetes/kubernetes/issues/29002 (possibly related)