For some reason, the TPR waiting loop exited early (or the API server claimed the TPR was available but then flaked). We could add some retries or increase the delay a bit.
fwiw I observed this too while playing with the azure platform
@Quentin-M slow link between quay and azure?
@philips No. After the TPR got created, the script waits until the API server recognizes its existence (it is async), by looking for a 200 OK response rather than a 404 Not Found on the GET endpoint of the TPR. Only then does it try to create the monitoring/prometheus-k8s.json resource against it. It appears that the API server acknowledges the existence of the TPR but then sometimes still fails. This is the behavior we have on upstream Tectonic too.
Just to confirm, you are running a single instance of the API server, right? If the kubeconfig uses an LB endpoint of the API server and you are running multiple API servers, you might get confirmation from one API server while the others are still behind.
As a temporary workaround, if this is a flake from the API server, we could try to add a small delay at the end of the wait_for_tpr function.
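A minimal sketch of that workaround, assuming wait_for_tpr polls the TPR's GET endpoint as described above (APISERVER, the poll interval, and the grace delay are placeholders; the real script also passes --cacert/--cert/--key to curl):

wait_for_tpr() {
  # Poll until the TPR's GET endpoint returns 200 OK instead of 404 Not Found.
  until [ "$(curl -sNL -o /dev/null --write-out '%{http_code}' \
      "${APISERVER}/apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses")" = "200" ]; do
    sleep 5
  done
  # Grace delay: with several API servers behind an LB, one instance may
  # already answer 200 while the others still answer 404.
  sleep 10
}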
This is on a multi-API-server setup. We must have --etcd-quorum-read=true.
That's why. The assumption was that we only had one API server. This issue is therefore ALSO affecting upstream Tectonic.
@Quentin-M gah, ok. Can you fix it here?
Indeed, with etcd3, k8s defaults to the no-quorum-read etcd client; see https://github.com/kubernetes/kubernetes/blob/v1.5.4/pkg/storage/storagebackend/factory/etcd3.go#L58-L61.
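For reference, the flag goes on the kube-apiserver invocation (shown in isolation here; the flag exists as of Kubernetes 1.5.x):

# Force quorum (linearizable) reads against etcd so that all API servers
# observe the same state.
kube-apiserver --etcd-quorum-read=true  # ...plus the existing apiserver flags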
Added --etcd-quorum-read=true, but still getting the following error consistently:
Mar 22 09:54:14 tectonic-master-000000 bash[15455]: ++ curl -sNL --cacert /tmp/tmp.KfhHd7nvQb --cert /tmp/tmp.DcuZjAOyIG --key /tmp/tmp.L5W4pB2PDN -o /dev/null --write-out '%{http_code}\n' -H 'Content-Type: application/yaml' '-dapiVersion: v1
Mar 22 09:54:14 tectonic-master-000000 bash[15455]: kind: Namespace
Mar 22 09:54:14 tectonic-master-000000 bash[15455]: metadata:
Mar 22 09:54:14 tectonic-master-000000 bash[15455]: name: tectonic-system' https://sur-k8s.azure.ifup.org:443/api/v1/namespaces
Mar 22 09:54:14 tectonic-master-000000 bash[15455]: + STATUS=000
Mar 22 09:54:14 tectonic-master-000000 bash[15455]: + rm -f /tmp/tmp.KfhHd7nvQb /tmp/tmp.DcuZjAOyIG /tmp/tmp.L5W4pB2PDN
Mar 22 09:54:14 tectonic-master-000000 systemd[1]: tectonic.service: Main process exited, code=exited, status=7/NOTRUNNING
Mar 22 09:54:14 tectonic-master-000000 systemd[1]: Failed to start Bootstrap a Tectonic cluster.
I will try to add --retry and --retry-delay to the create_resource curl.
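Roughly like this (the flag values match what appears in the later traces; certificate paths and the manifest/endpoint variables are placeholders, and the rest of the invocation is abbreviated):

# Retry transient failures (including connection refused) instead of
# failing the bootstrap on the first flake.
curl -sSNL \
  --cacert "$CA_CERT" --cert "$CLIENT_CERT" --key "$CLIENT_KEY" \
  --retry-connrefused --retry 3 --retry-delay 2 \
  -H "Content-Type: application/json" \
  -d "$(cat "$MANIFEST")" \
  "${APISERVER}/${API_PATH}"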
Adding --retry and --retry-delay helped:
null_resource.tectonic: Still creating... (3m50s elapsed)
null_resource.tectonic: Creation complete
Apply complete! Resources: 102 added, 0 changed, 0 destroyed.
@sozercan hit the issue today as of master. Re-opening.
Mar 31 22:22:11 tectonic-master-000001 bash[7907]: Creating Tectonic Identity
Mar 31 22:22:11 tectonic-master-000001 bash[7907]: Creating Tectonic Console
Mar 31 22:22:11 tectonic-master-000001 bash[7907]: Creating Tectonic Monitoring
Mar 31 22:22:12 tectonic-master-000001 bash[7907]: Waiting for third-party resource definitions...
Mar 31 22:22:22 tectonic-master-000001 bash[7907]: Failed to create monitoring/prometheus-k8s.json (got 404):
Mar 31 22:22:22 tectonic-master-000001 bash[7907]: {"apiVersion":"monitoring.coreos.com/v1alpha1","kind":"Prometheus","metadata":{"name":"k8s","namespace":"tectonic-system","selfLink":"/apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses/k8s","uid":"8309ff63-1660-11e7-a781-000d3a1679e3","resourceVersion":"1003","creationTimestamp":"2017-03-31T22:22:22Z","labels":{"prometheus":"k8s"}},"spec":{"replicas":1,"resources":{"limits":{"cpu":"400m","memory":"2000Mi"},"requests":{"cpu":"200m","memory":"1500Mi"}},"serviceAccountName":"prometheus-k8s","version":"v1.5.2"}}
Mar 31 22:22:22 tectonic-master-000001 systemd[1]: tectonic.service: Main process exited, code=exited, status=1/FAILURE
Mar 31 22:22:22 tectonic-master-000001 systemd[1]: Failed to start Bootstrap a Tectonic cluster.
Mar 31 22:22:22 tectonic-master-000001 systemd[1]: tectonic.service: Unit entered failed state.
Mar 31 22:22:22 tectonic-master-000001 systemd[1]: tectonic.service: Failed with result 'exit-code'.
I just hit this too.
To be more precise, the 404 now does not happen while waiting for the TPR definition to be ready (which is fixed), but while we create the TPR instance (monitoring/prometheus-k8s.json) afterwards.
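Since the wait loop can get its 200 from one API server while the LB routes the subsequent POST to another that has not registered the TPR yet, one option would be to retry the create on 404 as well. A rough sketch, assuming create_resource were changed to return non-zero rather than exit on failure (the attempt count and delay are arbitrary):

# Keep retrying while some API server(s) behind the LB still answer 404
# for the freshly-registered TPR endpoint.
for attempt in $(seq 1 10); do
  create_resource json monitoring/prometheus-k8s.json \
    apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses \
    && break
  echo "create_resource failed (attempt ${attempt}), retrying..." >&2
  sleep 5
done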
A more detailed trace:
Apr 04 09:25:37 sur-master-0 bash[6898]: + echo 'Waiting for third-party resource definitions...'
Apr 04 09:25:37 sur-master-0 bash[6898]: Waiting for third-party resource definitions...
Apr 04 09:25:37 sur-master-0 bash[6898]: + true
Apr 04 09:25:37 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 https://sur-k8s.dev.coreos.systems:443/apis/extensions/v1beta1/thirdpartyresources
Apr 04 09:25:37 sur-master-0 bash[6898]: ++ jq -r '.items[].metadata | select(.name | contains("prometheus.monitoring.coreos.com")) | .name'
Apr 04 09:25:38 sur-master-0 bash[6898]: + local got_name=
Apr 04 09:25:38 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 -o /dev/null --write-out '%{http_code}' https://sur-k8s.dev.coreos.systems:443/apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses
Apr 04 09:25:38 sur-master-0 bash[6898]: + local status=404
Apr 04 09:25:38 sur-master-0 bash[6898]: + '[' '' == prometheus.monitoring.coreos.com ']'
Apr 04 09:25:38 sur-master-0 bash[6898]: + sleep 5
Apr 04 09:25:43 sur-master-0 bash[6898]: + true
Apr 04 09:25:43 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 https://sur-k8s.dev.coreos.systems:443/apis/extensions/v1beta1/thirdpartyresources
Apr 04 09:25:43 sur-master-0 bash[6898]: ++ jq -r '.items[].metadata | select(.name | contains("prometheus.monitoring.coreos.com")) | .name'
Apr 04 09:25:43 sur-master-0 bash[6898]: + local got_name=
Apr 04 09:25:43 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 -o /dev/null --write-out '%{http_code}' https://sur-k8s.dev.coreos.systems:443/apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses
Apr 04 09:25:43 sur-master-0 bash[6898]: + local status=404
Apr 04 09:25:43 sur-master-0 bash[6898]: + '[' '' == prometheus.monitoring.coreos.com ']'
Apr 04 09:25:43 sur-master-0 bash[6898]: + sleep 5
Apr 04 09:25:48 sur-master-0 bash[6898]: + true
Apr 04 09:25:48 sur-master-0 bash[6898]: ++ jq -r '.items[].metadata | select(.name | contains("prometheus.monitoring.coreos.com")) | .name'
Apr 04 09:25:48 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 https://sur-k8s.dev.coreos.systems:443/apis/extensions/v1beta1/thirdpartyresources
Apr 04 09:25:49 sur-master-0 bash[6898]: + local got_name=
Apr 04 09:25:49 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 -o /dev/null --write-out '%{http_code}' https://sur-k8s.dev.coreos.systems:443/apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses
Apr 04 09:25:49 sur-master-0 bash[6898]: + local status=404
Apr 04 09:25:49 sur-master-0 bash[6898]: + '[' '' == prometheus.monitoring.coreos.com ']'
Apr 04 09:25:49 sur-master-0 bash[6898]: + sleep 5
Apr 04 09:25:54 sur-master-0 bash[6898]: + true
Apr 04 09:25:54 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 https://sur-k8s.dev.coreos.systems:443/apis/extensions/v1beta1/thirdpartyresources
Apr 04 09:25:54 sur-master-0 bash[6898]: ++ jq -r '.items[].metadata | select(.name | contains("prometheus.monitoring.coreos.com")) | .name'
Apr 04 09:25:54 sur-master-0 bash[6898]: + local got_name=prometheus.monitoring.coreos.com
Apr 04 09:25:54 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 -o /dev/null --write-out '%{http_code}' https://sur-k8s.dev.coreos.systems:443/apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses
Apr 04 09:25:54 sur-master-0 bash[6898]: + local status=200
Apr 04 09:25:54 sur-master-0 bash[6898]: + '[' prometheus.monitoring.coreos.com == prometheus.monitoring.coreos.com ']'
Apr 04 09:25:54 sur-master-0 bash[6898]: + '[' 200 == 200 ']'
Apr 04 09:25:54 sur-master-0 bash[6898]: + break
Apr 04 09:25:54 sur-master-0 bash[6898]: + create_resource json monitoring/prometheus-k8s.json apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses
Apr 04 09:25:54 sur-master-0 bash[6898]: +++ cat tectonic/monitoring/prometheus-k8s.json
Apr 04 09:25:54 sur-master-0 bash[6898]: ++ curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 -o /dev/null --write-out '%{http_code}\n' -H 'Content-Type: application/json' '-d{
Apr 04 09:25:54 sur-master-0 bash[6898]: "apiVersion": "monitoring.coreos.com/v1alpha1",
Apr 04 09:25:54 sur-master-0 bash[6898]: "kind": "Prometheus",
Apr 04 09:25:54 sur-master-0 bash[6898]: "metadata": {
Apr 04 09:25:54 sur-master-0 bash[6898]: "name": "k8s",
Apr 04 09:25:54 sur-master-0 bash[6898]: "namespace": "tectonic-system",
Apr 04 09:25:54 sur-master-0 bash[6898]: "labels": {
Apr 04 09:25:54 sur-master-0 bash[6898]: "prometheus": "k8s"
Apr 04 09:25:54 sur-master-0 bash[6898]: }
Apr 04 09:25:54 sur-master-0 bash[6898]: },
Apr 04 09:25:54 sur-master-0 bash[6898]: "spec": {
Apr 04 09:25:54 sur-master-0 bash[6898]: "replicas": 1,
Apr 04 09:25:54 sur-master-0 bash[6898]: "version": "v1.5.2",
Apr 04 09:25:54 sur-master-0 bash[6898]: "serviceAccountName": "prometheus-k8s",
Apr 04 09:25:54 sur-master-0 bash[6898]: "resources": {
Apr 04 09:25:54 sur-master-0 bash[6898]: "limits": {
Apr 04 09:25:54 sur-master-0 bash[6898]: "cpu": "400m",
Apr 04 09:25:54 sur-master-0 bash[6898]: "memory": "2000Mi"
Apr 04 09:25:54 sur-master-0 bash[6898]: },
Apr 04 09:25:54 sur-master-0 bash[6898]: "requests": {
Apr 04 09:25:54 sur-master-0 bash[6898]: "cpu": "200m",
Apr 04 09:25:54 sur-master-0 bash[6898]: "memory": "1500Mi"
Apr 04 09:25:54 sur-master-0 bash[6898]: }
Apr 04 09:25:54 sur-master-0 bash[6898]: }
Apr 04 09:25:54 sur-master-0 bash[6898]: }
Apr 04 09:25:54 sur-master-0 bash[6898]: }' https://sur-k8s.dev.coreos.systems:443/apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses
Apr 04 09:25:54 sur-master-0 bash[6898]: + STATUS=404
Apr 04 09:25:54 sur-master-0 bash[6898]: + '[' 404 '!=' 200 ']'
Apr 04 09:25:54 sur-master-0 bash[6898]: + '[' 404 '!=' 201 ']'
Apr 04 09:25:54 sur-master-0 bash[6898]: + '[' 404 '!=' 409 ']'
Apr 04 09:25:54 sur-master-0 bash[6898]: + echo -e 'Failed to create monitoring/prometheus-k8s.json (got 404): '
Apr 04 09:25:54 sur-master-0 bash[6898]: Failed to create monitoring/prometheus-k8s.json (got 404):
Apr 04 09:25:54 sur-master-0 bash[6898]: ++ cat tectonic/monitoring/prometheus-k8s.json
Apr 04 09:25:54 sur-master-0 bash[6898]: + curl -sSNL --cacert /tmp/tmp.srQHwtBFCU --cert /tmp/tmp.4BwFIypJLQ --key /tmp/tmp.NQe0ILJy2U --retry-connrefused --retry 3 --retry-delay 2 -H 'Content-Type: application/json' '-d{
Apr 04 09:25:54 sur-master-0 bash[6898]: "apiVersion": "monitoring.coreos.com/v1alpha1",
Apr 04 09:25:54 sur-master-0 bash[6898]: "kind": "Prometheus",
Apr 04 09:25:54 sur-master-0 bash[6898]: "metadata": {
Apr 04 09:25:54 sur-master-0 bash[6898]: "name": "k8s",
Apr 04 09:25:54 sur-master-0 bash[6898]: "namespace": "tectonic-system",
Apr 04 09:25:54 sur-master-0 bash[6898]: "labels": {
Apr 04 09:25:54 sur-master-0 bash[6898]: "prometheus": "k8s"
Apr 04 09:25:54 sur-master-0 bash[6898]: }
Apr 04 09:25:54 sur-master-0 bash[6898]: },
Apr 04 09:25:54 sur-master-0 bash[6898]: "spec": {
Apr 04 09:25:54 sur-master-0 bash[6898]: "replicas": 1,
Apr 04 09:25:54 sur-master-0 bash[6898]: "version": "v1.5.2",
Apr 04 09:25:54 sur-master-0 bash[6898]: "serviceAccountName": "prometheus-k8s",
Apr 04 09:25:54 sur-master-0 bash[6898]: "resources": {
Apr 04 09:25:54 sur-master-0 bash[6898]: "limits": {
Apr 04 09:25:54 sur-master-0 bash[6898]: "cpu": "400m",
Apr 04 09:25:54 sur-master-0 bash[6898]: "memory": "2000Mi"
Apr 04 09:25:54 sur-master-0 bash[6898]: },
Apr 04 09:25:54 sur-master-0 bash[6898]: "requests": {
Apr 04 09:25:54 sur-master-0 bash[6898]: "cpu": "200m",
Apr 04 09:25:54 sur-master-0 bash[6898]: "memory": "1500Mi"
Apr 04 09:25:54 sur-master-0 bash[6898]: }
Apr 04 09:25:54 sur-master-0 bash[6898]: }
Apr 04 09:25:54 sur-master-0 bash[6898]: }
Apr 04 09:25:54 sur-master-0 bash[6898]: }' https://sur-k8s.dev.coreos.systems:443/apis/monitoring.coreos.com/v1alpha1/namespaces/tectonic-system/prometheuses
Apr 04 09:25:54 sur-master-0 bash[6898]: {
Apr 04 09:25:54 sur-master-0 bash[6898]: "kind": "Status",
Apr 04 09:25:54 sur-master-0 bash[6898]: "apiVersion": "v1",
Apr 04 09:25:54 sur-master-0 bash[6898]: "metadata": {},
Apr 04 09:25:54 sur-master-0 bash[6898]: "status": "Failure",
Apr 04 09:25:54 sur-master-0 bash[6898]: "message": "the server could not find the requested resource",
Apr 04 09:25:54 sur-master-0 bash[6898]: "reason": "NotFound",
Apr 04 09:25:54 sur-master-0 bash[6898]: "details": {},
Apr 04 09:25:54 sur-master-0 bash[6898]: "code": 404
Apr 04 09:25:54 sur-master-0 bash[6898]: }+ exit 1
Apr 04 09:25:54 sur-master-0 bash[6898]: + rm -f /tmp/tmp.srQHwtBFCU /tmp/tmp.4BwFIypJLQ /tmp/tmp.NQe0ILJy2U
As discussed with @alexsomesan, the current plan is to move tectonic.sh to kubectl-based logic to avoid all of the above problems.
Sounds straightforward. Does kubectl have retry/wait logic in case a resource type is missing?
It seems not to have any, but just using kubectl will clean up the scripts a bit. Implementing retries around kubectl seems a bit more manageable afterwards.
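Something along these lines, as a sketch (kubectl_create_with_retry is a hypothetical helper; the attempt count and delay are arbitrary):

# Hypothetical helper: keep retrying `kubectl create` until the resource
# type is actually served by every API server behind the LB.
kubectl_create_with_retry() {
  local manifest="$1"
  local attempt
  for attempt in $(seq 1 10); do
    kubectl create -f "$manifest" && return 0
    echo "kubectl create -f ${manifest} failed (attempt ${attempt}), retrying..." >&2
    sleep 5
  done
  return 1
}

kubectl_create_with_retry tectonic/monitoring/prometheus-k8s.json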
xref https://github.com/kubernetes/kubernetes/issues/29002 (possibly related)