Closed: juan-lee closed this issue 5 years ago.
Is this not the same root cause as the performance degradation issue, which is also still open? https://github.com/Azure/AKS/issues/620
I saw this tagged from the TLS Handshake error thread — https://github.com/Azure/AKS/issues/14 Can anyone confirm this as one of the issues leading to the handshake timeout?
Would the environment-variable workaround above be a potential fix for the timeout issue?
@necevil I don't think this is related to #14. This issue is about calling kube-apiserver from inside of the cluster vs. outside.
Any ETA for a permanent fix?
I don't have an ETA yet, but the fix is actively being worked on. The permanent fix will be to apply the same mitigation automatically for customer pods. I recommend applying the mitigation if you're running into this issue until we can have the fix automatically applied.
Wait, are you saying the workaround from above is the proper fix for this issue, and you are not working on a solution to reliably access the master plane without being routed over the public internet?
That's correct @DenisBiondic, we will apply the same change automatically. Just to clarify, the traffic will be routed over the Azure backplane similar to how it works now, except we will bypass the hop through azureproxy (kube-svc-redirect).
@juan-lee Will tunnelfront be removed then?
This is affecting cert-manager too. Errors continue after deleting & recreating the pod.
Yeah, it affects every container that has a connection to the master plane (different operators, external-dns, ingress controllers, cert-manager, etc.)
@holmesb if you're installing cert-manager via Helm, you can add the following to your custom values.yaml (passed in with the -f flag):

```yaml
extraEnv:
- name: KUBERNETES_PORT_443_TCP_ADDR
  value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
- name: KUBERNETES_PORT
  value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
- name: KUBERNETES_PORT_443_TCP
  value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
- name: KUBERNETES_SERVICE_HOST
  value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
```
You can get your specific URL from the portal.
Note, tiller uses client-go, so it inherits the bugs client-go has, as was previously pointed out.
Also running into this problem using https://github.com/argoproj/argo-events. Currently using Azure AKS 1.11.3.
Does anyone know if there is a fix planned?
Workaround
For the pods/containers that see INTERNAL_ERROR in their logs, add the following environment variables to the container spec. Be sure to replace <your-fqdn-prefix> and <region> so the AKS kube-apiserver FQDN is correct.

```yaml
env:
- name: KUBERNETES_PORT_443_TCP_ADDR
  value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
- name: KUBERNETES_PORT
  value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
- name: KUBERNETES_PORT_443_TCP
  value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
- name: KUBERNETES_SERVICE_HOST
  value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
```
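If you have several deployments to touch, the same env block can be applied with `kubectl patch` instead of editing each one by hand. A small sketch that renders the strategic-merge patch JSON (the function names here are mine, not from this thread; the FQDN argument is your cluster's apiserver address):

```python
import json

def workaround_env(apiserver_fqdn):
    """The four env vars from the workaround, for a given apiserver FQDN."""
    tcp_url = "tcp://%s:443" % apiserver_fqdn
    return [
        {"name": "KUBERNETES_PORT_443_TCP_ADDR", "value": apiserver_fqdn},
        {"name": "KUBERNETES_PORT", "value": tcp_url},
        {"name": "KUBERNETES_PORT_443_TCP", "value": tcp_url},
        {"name": "KUBERNETES_SERVICE_HOST", "value": apiserver_fqdn},
    ]

def deployment_patch(container_name, apiserver_fqdn):
    """Strategic-merge patch body for `kubectl patch deployment <name> -p '<json>'`.

    Containers are merged by name, so only the named container's env is touched.
    """
    return json.dumps({
        "spec": {"template": {"spec": {"containers": [
            {"name": container_name, "env": workaround_env(apiserver_fqdn)}
        ]}}}
    })
```

The printed JSON can then be passed to `kubectl patch deployment <deployment-name> -n <namespace> -p '<json>'`.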
Is there a possibility to inject this into a tiller-deploy pod already running on my cluster, or does this have to be done at initialization time? Can I even pass these environment variables as parameters, or is this only possible by changing the YAML?
EDIT:
Never mind, I found out that you can add these lines to the deployment with `kubectl edit deployment <deployment-name> -n <namespace>`. Afterwards the pods will get updated.
BTW: using the standard editor (vi) caused problems for updating in my terminal, so using `KUBE_EDITOR="nano" kubectl edit deployment <deployment-name> -n <namespace>` might be a good idea if you are having issues too.
I'm running into the same issue.
I've tried the workaround described by @dkipping to no avail. Setting the environment variables on the `tiller-deploy` doesn't seem to work.
Edit:
I am also having trouble identifying the pod that is having the issues. So far I've only seen it happen in the logs of `kube-ingress-nginx-ingress-controller`. Setting the environment variables on the `tiller-deploy` doesn't feel right. I would expect these variables on the `kube-ingress-nginx-ingress-controller`. Then again, setting these variables in general just feels like a hack.
@evandervalk could you apply the values to the `tiller-deploy` deployment at least? Are they saved and still there when you print out the spec of `tiller-deploy` again?
If they got applied, could you check that they got applied to a (newly spawned) `tiller-deploy-<id>` pod too?
Edit:
Regarding your edit: you should set these values not on `tiller-deploy` but on `kube-ingress-nginx-ingress-controller` if the error only occurs in the logs of that pod, according to @juan-lee.
But if the error occurs during an install via helm/tiller, you should carefully check tiller's logs. If it occurs there, setting these values on `tiller-deploy` might be needed too.
Editing the deployment went as you described. I can edit the values in the `tiller-deploy` and it prints them out when describing the resource.
After editing the configuration it created a new pod with the earlier mentioned environment variables defined. So that also seems to work. I have the variables defined in the `tiller-deploy-<id>` pod.
However, after editing the configuration and making sure everything propagated, I am still seeing:

streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1677; INTERNAL_ERROR

multiple times on the `kube-ingress-nginx-ingress-controller`.
If I describe the pod for `kube-ingress-nginx-ingress-controller`, I do not see these environment variables.
I would assume these environment variables should be defined on `kube-ingress-nginx-ingress-controller` rather than `tiller-deploy`?
> I would assume these environment variables should be defined on `kube-ingress-nginx-ingress-controller` rather than `tiller-deploy`?

Yes, sounds like it. I guess you can't break things any more than they already are by trying to add the environment variables there.
I will try and report back.
Edit: I've removed the environment variables from `tiller-deploy` and added them to `kube-ingress-nginx-ingress-controller`. Monitoring the logs to see if I come across the error message again. So far it's looking good.
Edit 2: Setting the environment variables on the `kube-ingress-nginx-ingress-controller` solved my issues. Thank you @dkipping for taking the time to walk me through it.
Workaround works for me actually
The workaround of patching my Tiller deployment with those variables fixed my Helm issues.
For anyone wondering, the value you need to use "
@juan-lee Any update on when Microsoft will have this problem fixed?
Can we have a rough idea about when the fix will be released?
We can't update the Deployment definitions in some cases (https://github.com/gardener/gardener/issues/462) and we've been thinking about implementing a mutating admission webhook to inject these extra env vars automatically into certain pods (based on an annotation or some other pod properties). But it doesn't seem to be a trivial task, and I'm not sure if it's easy to obtain the external apiserver endpoint from within the cluster.
Also, if the fix is coming soon, all this work will be redundant, so having an estimate here would help us save time.
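For what it's worth, the JSONPatch such a webhook would return is fairly small. Below is a rough sketch of building the AdmissionReview response body, assuming the webhook plumbing (TLS server, MutatingWebhookConfiguration) already exists; the names and the placeholder FQDN are illustrative, not from any real webhook:

```python
import base64
import json

# Illustrative: the placeholders must be resolved to the cluster's real
# apiserver FQDN at deploy time (e.g. injected via the webhook's own config).
APISERVER_FQDN = "<your-fqdn-prefix>.hcp.<region>.azmk8s.io"
WORKAROUND_ENV = [
    {"name": "KUBERNETES_PORT_443_TCP_ADDR", "value": APISERVER_FQDN},
    {"name": "KUBERNETES_PORT", "value": "tcp://%s:443" % APISERVER_FQDN},
    {"name": "KUBERNETES_PORT_443_TCP", "value": "tcp://%s:443" % APISERVER_FQDN},
    {"name": "KUBERNETES_SERVICE_HOST", "value": APISERVER_FQDN},
]

def admission_response(uid, pod):
    """Build an AdmissionReview response patching every container's env.

    Caveat: JSONPatch "add" on an existing /env member replaces it, so a
    production webhook would need to merge with env vars already present.
    """
    containers = pod["spec"]["containers"]
    patch = [
        {"op": "add", "path": "/spec/containers/%d/env" % i, "value": WORKAROUND_ENV}
        for i in range(len(containers))
    ]
    return {
        "apiVersion": "admission.k8s.io/v1beta1",
        "kind": "AdmissionReview",
        "response": {
            "uid": uid,
            "allowed": True,
            "patchType": "JSONPatch",
            # The admission API expects the patch base64-encoded.
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```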
The fix has been merged and will roll out over the course of the next week. @bergerx
@juan-lee That’s great news. Can you keep us updated when the fix has been rolled out? I’m personally interested in the rollout for West Europe.
Amazing news! Please keep us posted for regions: EastUS and EastUS2. Thank you
@juan-lee if you could post any info about how we will know when our clusters are patched that would be super cool.
@strtdusty One way is to simply monitor for the log message (stream ID 1; INTERNAL_ERROR). We have multiple thousands of these log errors at the moment.
There must be a better way than to just look for a decrease in those error logs. If your fix involves universally changing `KUBERNETES_PORT_443_TCP_ADDR` etc., then we could exec into a pod and check the environment variable? There must be some definitive way of knowing.
Similarly, you can do `kubectl get pod <pod_name> -o jsonpath='{.spec.containers[*].env[?(@.name=="KUBERNETES_PORT_443_TCP_ADDR")].value}'`
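If you want to script that check across many pods, the same lookup can be done over `kubectl get pod <pod_name> -o json` output. A sketch (the function name is mine):

```python
def workaround_env_value(pod, name="KUBERNETES_PORT_443_TCP_ADDR"):
    """Return the value of env var `name` from any container in the pod, or None.

    `pod` is a dict parsed from `kubectl get pod <pod_name> -o json`.
    """
    for container in pod.get("spec", {}).get("containers", []):
        # `env` may be absent or null on containers without explicit env vars.
        for env in container.get("env") or []:
            if env.get("name") == name:
                return env.get("value")
    return None
```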
Thanks @andrew-dinunzio that seems like a good idea.
Update: the fix is rolling out, but we'll be testing it in production before making it broadly available. If everything goes as planned, we'll release the fix next week.
@juan-lee Is it possible to opt in for this fix before it's broadly available?
@apeschel and anyone else who wants to preview the fix: please send your subscription ID to aks-help@service.microsoft.com and reference this issue.
Thanks
@juan-lee wrote:
> Root cause: The configuration of azureproxy results in unexpected timeouts for outbound watches targeting kubernetes.svc.cluster.local.
I'm experiencing the streamwatcher errors in a cluster that does not have azureproxy deployed. They appear for example in the cluster autoscaler logs. Is that the same issue or a different one?
Note: you can get the API server like this:

```shell
kubectl config view -o jsonpath='{.clusters[?(@.name=="<your cluster>")].cluster.server}'
```
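The same lookup in script form, over `kubectl config view -o json` output (a sketch; the function name is mine):

```python
def cluster_server(config, cluster_name):
    """Return the apiserver URL for `cluster_name` from a parsed kubeconfig.

    `config` is a dict parsed from `kubectl config view -o json`.
    """
    for entry in config.get("clusters", []):
        if entry.get("name") == cluster_name:
            return entry.get("cluster", {}).get("server")
    return None
```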
@juan-lee A similar error also happens with Java-based apps if they use the Fabric8 Kubernetes Client library (including Spring Cloud Kubernetes based apps).
I was able to apply a similar container env variable for my apps like this:

```yaml
- name: KUBERNETES_MASTER
  value: "<your-fqdn-prefix>.hcp.<region>.azmk8s.io"
```

Should I create a separate issue, so a fix for that could also be done?
My fix preview results so far:
I've just done a couple of helm deployments which used to fail (often, but not always) before the fix. Also so far no timeout errors for cert-manager and external-dns.
Looks promising!
I can confirm that after deploying the fix on cluster autoscaler the errors are gone from the logs. I'm still seeing https://github.com/Azure/AKS/issues/704 but that might be unrelated.
@nphmuller curious about the preview fix ... was it Microsoft applied or did they give you a mutating webhook to apply manually? Wondering if we applied it and had problems what the rollback would look like.
@strtdusty It's a feature flag they set on AKS. For now only applied to new clusters. Not sure if the feature can be disabled manually.
I get the following error while trying to use the workaround mentioned by @juan-lee in the original post:

x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "ca")

Looks like the Azure control plane doesn't have the right certs, or the pod needs additional config?
It's been a while since the last "official" note now; is there any news on general availability of the fix? Also, will the permanent fix require us to recreate clusters?
@dharmeshkakadia I was asking about this the other day but didn’t get an answer. It seems that all the AKS apiserver certs are self-signed. I’m a little shocked that kubectl doesn’t provide a warning by default. Currently, any pod that connects to the apiserver will need to ignore certificate trust.
Hi all, I too am wondering when the fix for this is going to be rolled out and if we'll have to do anything to benefit from it. We have two, small AKS clusters but they're running business critical applications and seeing these errors coming from the nginx-ingress-controller makes me nervous.
See https://azure.microsoft.com/en-us/updates/watch-stream-error-fix/; we will be rolling this out to all clusters starting next week.
I imagine people using Istio will now need to create a ServiceEntry object to access the apiserver, since the FQDN won't be in Istio's service "registry".
@weinong : Good to know. Is this being rolled out region by region, and if so, is there a set sequence? I am particularly interested in when this will be rolled out in region westeurope.
West Europe will be later this week. It will be applied to all new clusters at first. Then we will roll it out to all existing clusters shortly after.
Hi, when will the fix be released for centralus and southeastasia?
Thanks in advance.
The fix has reached westeurope. Indeed, it has solved several issues. Thx.
Symptoms
Pods using the in-cluster config to perform a watch on a resource will see intermittent timeouts and the following error in the pod log:

streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1; INTERNAL_ERROR

If the client performing the watch isn't handling errors gracefully, applications can get into an inconsistent state. Impacted applications include, but are not limited to, nginx-ingress and tiller (helm).
A specific manifestation of this bug is the following error when attempting a helm deployment:

Error: watch closed before Until timeout
Root Cause
The configuration of azureproxy results in unexpected timeouts for outbound watches targeting kubernetes.svc.cluster.local.