Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1; INTERNAL_ERROR #676

Closed · juan-lee closed this issue 5 years ago

juan-lee commented 5 years ago

Symptoms

Pods using the in-cluster config to perform a watch on a resource will see intermittent timeouts and the following error in the pod log.

streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1; INTERNAL_ERROR

If the client performing the watch isn't handling errors gracefully, applications can get into an inconsistent state. Impacted applications include, but are not limited to, nginx-ingress and tiller (helm).

A specific manifestation of this bug is the following error when attempting a helm deployment.

Error: watch closed before Until timeout
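
To check whether a given workload is affected, a minimal sketch is to grep its recent logs for the error (namespace and pod name are placeholders; point it at anything that watches the apiserver, such as an ingress controller or tiller pod):

# scan recent log lines of a suspect pod for the streamwatcher error
kubectl logs -n <namespace> <pod-name> --tail=500 | grep "stream error: stream ID"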

Root Cause

  1. The configuration of azureproxy results in unexpected timeouts for outbound watches targeting kubernetes.default.svc.cluster.local.
  2. On the client side, the client-go implementation of watch.Until does not handle intermittent network failures gracefully. See watch closed before Until timeout, Fix broken watches, and Fix waiting in kubectl rollout status for more details.

Workaround

For the pods/containers that see INTERNAL_ERROR in their logs, add the following environment variables to the container spec. Be sure to replace <your-fqdn-prefix> and <region> so the AKS kube-apiserver FQDN is correct.

spec:
  containers:
  - name: <container-name>
    env:
    - name: KUBERNETES_PORT_443_TCP_ADDR
      value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
    - name: KUBERNETES_PORT
      value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
    - name: KUBERNETES_PORT_443_TCP
      value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
    - name: KUBERNETES_SERVICE_HOST
      value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
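
To confirm the variables actually land in a running container, a quick check along these lines (namespace and pod name are placeholders) should show the overridden values:

# print the Kubernetes service-discovery variables seen by the container
kubectl exec -n <namespace> <pod-name> -- env | grep '^KUBERNETES_'
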
DenisBiondic commented 5 years ago

Is this not the same root cause as the performance degradation issue, which is also still open? https://github.com/Azure/AKS/issues/620

necevil commented 5 years ago

I saw this tagged from the TLS handshake error thread (https://github.com/Azure/AKS/issues/14). Can anyone confirm this as one of the issues leading to the handshake timeout?

Would the above environment-variable workaround also be a potential fix for the timeout issue?

juan-lee commented 5 years ago

I saw this tagged from the TLS handshake error thread (#14). Can anyone confirm this as one of the issues leading to the handshake timeout?

Would the above environment-variable workaround also be a potential fix for the timeout issue?

@necevil I don't think this is related to #14. This issue is about calling kube-apiserver from inside of the cluster vs. outside.

digeler commented 5 years ago

Any ETA for a permanent fix?

juan-lee commented 5 years ago

Any ETA for a permanent fix?

I don't have an ETA yet, but the fix is actively being worked on. The permanent fix will be to apply the same mitigation automatically for customer pods. Until the fix is applied automatically, I recommend applying the mitigation yourself if you're running into this issue.

DenisBiondic commented 5 years ago

Wait, are you saying the workaround above is the proper fix for this issue, and that you are not working on a solution to reliably access the control plane without being routed over the public internet?

juan-lee commented 5 years ago

Wait, are you saying the workaround above is the proper fix for this issue, and that you are not working on a solution to reliably access the control plane without being routed over the public internet?

That's correct @DenisBiondic, we will apply the same change automatically. Just to clarify, the traffic will be routed over the Azure backplane much as it is today, except we will bypass the hop through azureproxy (kube-svc-redirect).

DenisBiondic commented 5 years ago

@juan-lee Will tunnelfront be removed then?

holmesb commented 5 years ago

This is affecting cert-manager too. Errors continue after deleting & recreating the pod.

DenisBiondic commented 5 years ago

Yeah, it affects every container that has a connection to the control plane (various operators, external-dns, ingress controllers, cert-manager, etc.).

mattfarina commented 5 years ago

@holmesb if you're installing cert-manager via Helm you can add the following to your custom values.yaml (passed in with the -f flag):

extraEnv:
- name: KUBERNETES_PORT_443_TCP_ADDR
  value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
- name: KUBERNETES_PORT
  value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
- name: KUBERNETES_PORT_443_TCP
  value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
- name: KUBERNETES_SERVICE_HOST
  value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io

You can get your specific URL from the portal.

Note: Tiller uses client-go, so it inherits client-go's bugs, as was previously pointed out.
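
For reference, a sketch of how such a values file might be passed; the chart name (stable/cert-manager), release name, and namespace are assumptions, not something confirmed in this thread:

# install or upgrade cert-manager with the extraEnv overrides from values.yaml
helm upgrade --install cert-manager stable/cert-manager --namespace kube-system -f values.yaml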

ttelfer commented 5 years ago

Also running into this problem using https://github.com/argoproj/argo-events. Currently using Azure AKS 1.11.3.

Does anyone know if there is a fix planned?

dkipping commented 5 years ago

Workaround: For the pods/containers that see INTERNAL_ERROR in their logs, add the following environment variables to the container spec. Be sure to replace <your-fqdn-prefix> and <region> so the AKS kube-apiserver FQDN is correct.

        env:
        - name: KUBERNETES_PORT_443_TCP_ADDR
          value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
        - name: KUBERNETES_PORT
          value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
        - name: KUBERNETES_PORT_443_TCP
          value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
        - name: KUBERNETES_SERVICE_HOST
          value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io

Is there a possibility to inject this into a tiller-deploy pod already running on my cluster, or does this have to be done at initialisation time? Can I even pass these environment variables as parameters, or is this only possible by changing the YAML?

EDIT:

Never mind, I found out that you have to add these lines to the deployment with kubectl edit deployment <deployment-name> -n <namespace>. Afterwards the pods will get updated. By the way: the standard editor (vi) caused problems for updating in my terminal, so using KUBE_EDITOR="nano" kubectl edit deployment <deployment-name> -n <namespace> might be a good idea if you are having issues too.
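
As an alternative to editing the YAML by hand, a sketch using kubectl set env should add the same variables to an existing deployment and trigger a rolling update (deployment name, namespace, and FQDN are placeholders as above):

# inject the apiserver FQDN overrides into an existing deployment
kubectl set env deployment/<deployment-name> -n <namespace> \
  KUBERNETES_PORT_443_TCP_ADDR=<your-fqdn-prefix>.hcp.<region>.azmk8s.io \
  KUBERNETES_PORT=tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443 \
  KUBERNETES_PORT_443_TCP=tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443 \
  KUBERNETES_SERVICE_HOST=<your-fqdn-prefix>.hcp.<region>.azmk8s.io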

evandervalk commented 5 years ago

I'm running into the same issue.

I've tried the workaround described by @dkipping to no avail. Setting the environment variables on the tiller-deploy doesn't seem to work.

Edit: I am also having trouble identifying the pod that is having the issues. So far I've only seen it happen in the logs of kube-ingress-nginx-ingress-controller. Setting the environment variables on the tiller-deploy doesn't feel right. I would expect these variables on the kube-ingress-nginx-ingress-controller. Then again setting these variables in general just feels like a hack.

dkipping commented 5 years ago

@evandervalk were you at least able to apply the values to the tiller-deploy deployment? That is, are they saved and still there when you print out the specs of tiller-deploy again?

If they got applied, could you check that they got applied to a (newly spawned) tiller-deploy-<id> pod too?

Edit: Regarding your edit: according to @juan-lee, you should set these values not on tiller-deploy but on kube-ingress-nginx-ingress-controller if the error only occurs in that pod's logs. But if the error occurs during installs via Helm/Tiller, you should carefully check Tiller's logs. If it shows up there, setting these values on tiller-deploy might be needed too.

evandervalk commented 5 years ago

Editing the deployment went as you described. I can edit the values in the tiller-deploy and it prints them out when describing the resource.

After editing the configuration it created a new pod with the earlier-mentioned environment variables defined. So that also seems to work. I have the variables defined in the tiller-deploy-<id> pod.

However, after editing the configuration and making sure everything propagated, I am still seeing: streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1677; INTERNAL_ERROR multiple times on the kube-ingress-nginx-ingress-controller.

If I describe the pod for kube-ingress-nginx-ingress-controller I do not see these environment variables.

I would assume these environment variables should be defined on kube-ingress-nginx-ingress-controller rather than tiller-deploy?

dkipping commented 5 years ago

I would assume these environment variables should be defined on kube-ingress-nginx-ingress-controller rather than tiller-deploy?

Yes, it sounds like it. I guess the worst that can happen from adding the environment variables there is that the pod doesn't run, so there isn't much more to break.

evandervalk commented 5 years ago

I will try and report back.

Edit: I've removed the environment variables from tiller-deploy and added them to kube-ingress-nginx-ingress-controller. Monitoring the logs to see if I come across the error message again. So far it's looking good.

Edit 2: Setting the environment variables on the kube-ingress-nginx-ingress-controller solved my issues. Thank you @dkipping for taking the time to walk me through it.

wutingbupt commented 5 years ago

Workaround works for me actually

jonathanjfshaw commented 5 years ago

The workaround of patching my Tiller deployment with those variables fixed my Helm issues.

For anyone wondering, the value you need to use (<your-fqdn-prefix>.hcp.<region>.azmk8s.io) can be found on the overview page of the Azure UI for your cluster, labelled "API server address".

ttelfer commented 5 years ago

@juan-lee Any update on when Microsoft will have this problem fixed?

bergerx commented 5 years ago

Can we have a rough idea of when the fix will be released?

We can't update the Deployment definitions in some cases (https://github.com/gardener/gardener/issues/462), and we've been thinking about implementing a mutating admission webhook to inject these extra env vars automatically into certain pods (based on an annotation or some other pod properties). But it doesn't seem to be a trivial task, and I'm not sure if it's easy to obtain the external apiserver endpoint from within the cluster.

Also, if the fix is coming soon, all this work will be redundant, so having an estimate here could help us save time.

juan-lee commented 5 years ago

Can we have a rough idea of when the fix will be released?

We can't update the Deployment definitions in some cases (gardener/gardener#462), and we've been thinking about implementing a mutating admission webhook to inject these extra env vars automatically into certain pods (based on an annotation or some other pod properties). But it doesn't seem to be a trivial task, and I'm not sure if it's easy to obtain the external apiserver endpoint from within the cluster.

Also, if the fix is coming soon, all this work will be redundant, so having an estimate here could help us save time.

The fix has been merged and will roll out over the course of the next week. @bergerx

nphmuller commented 5 years ago

@juan-lee That’s great news. Can you keep us updated when the fix has been rolled out? I’m personally interested in the rollout for West Europe.

agolomoodysaada commented 5 years ago

Amazing news! Please keep us posted for regions: EastUS and EastUS2. Thank you

strtdusty commented 5 years ago

@juan-lee if you could post any info about how we will know when our clusters are patched, that would be super cool.

DenisBiondic commented 5 years ago

@strtdusty One way is to simply monitor for the log message (stream ID 1; INTERNAL_ERROR). We have multiple thousands of these log errors at the moment.

strtdusty commented 5 years ago

There must be a better way than just looking for a decrease in those error logs. If your fix involves universally changing KUBERNETES_PORT_443_TCP_ADDR etc., then we could exec into a pod and check the environment variable? There must be some definitive way of knowing.

m1o1 commented 5 years ago

Similarly, you can do:

kubectl get pod <pod_name> -o jsonpath='{.spec.containers[*].env[?(@.name=="KUBERNETES_PORT_443_TCP_ADDR")].value}'
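
To check every pod at once, a sketch that prints the value per pod (an empty value means the override is not set):

# list namespace/pod and the KUBERNETES_PORT_443_TCP_ADDR value for each container
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.containers[*].env[?(@.name=="KUBERNETES_PORT_443_TCP_ADDR")].value}{"\n"}{end}'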

strtdusty commented 5 years ago

Thanks @andrew-dinunzio that seems like a good idea.

juan-lee commented 5 years ago

Update: the fix is rolling out, but we'll be testing it in production before making it broadly available. If everything goes as planned, we'll release the fix next week.

apeschel commented 5 years ago

@juan-lee Is it possible to opt in for this fix before it's broadly available?

weinong commented 5 years ago

@apeschel and to anyone who wants to preview the fix, please send your subscription ID to aks-help@service.microsoft.com and reference this issue.

Thanks

andig commented 5 years ago

@juan-lee wrote:

Root cause: The configuration of azureproxy results in unexpected timeouts for outbound watches targeting kubernetes.default.svc.cluster.local.

I'm experiencing the streamwatcher errors in a cluster that does not have azureproxy deployed. They appear for example in the cluster autoscaler logs. Is that the same issue or a different one?

andig commented 5 years ago

Note: you can get the API server like this:

kubectl config view -o jsonpath='{.clusters[?(@.name=="<your cluster>")].cluster.server}'
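
Alternatively, a sketch using the Azure CLI (resource group and cluster name are placeholders):

# print the cluster's apiserver FQDN
az aks show -g <resource-group> -n <cluster-name> --query fqdn -o tsv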

hazsetata commented 5 years ago

@juan-lee A similar error also happens with Java-based apps if they use the Fabric8 Kubernetes Client library (including Spring Cloud Kubernetes based apps).

I was able to apply a similar container env variable for my apps like this:

- name: KUBERNETES_MASTER
  value: "<your-fqdn-prefix>.hcp.<region>.azmk8s.io"

Should I create a separate issue so a fix for that could also be done?

nphmuller commented 5 years ago

My fix preview results so far:

I've just done a couple of helm deployments which used to fail (often, but not always) before the fix. Also so far no timeout errors for cert-manager and external-dns.

Looks promising!

andig commented 5 years ago

I can confirm that after deploying the fix on cluster autoscaler the errors are gone from the logs. I'm still seeing https://github.com/Azure/AKS/issues/704 but that might be unrelated.

strtdusty commented 5 years ago

@nphmuller Curious about the preview fix: was it applied by Microsoft, or did they give you a mutating webhook to apply manually? Wondering what the rollback would look like if we applied it and had problems.

nphmuller commented 5 years ago

@strtdusty It's a feature flag they set on AKS. For now only applied to new clusters. Not sure if the feature can be disabled manually.

dharmeshkakadia commented 5 years ago

I get the following error while trying to use the workaround mentioned by @juan-lee in the original post.

x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "ca")

Looks like the Azure control plane doesn't have the right certs, or the pod needs additional config?
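
One way to see which authority signs the certificate presented by the apiserver FQDN is to inspect it directly; a sketch assuming openssl is available (FQDN placeholder as above):

# show the issuer and subject of the certificate served on port 443
openssl s_client -connect <your-fqdn-prefix>.hcp.<region>.azmk8s.io:443 </dev/null 2>/dev/null | openssl x509 -noout -issuer -subject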

fhoy commented 5 years ago

It's been a while since the last "official" note now; is there any news on general availability of the fix? Also, will the permanent fix require us to recreate clusters?

mbrancato commented 5 years ago

@dharmeshkakadia I was asking about this the other day but didn’t get an answer. It seems that all the AKS apiserver certs are self-signed. I’m a little shocked that kubectl doesn’t provide a warning by default. Currently, any pod that connects to the apiserver will need to ignore certificate trust.

erewok commented 5 years ago

Hi all, I too am wondering when the fix for this is going to be rolled out and if we'll have to do anything to benefit from it. We have two small AKS clusters, but they're running business-critical applications, and seeing these errors coming from the nginx-ingress-controller makes me nervous.

weinong commented 5 years ago

We will be rolling this out to all clusters starting next week; see https://azure.microsoft.com/en-us/updates/watch-stream-error-fix/.

m1o1 commented 5 years ago

I imagine people using Istio will now need to create a ServiceEntry object to access the apiserver, since the FQDN won't be in Istio's service "registry".
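
A sketch of what such a ServiceEntry might look like; the resource name is made up, and the fields follow Istio 1.x v1alpha3 conventions (FQDN placeholder as above):

# allow mesh workloads to reach the external apiserver FQDN over HTTPS
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: aks-apiserver
spec:
  hosts:
  - <your-fqdn-prefix>.hcp.<region>.azmk8s.io
  location: MESH_EXTERNAL
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  resolution: DNS
EOF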

fhoy commented 5 years ago

@weinong: Good to know. Is this being rolled out region by region, and if so, is there a set sequence? I am particularly interested in when this will be rolled out in the westeurope region.

weinong commented 5 years ago

West Europe will be later this week. It will be applied to all new clusters at first. Then we will roll it out to all existing clusters shortly after.

alibengtsson commented 5 years ago

Hi, when will the fix be released for centralus and southeastasia?

Thanks in advance.

martinhartig commented 5 years ago

The fix has reached westeurope. Indeed, it has solved several issues. Thanks.