Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1; INTERNAL_ERROR #676

Closed (juan-lee closed this issue 5 years ago)

juan-lee commented 6 years ago

Symptoms

Pods using the in-cluster config to perform a watch on a resource will see intermittent timeouts and the following error in the pod log.

streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1; INTERNAL_ERROR

If the client performing the watch isn't handling errors gracefully, applications can get into an inconsistent state. Impacted applications include, but are not limited to, nginx-ingress and tiller (helm).

A specific manifestation of this bug is the following error when attempting a helm deployment.

Error: watch closed before Until timeout

Root Cause

  1. The configuration of azureproxy results in unexpected timeouts for outbound watches targeting kubernetes.default.svc.cluster.local.
  2. On the client side, the client-go implementation of watch.Until does not handle intermittent network failures gracefully. See watch closed before Until timeout, Fix broken watches, and Fix waiting in kubectl rollout status for more details.

Workaround

For the pods/containers that see the INTERNAL_ERROR in their logs, add the following environment variables to the container spec. Be sure to replace <your-fqdn-prefix> and <region> so the AKS kube-apiserver FQDN is correct.

spec:
  containers:
  - env:
    - name: KUBERNETES_PORT_443_TCP_ADDR
      value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
    - name: KUBERNETES_PORT
      value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
    - name: KUBERNETES_PORT_443_TCP
      value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
    - name: KUBERNETES_SERVICE_HOST
      value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
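To fill in the placeholders, the cluster's API server FQDN can be looked up with the Azure CLI, the same query m1o1 uses further down in this thread; the cluster name and resource group below are placeholders for your own values:

# Print the AKS API server FQDN, e.g. <your-fqdn-prefix>.hcp.<region>.azmk8s.io
az aks show --name <cluster-name> --resource-group <resource-group> \
  --query fqdn --output tsv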

LanceTheDev commented 5 years ago

Still getting the issue:

2018-11-12T07:49:47.964622Z error Unable to decode an event from the watch stream: stream error: stream ID 1529; INTERNAL_ERROR
2018-11-12T07:49:47.964882Z error Unable to decode an event from the watch stream: stream error: stream ID 1519; INTERNAL_ERROR

Do I need to reinstall Istio for this fix to work?

alibengtsson commented 5 years ago

Still getting the issue:

2018-11-12T07:49:47.964622Z error Unable to decode an event from the watch stream: stream error: stream ID 1529; INTERNAL_ERROR
2018-11-12T07:49:47.964882Z error Unable to decode an event from the watch stream: stream error: stream ID 1519; INTERNAL_ERROR

Do I need to reinstall Istio for this fix to work?

Try restarting/deleting only the Envoy ingress gateway, delete/restart your pods, and then check the logs in the ingress controller again.
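A rough sketch of that restart with kubectl, assuming Istio's default istio-system namespace and the usual app=istio-ingressgateway label; adjust both for your install, and the application namespace and label are placeholders:

# Delete the ingress gateway pods; their deployment recreates them
kubectl -n istio-system delete pod -l app=istio-ingressgateway

# Delete the affected application pods so they restart with fresh connections
kubectl -n <your-namespace> delete pod -l app=<your-app>

# Re-check the ingress controller logs for the INTERNAL_ERROR messages
kubectl -n istio-system logs -l app=istio-ingressgateway --tail=100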

wutingbupt commented 5 years ago

Hi,

Do we need to do anything for our cluster, or will it be applied automatically?

Br, Tim

andig commented 5 years ago

As I understood it:

@strtdusty It's a feature flag they set on AKS. For now only applied to new clusters. Not sure if the feature can be disabled manually.

So you'll have to recreate the cluster.

mmosttler commented 5 years ago

@adinunzio84 You mentioned that with this fix there may need to be an additional ServiceEntry needed for Istio. We are having the issue described and also have Istio installed so I am curious if you have the fix and what ServiceEntry you had to add?

juan-lee commented 5 years ago

Hi,

Do we need to do anything for our cluster, or will it be applied automatically?

Br, Tim

The fix is no longer behind a feature flag. All new clusters will get it automatically. Existing clusters will need to do a scale or upgrade to get the fix.
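For an existing cluster, a hedged example of triggering that scale or upgrade with the Azure CLI; the resource group, cluster name, node count, and version below are placeholders, not values from this thread:

# Option 1: scale the node pool (per the comment above, a scale picks up the fix)
az aks scale --resource-group <resource-group> --name <cluster-name> --node-count <node-count>

# Option 2: upgrade to a newer supported Kubernetes version
az aks upgrade --resource-group <resource-group> --name <cluster-name> --kubernetes-version <version>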

wutingbupt commented 5 years ago

@juan-lee Thanks for your reply. We rebuilt our cluster yesterday; however, I can still see this problem this morning. We are in westeurope, and the cluster version is 1.11.3.

Br, Tim

alibengtsson commented 5 years ago

Hi, I redeployed with Terraform in the westeurope and centralus regions with the nginx ingress controller. I can confirm from my side that the messages are gone from the log file. I used version 1.11.2.

Thanks for the fix.

juan-lee commented 5 years ago

@juan-lee Thanks for your reply. We rebuilt our cluster yesterday; however, I can still see this problem this morning. We are in westeurope, and the cluster version is 1.11.3.

Br, Tim

Can you provide some more details? Does the pod in question have the appropriate KUBERNETES_ environment variables set?
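One way to check that, sketched with kubectl (pod name and namespace are placeholders): list the environment variables declared on each container in the pod spec and confirm the KUBERNETES_ workaround entries are there.

# Print each container name followed by the env var names declared in its spec
kubectl -n <namespace> get pod <pod-name> \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.env[*].name}{"\n"}{end}'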

m1o1 commented 5 years ago

@mmosttler Not fully tested, but this kinda works:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: azmk8s-ext
  namespace: default
spec:
  hosts:
  - ${FQDN}
  location: MESH_EXTERNAL
  ports:
  - name: https
    number: 443
    protocol: HTTPS
  resolution: DNS

I did cat serviceentry.yml | envsubst | kubectl apply -f - with FQDN=$(az aks show -n ${CLUSTER_NAME} -g ${CLUSTER_RG} --query "fqdn" --output tsv)
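Spelled out end to end, under the assumption that serviceentry.yml is the manifest above and that FQDN is exported so envsubst can see it:

export CLUSTER_NAME=<cluster-name>
export CLUSTER_RG=<resource-group>

# Look up the API server FQDN and make it visible to envsubst
export FQDN=$(az aks show -n ${CLUSTER_NAME} -g ${CLUSTER_RG} --query "fqdn" --output tsv)

# Render ${FQDN} into the ServiceEntry and apply it
cat serviceentry.yml | envsubst | kubectl apply -f -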

This is the closest I got to it working. The JavaScript Kubernetes client is happy with this, but the Go client is not.

Please let me know if you come up with something better, though.

DenisBiondic commented 5 years ago

I can confirm that the problem is gone from our clusters. (west EU)

weinong commented 5 years ago

Thanks for reporting back. I'm closing the issue for now.

dharmeshkakadia commented 5 years ago

I am still seeing this issue even this week in east us.

juan-lee commented 5 years ago

I am still seeing this issue even this week in east us.

Can you elaborate on your scenario? Also, keep in mind that pods will need to be restarted in order to get the fix. You can check whether a pod has the fix by seeing if the KUBERNETES_PORT, etc., environment variables are set for each container.
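For example, a quick check from inside a running container (pod name and namespace are placeholders); if the fix or the workaround above is in place, the assumption is that these variables point at the <your-fqdn-prefix>.hcp.<region>.azmk8s.io address rather than the in-cluster service IP:

# Dump the kubernetes service environment variables seen by the (default) container
kubectl -n <namespace> exec <pod-name> -- printenv | grep '^KUBERNETES_'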