emissary-ingress / emissary

open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

Observing "no healthy upstream" for new deployments until ambassador pods restarted #3324

Closed · coala-svn closed this issue 3 years ago

coala-svn commented 3 years ago

Description of the problem

I am facing a very strange problem. Our IT department wants us to migrate our application testing pipeline to a new cluster. After deploying Ambassador with Helm (originally 1.12.0), I tested the deployments of our applications: all the deployments were successful, but on accessing the applications I constantly got a "no healthy upstream" error (the same deployments work in the old cluster).

At some point I learned that 1.12.1 had been released and upgraded Ambassador to 1.12.1 with "helm upgrade". After that, all of the previously broken application deployments started working without any additional changes, but every new deployment hit the same "no healthy upstream" error. Eventually Ambassador was upgraded to 1.12.2 with the same effect: the previously broken deployments started working without any changes, and every new deployment returned "no healthy upstream".

Investigating connectivity confirmed that the application is reachable with curl from the Ambassador pod, both through the application's Service and directly against the application pod. However, external requests to the application always ended with "no healthy upstream".
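
For reference, the connectivity check from inside the Ambassador pod was roughly the following (the service name, namespace, port, and pod IP below are placeholders, not the real values from our environment):

$ kubectl -n ambassador exec -it deploy/ambassador -- curl -sv http://my-app-service.my-namespace.svc.cluster.local:8080/
$ kubectl -n ambassador exec -it deploy/ambassador -- curl -sv http://10.42.0.123:8080/    # directly against the app pod IP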

Now, if the Ambassador pod is killed (the replica count was reduced to 1 to simplify log analysis) and the Deployment/ReplicaSet replaces it with a new pod, the issue is resolved: all non-working deployments start working (this was tested 3 times).
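
For completeness, the pod-restart workaround is simply something like this (assuming the default labels from the Ambassador Helm chart; the Deployment then recreates the pod):

$ kubectl -n ambassador delete pod -l app.kubernetes.io/name=ambassador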

Details on the current deployment:

$ helm -n ambassador list
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
ambassador      ambassador      14              2021-03-31 10:20:18.8370383 -0400 EDT   deployed        ambassador-6.6.2        1.12.2

Is there something that I might be missing during the deployment of Ambassador?

Expected behavior

All new application deployments start working without needing to restart the Ambassador pods.

Versions: Ambassador 1.12.2 (chart ambassador-6.6.2); see the "helm list" output above.

Additional context

None. I am not sure whether this is a bug or not. I would appreciate any workaround for our environment.

coala-svn commented 3 years ago

After downgrading Ambassador to 1.11.2 the issue is no longer reproducible, so it looks like an issue in 1.12.x.

rdmoore commented 3 years ago

We are also seeing a similar (possibly the same) problem with Ambassador 1.12.1 deployed in our non-production environment. A fresh Ambassador pod works like a champ, but subsequent changes do not seem to be reflected in the Envoy configuration, even for events that should trigger a reconfiguration.

In researching the issue, it looks like 1.12.1 switched to using EDS. Is it possible that the EDS service is not reflecting cluster changes?

rhysm commented 3 years ago

Also facing a similar issue. 1.12.0 was my first use of Ambassador and I thought I had misconfigured something. Rolling back to 1.11.2 resolved the issue.

rdmoore commented 3 years ago

I started monitoring the snapshots/snapshot.yaml file while performing a rollout restart of a deployment. Ambassador creates the correct information in this file at startup, but the file does not get updated with new endpoint IPs when I roll a deployment. The Ambassador documentation indicates that this is likely a configuration issue of some sort.
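
In case it helps anyone reproduce this, the check was roughly as follows (the in-pod snapshot path /ambassador/snapshots/snapshot.yaml and the resource names are assumptions based on a default 1.x install, so adjust to your setup):

$ kubectl -n my-namespace rollout restart deployment/my-app
$ kubectl -n ambassador exec deploy/ambassador -- grep -A3 my-app /ambassador/snapshots/snapshot.yaml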

rhs commented 3 years ago

Can you post the Mapping resources for which you are experiencing this behavior?

coala-svn commented 3 years ago

Here is an example of mapping:

$ kubectl -n kangaroo277id100006 describe mapping ambassador-ms-service-0

Name:         ambassador-ms-service-0
Namespace:    kangaroo277id100006
Labels:       app.kubernetes.io/managed-by=Helm
Annotations:  getambassador.io/resource-changed: true
              meta.helm.sh/release-name: amb-rules
              meta.helm.sh/release-namespace: kangaroo277id100006
API Version:  getambassador.io/v2
Kind:         Mapping
Metadata:
  Creation Timestamp:  2021-04-08T19:53:49Z
  Generation:          1
  Managed Fields:
    API Version:  getambassador.io/v2
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:getambassador.io/resource-changed:
          f:meta.helm.sh/release-name:
          f:meta.helm.sh/release-namespace:
        f:labels:
          .:
          f:app.kubernetes.io/managed-by:
      f:spec:
        .:
        f:connect_timeout_ms:
        f:host:
        f:idle_timeout_ms:
        f:load_balancer:
          .:
          f:cookie:
            .:
            f:name:
            f:path:
            f:ttl:
          f:policy:
        f:prefix:
        f:resolver:
        f:rewrite:
        f:service:
        f:timeout_ms:
    Manager:         Go-http-client
    Operation:       Update
    Time:            2021-04-08T19:53:49Z
  Resource Version:  7903038
  Self Link:         /apis/getambassador.io/v2/namespaces/kangaroo277id100006/mappings/ambassador-ms-service-0
  UID:               6653b0e3-3d3c-4260-a945-284e807a66f7
Spec:
  connect_timeout_ms:  6000
  Host:                wcdp-windchill-kangaroo277id100006.rd-plm-devops.bdns.ptc.com
  idle_timeout_ms:     5000000
  load_balancer:
    Cookie:
      Name:    sticky-cookie-0
      Path:    /Windchill
      Ttl:     600s
    Policy:    ring_hash
  Prefix:      /Windchill
  Resolver:    endpoint
  Rewrite:     /Windchill
  Service:     ms-service-kangaroo-0.kangaroo277id100006.svc.cluster.local:8080
  timeout_ms:  0
Events:        <none>

rhs commented 3 years ago

I believe that if you drop the .svc.cluster.local suffix from the service name, it should fix the problem. When you are using the Kubernetes endpoint routing resolver, the service field refers directly to a Kubernetes resource, not to a DNS name. The .svc.cluster.local suffix is added by the Kubernetes DNS server, so it is a bit odd to use it when you aren't doing a DNS lookup.

That said, this is still a bug, because we used to allow that form: (a) we shouldn't disallow it without a deprecation period, and (b) we should at least be logging it as an error.
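
To make the suggested change concrete, a sketch of the Mapping above with the suffix dropped might look like this (timeouts omitted for brevity; illustrative only, not an official example):

apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: ambassador-ms-service-0
  namespace: kangaroo277id100006
spec:
  host: wcdp-windchill-kangaroo277id100006.rd-plm-devops.bdns.ptc.com
  prefix: /Windchill
  rewrite: /Windchill
  resolver: endpoint
  load_balancer:
    policy: ring_hash
    cookie:
      name: sticky-cookie-0
      path: /Windchill
      ttl: 600s
  # With the endpoint resolver, reference the Kubernetes Service directly,
  # without the .svc.cluster.local DNS suffix:
  service: ms-service-kangaroo-0.kangaroo277id100006:8080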

rdmoore commented 3 years ago

Thanks! That appears to be exactly my issue. Re-reading the documentation, I see that using the DNS name is not recommended; there is only a note that it might work. Why did I not notice this previously? I have tested this change successfully with a few Mapping files.

coala-svn commented 3 years ago

@rhs - thanks for letting us know about a workaround. However, the problem here is that we were specifically told by someone from the Dataware team to use the .svc.cluster.local suffix for the service when they helped us update our application deployment for the Ambassador integration (I was not part of that discussion and only learned about the recommendation today, when we internally discussed testing the potential workaround).

illinar commented 3 years ago

We observed very similar behavior, except that the "no healthy upstream" error went away once the Mapping was re-loaded. We tried removing the "http://" prefix from the service name, as was suggested in Slack for a similar situation, and it seemed to do the trick. But it is unclear what the underlying cause is and what the correct way of specifying Mappings is to avoid this sort of scenario.
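
For clarity, the change we made amounts to something like this in the Mapping spec (the service name below is a placeholder):

  # before (triggered "no healthy upstream" for us):
  service: http://my-service.my-namespace:8080
  # after:
  service: my-service.my-namespace:8080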

khussey commented 3 years ago

This is fixed in Ambassador 1.13.0, which is now available.

coala-svn commented 3 years ago

Confirmed.

Thank you guys for the prompt fix!

wissam-launchtrip commented 3 years ago

We are still noticing this behavior on 1.13.5. Please advise.

esmet commented 3 years ago

@wissam-launchtrip can you go into a bit more detail? Are you seeing this exact issue or something similar? Anything that can help us verify the report and reproduce the issue would help us work toward a possible fix 👍

wissam-launchtrip commented 3 years ago

No, actually it's a different issue. Upstream services get disconnected for no clear reason, and we get the "no healthy upstream" error. This happens a few hours after the last deployment in the cluster. If we make a new deployment in the cluster, the error disappears.