After downgrading Ambassador to 1.11.2 the issue is not reproducible, so it looks like an issue introduced in 1.12.x.
We are also seeing a similar (same?) problem with Ambassador 1.12.1 deployed into our nonprod environment. A fresh Ambassador pod works like a champ, but changes don't seem to be reflected in the Envoy configuration. Example events that don't seem to cause a reconfiguration:
rollout restart
In all cases, a rollout restart of Ambassador resolves the issue. In researching the issue, it looks like 1.12.1 switched to using EDS. Is it possible that the EDS service is not reflecting cluster changes?
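One rough way to sanity-check the EDS theory (a sketch, assuming a default install where Envoy's admin interface listens on localhost:8001 inside the Ambassador pod; the namespace and service names below are placeholders) is to compare the endpoints Envoy currently holds with what Kubernetes reports:

# Placeholder names; adjust namespaces and service to your environment.
# Endpoints Envoy currently has for the upstream cluster:
$ kubectl -n ambassador exec deploy/ambassador -- curl -s localhost:8001/clusters | grep my-service
# Endpoints Kubernetes reports for the same service:
$ kubectl -n my-namespace get endpoints my-service -o wide

If the two diverge after a rollout, that would point at the EDS/endpoint-watching path rather than at the workload itself.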
Also facing a similar issue. 1.12.0 was my first use of Ambassador and I thought I had misconfigured something. Rolling back to 1.11.2 resolved the issue.
I started monitoring the snapshots/snapshot.yaml file while performing a rollout restart of a deployment. Ambassador creates the correct information in this file at startup, but it does not get updated with new endpoint IPs when I roll a deployment. The Ambassador documentation indicates that this is likely a configuration issue of some sort.
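For anyone wanting to repeat this check, something along these lines should work (assuming the snapshot is written to /ambassador/snapshots/snapshot.yaml inside the Ambassador pod; adjust the path to wherever your install writes it, and the pod/service/deployment names are placeholders):

# Inspect the endpoints recorded in the snapshot for a given service.
$ kubectl -n ambassador exec -it <ambassador-pod> -- grep -A3 "my-service" /ambassador/snapshots/snapshot.yaml
# Roll the backing deployment, then re-run the grep; the endpoint IPs in the
# snapshot should change to the new pod IPs, but on 1.12.x they did not.
$ kubectl -n my-namespace rollout restart deployment my-deployment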
Can you post the Mapping resources for which you are experiencing this behavior?
Here is an example of a mapping:
$ kubectl -n kangaroo277id100006 describe mapping ambassador-ms-service-0
Name:         ambassador-ms-service-0
Namespace:    kangaroo277id100006
Labels:       app.kubernetes.io/managed-by=Helm
Annotations:  getambassador.io/resource-changed: true
              meta.helm.sh/release-name: amb-rules
              meta.helm.sh/release-namespace: kangaroo277id100006
API Version:  getambassador.io/v2
Kind:         Mapping
Metadata:
  Creation Timestamp:  2021-04-08T19:53:49Z
  Generation:          1
  Managed Fields:
    API Version:  getambassador.io/v2
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:getambassador.io/resource-changed:
          f:meta.helm.sh/release-name:
          f:meta.helm.sh/release-namespace:
        f:labels:
          .:
          f:app.kubernetes.io/managed-by:
      f:spec:
        .:
        f:connect_timeout_ms:
        f:host:
        f:idle_timeout_ms:
        f:load_balancer:
          .:
          f:cookie:
            .:
            f:name:
            f:path:
            f:ttl:
          f:policy:
        f:prefix:
        f:resolver:
        f:rewrite:
        f:service:
        f:timeout_ms:
    Manager:    Go-http-client
    Operation:  Update
    Time:       2021-04-08T19:53:49Z
  Resource Version:  7903038
  Self Link:         /apis/getambassador.io/v2/namespaces/kangaroo277id100006/mappings/ambassador-ms-service-0
  UID:               6653b0e3-3d3c-4260-a945-284e807a66f7
Spec:
  connect_timeout_ms:  6000
  Host:                wcdp-windchill-kangaroo277id100006.rd-plm-devops.bdns.ptc.com
  idle_timeout_ms:     5000000
  load_balancer:
    Cookie:
      Name:  sticky-cookie-0
      Path:  /Windchill
      Ttl:   600s
    Policy:  ring_hash
  Prefix:    /Windchill
  Resolver:  endpoint
  Rewrite:   /Windchill
  Service:   ms-service-kangaroo-0.kangaroo277id100006.svc.cluster.local:8080
  timeout_ms:  0
Events:        <none>
I believe if you drop the .svc.cluster.local from the service name it should fix the problem. When you are using the Kubernetes endpoint routing resolver, the service field refers directly to a Kubernetes resource, not to a DNS name. The .svc.cluster.local suffix is added by the Kubernetes DNS server, so it is a bit weird to use it when you aren't doing a DNS lookup.
That said, this is still a bug, because we used to allow that and a) we shouldn't disallow it without a deprecation period, and b) we should also be logging it as an error.
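For reference, here is a sketch of what the corrected Mapping spec from the example above would look like with the endpoint resolver (reconstructed from the describe output, not a tested manifest):

apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: ambassador-ms-service-0
  namespace: kangaroo277id100006
spec:
  prefix: /Windchill
  rewrite: /Windchill
  resolver: endpoint
  load_balancer:
    policy: ring_hash
    cookie:
      name: sticky-cookie-0
      path: /Windchill
      ttl: 600s
  host: wcdp-windchill-kangaroo277id100006.rd-plm-devops.bdns.ptc.com
  connect_timeout_ms: 6000
  idle_timeout_ms: 5000000
  timeout_ms: 0
  # With the endpoint resolver, reference the Kubernetes Service directly;
  # only the .svc.cluster.local suffix is dropped.
  service: ms-service-kangaroo-0.kangaroo277id100006:8080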
Thanks! That appears to be exactly my issue. Re-reading the documentation, I see that the DNS name is not recommended, only that it might work. Why did I not notice this previously? I have tested this change successfully with a few mapping files.
@rhs - thanks for letting us know about a workaround. However, the problem here is that we were specifically told by someone from the Datawire team to use the .svc.cluster.local suffix for the service when they helped us update our application deployment for the Ambassador integration (I was not part of that discussion and only learned about that recommendation today, when we internally discussed testing the potential workaround).
We observed very similar behavior, except that the "no healthy upstream" error went away once the mapping was re-loaded. We tried removing the "http://" prefix from the service name, as was suggested in Slack for a similar situation, and it seemed to do the trick. But it is unclear what the underlying cause is and what the correct way of specifying mappings is to avoid this sort of scenario.
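To make the edit concrete, it was this kind of change in the Mapping (placeholder names, not our real services):

# Before: scheme included in the service reference
service: http://my-service.my-namespace:8080
# After: plain service reference, which seemed to behave correctly for us
service: my-service.my-namespace:8080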
This is fixed in Ambassador 1.13.0, which is now available.
Confirmed.
Thank you guys for the prompt fix!
We are still noticing this behavior on 1.13.5. Please advise.
@wissam-launchtrip can you go into a bit more detail? Are you seeing this exact issue or something similar? Anything that can help us verify the report and reproduce the issue for a possible fix 👍
No, actually it's a different issue. Upstream services get disconnected for no clear reason, and we get a "no healthy upstream" error. This happens a few hours after the last deployment in the cluster. If we make a new deployment in the cluster, the error disappears.
Description of the problem
I am facing a very strange problem. Our IT wants us to migrate our application testing pipeline to a new cluster. After deploying Ambassador with Helm (originally 1.12.0), I tested the deployments of our applications: all the deployments were successful, but on accessing the applications I consistently got a "no healthy upstream" error (the same deployment works in the old cluster).
At some point I learned about the 1.12.1 release and upgraded Ambassador to 1.12.1 with "helm upgrade". After that, all the old non-working application deployments started to work without any additional changes, but every new deployment had the same issue: the "no healthy upstream" error. Eventually Ambassador was upgraded to 1.12.2 with the same effect: the old non-working deployments started to work without any changes, and every new deployment got the "no healthy upstream" error.
Investigation of connectivity confirmed that the application is accessible with curl from the Ambassador pod, both via the application's Service and via the application pod directly. However, external requests to the application always ended up with "no healthy upstream".
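The checks were roughly along these lines (placeholder names; curl was run from inside the Ambassador pod):

# Via the application's Service:
$ kubectl -n ambassador exec -it <ambassador-pod> -- curl -sv http://my-service.my-namespace:8080/
# Via an application pod directly:
$ kubectl -n ambassador exec -it <ambassador-pod> -- curl -sv http://<app-pod-ip>:8080/

Both returned responses from the application, so the network path from the Ambassador pod to the workload was fine.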
Now, if the Ambassador pod is killed (the replica count was reduced to 1 to simplify log analysis) and the Deployment/ReplicaSet replaces it with a new pod, the issue is resolved: all non-working deployments start working (this was tested 3 times).
Details on the current deployment:
Is there something I might be missing during the deployment of Ambassador?
Expected behavior
All new application deployments start working without the need to restart Ambassador pods.
Versions:
Additional context
None. I am not sure whether this is a bug or not. I would appreciate any workaround for our environment.