Closed knlambert closed 4 years ago
Probably also related to #1661. I know it works if you increase the waiting time, but this shouldn't have to be done manually (in my case I'm using the Helm chart).
Hello,
Tested today with the 0.81.0, same problem. I've tested with some HTTPbin services using a different TLS certificate for each endpoint :
```yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    getambassador.io/config: |
      apiVersion: ambassador/v1
      kind: Mapping
      ambassador_id: staging
      name: alpha-httpbin_mapping
      prefix: /
      rewrite: /
      service: alpha-httpbin.1111-nonproduction-staging.svc.cluster.local
      host: alpha-httpbin-test.mgmt.staging.1111.domain.com
      ---
      apiVersion: ambassador/v1
      kind: TLSContext
      ambassador_id: staging
      name: alpha-httpbin_mapping
      hosts:
      - alpha-httpbin-test.mgmt.staging.1111.domain.com
      secret: alpha-httpbin-certificate.1111-nonproduction-staging
  creationTimestamp: "2019-09-30T16:15:52Z"
  name: alpha-httpbin
  namespace: 1111-nonproduction-staging
```
Here is what the Envoy /stats endpoint returns when it happens:
Ambassador check_alive:

```
ambassador seems to have died (23 seconds)
```

Envoy stats at the same time:
Thanks
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Still waiting, don't close.
Describe the bug
Ambassador's liveness probe reports "ambassador seems to have died" during k8s watch initialization.
To measure how long the outage lasts, I ran a simple bash script against a kubectl port-forward during initialization:
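A minimal sketch of such a probing loop (my exact script may differ; the port-forward target and polling interval are assumptions, while port 8877 and the check_alive path are Ambassador's admin defaults):

```shell
#!/usr/bin/env bash
# One-shot liveness check: print the HTTP status code returned by
# Ambassador's check_alive endpoint ("000" when it is unreachable).
check_once() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$1" 2>/dev/null || true
}

# Poll once per second for a bounded window and log each result with a
# timestamp, so the outage duration can be read from the log afterwards.
probe_loop() {
  local url="$1" seconds="${2:-60}"
  for _ in $(seq 1 "$seconds"); do
    echo "$(date +%T) HTTP $(check_once "$url")"
    sleep 1
  done
}

# Usage, with `kubectl port-forward deploy/ambassador 8877:8877` running:
#   probe_loop http://localhost:8877/ambassador/v0/check_alive 120
```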
Here is the post-mortem:
Ambassador is down for 24 s, which is a problem because in your official Helm chart the liveness probe's initial delay is only 30 s (by that point Ambassador has already been starting for 18 s, producing an infinite crash loop).
The liveness probe failure seems to coincide with the scan of the k8s watches, while Ambassador waits for them to start: it simply stops answering in the middle of the procedure. Here are some logs at 59:23.
Then it comes back up at 59:43, once all the watches are up to date:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Ambassador should not report that it has died during initialization when that is obviously not the case; it should simply report that it is not ready.
This is painful because the Helm chart sets the initial delay to only 30 s.
Our Ambassador is a shared service with a lot of apps on it, so it takes longer than that to start, and the liveness probe hits the 503 errors described above.
Kubernetes then kills the pod, and it becomes an infinite crash loop.
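As a stopgap under these assumptions (the numbers below are illustrative, not values from the chart), the probes can be overridden so the liveness probe tolerates the slow start while the readiness probe keeps traffic away until the watches are synced. The port and paths are Ambassador's admin defaults:

```yaml
# Illustrative probe overrides; tune delays to your own startup time.
livenessProbe:
  httpGet:
    path: /ambassador/v0/check_alive
    port: 8877
  initialDelaySeconds: 120   # comfortably past the ~24 s watch-sync outage
  periodSeconds: 10
  failureThreshold: 6        # tolerate ~60 s of consecutive failures
readinessProbe:
  httpGet:
    path: /ambassador/v0/check_ready
    port: 8877
  initialDelaySeconds: 30
  periodSeconds: 10
```

This only papers over the bug, though: the real fix is for check_alive to keep answering while the watches initialize.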
Versions
Additional context
We have a lot of services attached to this Ambassador, which unfortunately makes this bug look bigger than it really is. The fact that Ambassador has to load a huge number of certificates doesn't help, but the security team in the company doesn't want to issue wildcard certificates, and we can't afford to manage one Ambassador per team.