emissary-ingress / emissary

open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

Ambassador upgrade caused an increase in memory use(x8) #3329

Open · eroznik opened this issue 3 years ago

eroznik commented 3 years ago

Describe the bug
On our Ambassador API Gateway deployment we noticed a severe memory use increase (approximately 8x) when we upgraded the Helm chart from version 6.5.10 to 6.6.2 (Ambassador version from 1.8.1 to 1.12.2). After reviewing our setup and checking the various changelogs/docs we found the AMBASSADOR_LEGACY_MODE env config; after setting its value to true, memory usage dropped back to "normal", i.e. the level from before the upgrade.
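For reference, this is roughly how we set that flag through the chart values; a minimal sketch, assuming the ambassador Helm chart still renders its env map into the container's environment variables:

# values.yaml (sketch): entries under `env` are assumed to become
# environment variables on the Ambassador container
env:
  AMBASSADOR_LEGACY_MODE: "true"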

To Reproduce
Steps to reproduce the behavior:

  1. Upgrade Ambassador to version 1.12.2

Expected behavior
Memory use shouldn't increase when AMBASSADOR_LEGACY_MODE remains set to false.

Versions (please complete the following information):

Additional context
More information about our setup can be provided as needed, but for starters:

guongle-ssense commented 3 years ago

Hey, I have encountered the same jump in memory. It used to be 250MB to 400MB max with roughly the same number of Mappings. Ever since 1.11.0 (which raised CPU usage) and then 1.12.2, memory has been out of whack.

I checked with the community and the following was suggested:

Ambassador since 1.10/1.11 requires considerably more memory than it did before, having to do with more safety checks, validating the config, and drastically speeding up validation time to push the config to envoy. Could you try setting prune_unreachable_routes to true in the Module and see if it helps? This shrinks the size of the envoy config with the caveat that unfortunately you won’t gain any benefit for regex hosts.

spec:
  config:
    prune_unreachable_routes: true
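For anyone else trying this, the complete resource would look roughly like the sketch below (assuming the default Module, which has to be named ambassador, and the getambassador.io/v2 API used by Ambassador 1.x):

apiVersion: getambassador.io/v2
kind: Module
metadata:
  name: ambassador
spec:
  config:
    # drop Envoy routes that can never be matched, shrinking the generated config
    prune_unreachable_routes: true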

I am currently testing this solution out.

I had some members of my team check this out as well, and they noticed that with every version bump the Helm configuration files change (things get added, etc.), which might have caused this.

I would like to know eventually whether this is going to be the new normal (big pods) or whether it will be acknowledged as a bug.

dzkaraka commented 3 years ago

To continue on the same issue: after testing prune_unreachable_routes: true, the problem persists. Would you suggest any other solution? Note: I closed https://github.com/datawire/ambassador/issues/3414 for the time being as the issue is similar.

esmet commented 3 years ago

If you could provide a snapshot of the processes running inside an Ambassador container after the upgrade, it will help us determine whether the memory usage is abnormal or expected.

Generally speaking, I would think that an Ambassador pod would appreciate a ~1gb memory limit to have plenty of breathing room for validating Envoy configs and managing the control plane in memory.
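As a rough illustration of that sizing, the resources block on the Ambassador container would look something like the sketch below; the request values are placeholders, not official recommendations:

# sketch: container resources giving Ambassador roughly 1GB of headroom
resources:
  requests:
    cpu: "1"
    memory: 512Mi
  limits:
    memory: 1Gi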

guongle-ssense commented 3 years ago

Hi @esmet, I'm inside the pod itself. Where would I find that, or what would it look like?

eroznik commented 3 years ago

[screenshot attached] @esmet, something like this? This is a screenshot I made after the upgrade.

SimonOlenik commented 3 years ago

@dzkaraka @guongle-ssense We solved our issue by setting the AMBASSADOR_LEGACY_MODE env variable to TRUE. The new version then behaves the same as the old version. This is not an appropriate solution, but it works for us as a temporary one.

BR

guongle-ssense commented 3 years ago


@SimonOlenik Thanks for the input, but seeing that the official docs don't recommend it, I'm a bit iffy about it. I'll keep poking around for a permanent solution.

Worst case, at least it's been validated that it does help.

esmet commented 3 years ago

Thanks for posting that information. From the top output I can't tell if the busyambassador process that has a 1600Mb virtual size also has a high resident size. Try htop?

In general, depending on how much memory that process is actually using, Ambassador using more memory is now fairly normal. I would imagine that the previous 512MB wouldn't be enough.

eroznik commented 3 years ago

@esmet we'll try to provide better info through htop, as requested. Could you please provide some "expected estimates" of how much RAM/CPU the "new" version of Ambassador uses? On the previous version (or with legacy mode turned on) we have Ambassador pods with 1 CPU and 512MB RAM running just fine with 45 Mappings registered.

esmet commented 3 years ago

There's no hard and fast rule unfortunately. Are you using Endpoint routing by any chance?
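(For context, endpoint routing means Mappings resolve directly to pod endpoints instead of the service's cluster IP. A sketch of what that looks like in Ambassador 1.x follows; the resolver and Mapping names here are hypothetical.)

# sketch only; names are made up for illustration
apiVersion: getambassador.io/v2
kind: KubernetesEndpointResolver
metadata:
  name: endpoint
---
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: example-mapping
spec:
  prefix: /example/
  service: example-service
  resolver: endpoint        # route to endpoints via the resolver above
  load_balancer:
    policy: round_robin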

guongle-ssense commented 3 years ago

Here are the container and top views of the containers operating for us, @esmet.

[screenshots attached]

eroznik commented 3 years ago

@esmet, in our case we're using the default routing resolver; we don't have any overrides.

yogendratamang48 commented 3 years ago

@esmet we are also seeing this issue. This is our htop: [screenshot attached]

Our configuration:

Other configurations:

  AMBASSADOR_FAST_RECONFIGURE: "true"
  AMBASSADOR_DRAIN_TIME: 5
  AMBASSADOR_AMBEX_SNAPSHOT_COUNT: 0
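On the Deployment these end up as container environment variables, roughly like this sketch (same values as listed above):

env:
  - name: AMBASSADOR_FAST_RECONFIGURE
    value: "true"
  - name: AMBASSADOR_DRAIN_TIME
    value: "5"
  - name: AMBASSADOR_AMBEX_SNAPSHOT_COUNT
    value: "0"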

And this is our container memory usage over the last hour: [screenshot attached]

DaveOHenry commented 1 month ago

Our emissary-ingress pods serve 76 Mappings and in the past mostly needed less than 1GB of memory, but at some point the memory usage increased more and more. Currently it peaks at ~10GB before a pod is OOMKilled. This has caused us a lot of trouble. It looks like something is going wrong inside emissary during startup and also when applications are scaled up or down.

We tried out all the suggested config changes, disabled metrics, upgraded the version, downgraded again, but nothing helped. Unfortunately emissary is basically unusable for us. The only working solution so far has been to switch to another ingress controller like HAProxy or NGINX.