emissary-ingress / emissary

open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

statsd container leaking (0.28.0) #315

Closed · mradmila closed this issue 6 years ago

mradmila commented 6 years ago

Image: quay.io/datawire/statsd:0.28.0

There are 20 restarts in 3 days, all due to statsd. Luckily Envoy/Ambassador is fine:

po/ambassador-2434898577-662m7   2/2   Running   21   3d
po/ambassador-2434898577-7z156   2/2   Running   22   3d
po/ambassador-2434898577-rp04l   2/2   Running   23   3d

Logs:

13 Mar 11:36:34 - [1] reading config file: config.js
13 Mar 11:36:34 - server is up INFO

<--- Last few GCs --->

[1:0x558e2116c9a0] 18229956 ms: Mark-sweep 1400.0 (1427.0) -> 1400.0 (1428.0) MB, 1484.2 / 11.4 ms  allocation failure GC in old space requested
[1:0x558e2116c9a0] 18231481 ms: Mark-sweep 1400.0 (1428.0) -> 1400.0 (1427.0) MB, 1525.2 / 11.3 ms  last resort
[1:0x558e2116c9a0] 18232971 ms: Mark-sweep 1400.0 (1427.0) -> 1400.0 (1427.0) MB, 1489.8 / 11.0 ms  last resort

<--- JS stacktrace --->

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

mradmila commented 6 years ago

Sigh, this should NEVER be the case:

ambassador-2434898577-7z156   statsd       8m   2418Mi
ambassador-2434898577-7z156   ambassador   7m   93Mi

Ambassador (the thing I want) uses 93 MB, and statsd uses 2.5 GB :(

richarddli commented 6 years ago

Thanks! What kind of load are you running? If you're in a position to test a few obvious fixes, the statsd container is very simple (source here: https://github.com/datawire/ambassador/blob/develop/statsd/).

We could try upgrading the base Alpine image to mhart/alpine-node:9.

I looked at the default config.js options (https://github.com/etsy/statsd/blob/master/exampleProxyConfig.js), and I don't see any obvious configuration errors or missing options.

Are you actively using the statsd container? Any other ideas?
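(For anyone who wants to experiment with that bump: it is essentially a base-image change in the statsd Dockerfile. A rough sketch follows; the actual Dockerfile under datawire/ambassador/statsd may be laid out differently, and the config.js handling shown here is an assumption.)

# Sketch only; the real statsd Dockerfile may differ in layout.
FROM mhart/alpine-node:9
WORKDIR /application
# Same pinned etsy/statsd tarball that appears in the build output later in this thread.
RUN npm install https://api.github.com/repos/etsy/statsd/tarball/8d5363c
# Assumed: the repeater config is copied in as config.js.
COPY config.js ./
CMD ["node_modules/.bin/statsd", "config.js"]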

mradmila commented 6 years ago

Sorry, got swamped with other things. I will try the new alpine-node (actually, I've built the Docker container and hit some Python-related errors that don't give me a warm feeling about this). For now I've nuked the statsd container from the deployment, because I don't want to pay 2.5 GB in monitoring for 93 MB of useful compute. As horrible as that sounds (not monitoring Envoy), it's the right thing to do until the statsd container is stable. Are you able to repro this? I am getting this on a blank Ambassador, with no traffic whatsoever...

kflynn commented 6 years ago

...wait, which Python errors, and which Docker container?

Ambassador 0.29.0 will be out shortly; it has the upgraded Alpine node image, and we'd be very happy if you could try it.

kflynn commented 6 years ago

("out shortly" == "I'm running final test on it before shipping".)

mradmila commented 6 years ago

The statsd Docker container build complains about Python not being set up. It really wanted to rebuild some Node.js code but could not (I did not dive deep into it). Here's the error (which does not fail the Docker build, btw):

gyp ERR! configure error
gyp ERR! stack Error: Can't find Python executable "python", you can set the PYTHON env variable.
gyp ERR! stack     at PythonFinder.failNoPython (/usr/lib/node_modules/npm/node_modules/node-gyp/lib/configure.js:483:19)
gyp ERR! stack     at PythonFinder.<anonymous> (/usr/lib/node_modules/npm/node_modules/node-gyp/lib/configure.js:397:16)
gyp ERR! stack     at F (/usr/lib/node_modules/npm/node_modules/which/which.js:68:16)
gyp ERR! stack     at E (/usr/lib/node_modules/npm/node_modules/which/which.js:80:29)
gyp ERR! stack     at /usr/lib/node_modules/npm/node_modules/which/which.js:89:16
gyp ERR! stack     at /usr/lib/node_modules/npm/node_modules/which/node_modules/isexe/index.js:42:5
gyp ERR! stack     at /usr/lib/node_modules/npm/node_modules/which/node_modules/isexe/mode.js:8:5
gyp ERR! stack     at FSReqWrap.oncomplete (fs.js:170:21)
gyp ERR! System Linux 4.13.0-32-generic
gyp ERR! command "/usr/bin/node" "/usr/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js" "rebuild"
gyp ERR! cwd /application/node_modules/modern-syslog
gyp ERR! node -v v9.8.0
gyp ERR! node-gyp -v v3.6.2
gyp ERR! not ok
npm WARN saveError ENOENT: no such file or directory, open '/application/package.json'
npm notice created a lockfile as package-lock.json. You should commit this file.
npm WARN enoent ENOENT: no such file or directory, open '/application/package.json'
npm WARN application No description
npm WARN application No repository field.
npm WARN application No README data
npm WARN application No license field.
npm WARN optional SKIPPING OPTIONAL DEPENDENCY: modern-syslog@1.1.2 (node_modules/modern-syslog):
npm WARN optional SKIPPING OPTIONAL DEPENDENCY: modern-syslog@1.1.2 install: node-gyp rebuild
npm WARN optional SKIPPING OPTIONAL DEPENDENCY: Exit status 1

mradmila commented 6 years ago

This is during:

Step 3/5 : RUN npm install https://api.github.com/repos/etsy/statsd/tarball/8d5363c
 ---> Running in bcd043dcf306

> modern-syslog@1.1.2 install /application/node_modules/modern-syslog
> node-gyp rebuild

richarddli commented 6 years ago

I'm going to try https://github.com/bitly/statsdaemon. I'll most likely need a load-testing setup to repro the memory leak, though. @mradmila, are there any general stats you can share about your load that would help reproduce the situation here?

mradmila commented 6 years ago

There is no load, which is the sad part. I used it (Envoy/Ambassador) for 5 minutes, verified it routed properly, and left the cluster alone. I had 10 routes with gRPC, routing based on a custom HTTP header value. I came back after the weekend and saw ~25 restarts, all related to statsd. Every time I looked at top for that container, it was never using less than 900 MB. This is 100% reproducible for me; I've tried other clusters as well. This is on Azure AKS and ACS, if that matters (I doubt it).

richarddli commented 6 years ago

I set up a 3 node test cluster in GKE, with 3 HTTP routes. It's been 17 hours, and I see that the statsd container is hovering around 28MB. Will leave it running for another few days to see if it grows.

Note: this is with statsd 0.29 where we've upgraded the base statsd image to alpine-node:9.

mradmila commented 6 years ago

Great, I will give it a shot today. If I get different results, I will reply with more specific details on my deployment.

mradmila commented 6 years ago

I've tried 0.29.0 and watched the statsd container from the inside using top. It gains a megabyte every few seconds and is currently at ~650 MB. I will send you more details on the deployment, but it may have to be on Monday.

mradmila commented 6 years ago

Even without any routes the memory is climbing up. This is just a few seconds after starting up Ambassador:

  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    1     0 root     S     625m   1%   1   0% statsd config.js
   16     0 root     S     1588   0%   1   0% sh
   24    16 root     R     1524   0%   0   0% top

richarddli commented 6 years ago

OK, I see how you're getting that number. I get the same thing (630m for me) when I exec top in the statsd container. I was doing this:

test (⎈ |gke_datawireio_us-central1-a_rdl-amb-lt-test:default)$ kubectl top pod ambassador-79ff548968-9sxrk --containers
POD                           NAME         CPU(cores)   MEMORY(bytes)
ambassador-79ff548968-9sxrk   ambassador   7m           90Mi
ambassador-79ff548968-9sxrk   statsd       28m          29Mi

I haven't seen any restarts, though. Will keep it running.

richarddli commented 6 years ago

OK, here is what I see:

$ kubectl top pod ambassador-79ff548968-lvq86 --containers
POD                           NAME         CPU(cores)   MEMORY(bytes)
ambassador-79ff548968-lvq86   ambassador   6m           102Mi
ambassador-79ff548968-lvq86   statsd       30m          28Mi

$ kubectl get pods
NAME                          READY     STATUS    RESTARTS   AGE
ambassador-79ff548968-9sxrk   2/2       Running   0          3d
ambassador-79ff548968-lvq86   2/2       Running   0          3d
ambassador-79ff548968-vp65m   2/2       Running   0          3d

richarddli commented 6 years ago

However, an exec to the pod shows:

Mem: 1587516K used, 2206844K free, 6248K shrd, 150528K buff, 849608K cached
CPU:  10% usr  10% sys   0% nic  80% idle   0% io   0% irq   0% sirq
Load average: 0.20 0.62 0.53 4/302 28
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    1     0 root     S     644m  17%   0   0% statsd config.js

Regardless, though, no restarts.

richarddli commented 6 years ago

I still can't reproduce it, although we've seen other reports of this issue, so I'm sure it occurs ;-). I've investigated a few different approaches to solving it, summarized below.

The statsd container functions as a UDP repeater, since Envoy can only route statsd packets to a pre-determined IP address. Thus, we flush statsd packets to localhost, and then have the statsd sidecar repeat the UDP packets to statsd-sink.
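(For context, the repeater setup above boils down to a config.js along these lines. This is a minimal sketch using etsy/statsd's built-in repeater backend, not the exact file shipped in the quay.io/datawire/statsd image; the statsd-sink host comes from the comment above, while port 8125 is simply statsd's default.)

/*
 * Minimal repeater config.js sketch (assumed, not the shipped file).
 * Envoy flushes UDP statsd packets to this process on localhost:8125;
 * the repeater backend forwards every packet on to statsd-sink.
 */
{
  port: 8125,
  backends: ["./backends/repeater"],
  repeater: [{ host: "statsd-sink", port: 8125 }],
  repeaterProtocol: "udp4"
}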

We can:

  1. Switch to an alternative statsd implementation. The best one I've found is Bitly's Go port, but I'm not sure whether it can do the repeater functionality.
  2. Use a plain ol' UDP repeater. I couldn't find one on Google that seemed any good.
  3. Switch to using TCP for statsd, since Envoy does support DNS-addressable routing over TCP. This would require a rework of the statistics/monitoring section. I haven't done an exhaustive search, but it seems that TCP-based statsd is not as well supported as UDP.

richarddli commented 6 years ago

@mradmila OK, I have an alternative implementation that uses socat. If you're willing to give it a test, I've got a test image up. You should be able to replace the statsd image in the standard deployment YAML with image: richarddli/socat:3.
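(For anyone following along, the swap is a one-line image change on the statsd sidecar in the deployment manifest. A sketch, with the surrounding Deployment spec omitted and the exact field layout of the stock Ambassador YAML assumed:)

spec:
  template:
    spec:
      containers:
      - name: statsd
        image: richarddli/socat:3   # was the quay.io/datawire/statsd image

On a running cluster, kubectl set image deployment/ambassador statsd=richarddli/socat:3 should make the same change in place, assuming the deployment and container names above.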

mradmila commented 6 years ago

Thanks... sorry I've been quiet, I am heads-down on something this week. I will try this on Monday, maybe tomorrow if all goes well. Thanks for looking into this. I'm surprised you're not hitting the same issue; for me this reproduces 100% of the time. The only variable, perhaps, is the Azure/K8s version. Will let you know once I try it.

richarddli commented 6 years ago

Super, thanks. I would really like to try to find a reproducer to get into our test suite. We've also had reports from other users who are not on Azure, so I don't think it's Azure. I'm on GKE, running 1.8.7.

mradmila commented 6 years ago

Looks much better with socat. Basically, no wasted memory whatsoever.

danilina-wsib commented 6 years ago

Hi, I have the same problem on Azure/AKS: Ambassador 0.29, with the statsd container leaking memory and showing high CPU on an idle system.

richarddli commented 6 years ago

The fix is on the develop branch and should be in the next release of Ambassador. If this is an issue for you right now, please try the socat image referenced above (richarddli/socat:3) and let us know how it goes.