Closed mradmila closed 6 years ago
Sigh, this should NEVER be the case:
ambassador-2434898577-7z156 statsd     8m 2418Mi
ambassador-2434898577-7z156 ambassador 7m 93Mi
Ambassador (the thing I want) uses 93 MB, and statsd uses 2.5 GB :(
Thanks! What kind of load are you running? If you're in a position to test a few obvious fixes, the statsd container is very simple (source here: https://github.com/datawire/ambassador/blob/develop/statsd/).
We could try to upgrade the base Alpine image to mhart/alpine-node:9?
I looked at the default config.js options (https://github.com/etsy/statsd/blob/master/exampleProxyConfig.js), and I don't see any obvious configuration errors. Are you actively using the statsd container? Any other ideas?
Sorry, got swamped with other things. I will try the new alpine-node image (I've actually built the Docker container and hit some Python-related errors that don't give me a warm feeling about this). For now I've removed the statsd container from the deployment, because I don't want to pay 2.5 GB of monitoring overhead for 93 MB of useful compute. As bad as that sounds (no monitoring for Envoy), it's the right thing to do until the statsd container is stable. Are you able to repro this? I'm getting it on a blank Ambassador, with no traffic whatsoever...
...wait, which Python errors, and which Docker container?
Ambassador 0.29.0 will be out shortly; it has the upgraded Alpine node, and we'd be very happy if you can try that.
("out shortly" == "I'm running final test on it before shipping".)
The statsd Docker container build complains about Python not being set up. It really wanted to rebuild some Node.js code but couldn't (I didn't dive deep into it). Here's the error (which does not fail the Docker build, btw):
gyp ERR! configure error
gyp ERR! stack Error: Can't find Python executable "python", you can set the PYTHON env variable.
gyp ERR! stack at PythonFinder.failNoPython (/usr/lib/node_modules/npm/node_modules/node-gyp/lib/configure.js:483:19)
gyp ERR! stack at PythonFinder.node-gyp rebuild
npm WARN optional SKIPPING OPTIONAL DEPENDENCY: Exit status 1
This is during:
Step 3/5 : RUN npm install https://api.github.com/repos/etsy/statsd/tarball/8d5363c
 ---> Running in bcd043dcf306
> modern-syslog@1.1.2 install /application/node_modules/modern-syslog
> node-gyp rebuild
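(Editorial note: the failure above is node-gyp, which compiles the modern-syslog native addon, not finding a Python interpreter in the Alpine base image. A hedged sketch of a Dockerfile fix, assuming the mhart/alpine-node:9 base discussed above; the tarball URL is the one from the build log, and the package set is the usual node-gyp toolchain for Alpine, not Ambassador's actual Dockerfile:)

```dockerfile
FROM mhart/alpine-node:9
# node-gyp needs a Python interpreter and a C toolchain to build native addons
RUN apk add --no-cache python make g++
RUN npm install https://api.github.com/repos/etsy/statsd/tarball/8d5363c
```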
I'm going to try https://github.com/bitly/statsdaemon. I'll most likely need a load-testing setup to repro the memory leak, though. @mradmila are there any general stats you can share about your load that would help repro it here?
There is no load, which is the sad part. I used Envoy/Ambassador for 5 minutes, verified it routed properly, and left the cluster alone. I had 10 routes with gRPC, routing based on a custom HTTP header value. I came back after the weekend and saw ~25 restarts, all related to statsd. Every time I looked at top for that container, it was never using less than 900MB. This reproduces 100% for me; I've tried other clusters as well. This is on Azure AKS and ACS, if that matters (I doubt it).
I set up a 3-node test cluster in GKE with 3 HTTP routes. It's been 17 hours, and the statsd container is hovering around 28MB. Will leave it running for another few days to see if it grows.
Note: this is with Ambassador 0.29, where we've upgraded the statsd base image to alpine-node:9.
Great, I will give it a shot today. If I get different results, I will reply with more specific details on my deployment.
I've tried 0.29.0 and watched the statsd container from the inside using top. It gains a MB every few seconds and is currently at ~650 MB. I will send you more details on the deployment, but it may have to wait until Monday.
Even without any routes the memory is climbing up. This is just a few seconds after starting up Ambassador:
PID  PPID USER STAT VSZ  %VSZ CPU %CPU COMMAND
  1     0 root S    625m   1%   1   0% statsd config.js
 16     0 root S    1588   0%   1   0% sh
 24    16 root R    1524   0%   0   0% top
OK, I see how you're getting to this. I get the same thing (630m for me) when I exec top on the statsd container. I was doing this:
test (⎈ |gke_datawireio_us-central1-a_rdl-amb-lt-test:default)$ kubectl top pod ambassador-79ff548968-9sxrk --containers
POD NAME CPU(cores) MEMORY(bytes)
ambassador-79ff548968-9sxrk ambassador 7m 90Mi
ambassador-79ff548968-9sxrk statsd 28m 29Mi
I haven't seen any restarts, though. Will keep it running.
OK, here is what I see:
$ kubectl top pod ambassador-79ff548968-lvq86 --containers
POD NAME CPU(cores) MEMORY(bytes)
ambassador-79ff548968-lvq86 ambassador 6m 102Mi
ambassador-79ff548968-lvq86 statsd 30m 28Mi
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
ambassador-79ff548968-9sxrk 2/2 Running 0 3d
ambassador-79ff548968-lvq86 2/2 Running 0 3d
ambassador-79ff548968-vp65m 2/2 Running 0 3d
However, an exec to the pod shows:
Mem: 1587516K used, 2206844K free, 6248K shrd, 150528K buff, 849608K cached
CPU: 10% usr 10% sys 0% nic 80% idle 0% io 0% irq 0% sirq
Load average: 0.20 0.62 0.53 4/302 28
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
1 0 root S 644m 17% 0 0% statsd config.js
Regardless, though, no restarts.
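(Editorial note: one plausible reason the two numbers disagree, which the thread doesn't spell out: top's VSZ column reports virtual address space, which Node/V8 reserves generously up front, while kubectl top reports the container's resident working set. A small illustrative Python sketch that reads both figures from Linux's /proc; the function name is ours, not from any tool in the thread:)

```python
def vm_sizes(pid="self"):
    """Return VmSize (virtual size) and VmRSS (resident set) in kB from /proc.

    VSZ in `top` corresponds to VmSize; kubectl top and similar tools track
    resident/working-set memory, which can be far smaller than VmSize.
    """
    sizes = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("VmSize:", "VmRSS:")):
                key, value = line.split(":", 1)
                sizes[key] = int(value.split()[0])  # value is "<n> kB"
    return sizes
```

Running this inside the statsd container would show a large VmSize alongside a much smaller VmRSS, matching the 644m-vs-28Mi gap above.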
I still can't reproduce, although we've seen other reports of this issue, so I'm sure it occurs ;-). I investigated some different approaches to solving this issue; summarizing below.
The statsd container functions as a UDP repeater, since Envoy can only route statsd packets to a pre-determined IP address. Thus, we flush statsd packets to localhost, and then have the statsd sidecar repeat the UDP packets to statsd-sink.
We can:
- Use a different statsd implementation. The best one I've found is Bitly's golang port, but I'm not sure it can do the repeater functionality.
- Switch to TCP-based statsd, where Envoy does support DNS-addressable routing. This would require a rework of the statistics/monitoring section. I haven't done an exhaustive search, but it seems that TCP-based statsd is not as well supported as UDP.

@mradmila OK, I have an alternative implementation that uses socat. If you're willing to give it a test, I've got a test image up. You should be able to replace the statsd image in the standard deployment YAML with image: richarddli/socat:3.
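(Editorial note: the repeater role described above is simple enough to sketch in a few lines. This is an illustrative Python version only; the real sidecar is the Node statsd process, and socat performs the same relay natively. Ports and the statsd-sink host name follow the discussion above:)

```python
import socket

def udp_repeat(listen_sock, sink_addr):
    """Forward every UDP datagram received on listen_sock to sink_addr,
    unchanged -- the job the statsd sidecar does for Envoy's stats."""
    out = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        packet, _ = listen_sock.recvfrom(65535)  # one statsd metric batch
        out.sendto(packet, sink_addr)

# Envoy flushes to localhost; the repeater relays to the statsd-sink service:
# listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# listener.bind(("127.0.0.1", 8125))
# udp_repeat(listener, ("statsd-sink", 8125))  # blocks forever
```

Because the repeater never parses or aggregates the metrics, it has no per-metric state to accumulate, which is why a dumb relay like socat sidesteps the leak entirely.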
Thanks... sorry I was quiet, I'm heads-down on something this week. I will try this on Monday, maybe tomorrow if all goes well. Thanks for looking into this. I'm surprised you're not hitting the same issue; for me this reproduces 100% of the time. The only variable is perhaps the Azure/K8s version. Will let you know once I try it.
Super, thanks. I would really like to try to find a reproducer to get into our test suite. We've also had reports from other users who are not on Azure, so I don't think it's Azure. I'm on GKE, running 1.8.7.
Looks much better with socat. Basically, no wasted memory whatsoever.
Hi, I have the same problem on Azure/AKS: Ambassador 0.29, with the statsd container leaking memory and showing high CPU on an idle system.
The fix is on the develop branch and should be in the next release of Ambassador. If this is an issue for you right now, please try the socat image referenced above (richarddli/socat:3) and let us know how it goes.
Image: quay.io/datawire/statsd:0.28.0
There are ~20 restarts in 3 days, all due to statsd. Luckily Envoy/Ambassador is fine:
po/ambassador-2434898577-662m7 2/2 Running 21 3d
po/ambassador-2434898577-7z156 2/2 Running 22 3d
po/ambassador-2434898577-rp04l 2/2 Running 23 3d
Logs:
13 Mar 11:36:34 - [1] reading config file: config.js
13 Mar 11:36:34 - server is up INFO
<--- Last few GCs --->
[1:0x558e2116c9a0] 18229956 ms: Mark-sweep 1400.0 (1427.0) -> 1400.0 (1428.0) MB, 1484.2 / 11.4 ms  allocation failure GC in old space requested
[1:0x558e2116c9a0] 18231481 ms: Mark-sweep 1400.0 (1428.0) -> 1400.0 (1427.0) MB, 1525.2 / 11.3 ms  last resort
[1:0x558e2116c9a0] 18232971 ms: Mark-sweep 1400.0 (1427.0) -> 1400.0 (1427.0) MB, 1489.8 / 11.0 ms  last resort
<--- JS stacktrace --->
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory