18F / api.data.gov

A hosted, shared-service that provides an API key, analytics, and proxy solution for government web services.
https://api.data.gov

Investigate memory leak #296

Closed · GUI closed this issue 7 years ago

GUI commented 8 years ago

Things have gone pretty smoothly with the rollout of the new stack, with the exception of a memory leak that still seems to be hanging around. I discussed it some in the rollout issue, but since that's complete and the memory issue is unfortunately still persisting, I wanted to create a separate story to track it.

The gist of the problem is that the nginx worker processes in the stack are slowly leaking memory in production (which obviously isn't good). The bigger issue is that I'm having a rather difficult time reproducing the problem in any type of controlled environment, which is making it exceedingly hard to debug. Since I can't replicate the conditions, I've run systemtap tools on the live production servers, but even those don't seem to show anything useful.
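If anyone wants to watch the growth themselves, a rough sketch like the following is enough to see the slow per-worker climb (it assumes a Linux /proc filesystem and that workers can be identified by the "nginx: worker process" title; the sampling interval is arbitrary):

```python
#!/usr/bin/env python3
"""Sample the resident set size (RSS) of each nginx worker over time.

Rough sketch only: assumes Linux /proc and that worker processes are
titled "nginx: worker process".
"""
import os
import time


def nginx_worker_rss():
    """Return {pid: rss_kib} for processes titled "nginx: worker process"."""
    rss = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/cmdline" % pid, "rb") as f:
                title = f.read().replace(b"\x00", b" ")
            if b"nginx: worker process" not in title:
                continue
            with open("/proc/%s/status" % pid) as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        rss[int(pid)] = int(line.split()[1])  # value is in kB
        except OSError:
            continue  # process exited between listing and reading
    return rss


if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), sorted(nginx_worker_rss().items()))
        time.sleep(60)
```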

We need to get to the bottom of this, but I will mention that the memory growth can be alleviated by an nginx SIGHUP reload (which shouldn't incur any downtime), so it is pretty easy to bandaid the issue in an ugly way by reloading nginx every so often. In addition, after a few days of leaking memory, the new stack is still consuming less memory than the old stack, so the leak isn't super-severe and our servers aren't in critical danger of running out of memory unexpectedly. So those were the main reasons I still felt comfortable pushing forward with the new stack in production and having this memory leak be a known issue.

In terms of reproducing this issue in a more controlled, non-production environment, I've tried all sorts of combinations of request and response types to seemingly no avail. It's possible I'm not striking on the right combination or my local tests are somehow flawed, but here's a list of things I've tried hammering the server with locally without seeing the memory growth (I've let most of these run for several hours with more traffic than we see on production):

Here are a couple more things I can think to try after writing out that list:

I'm going to continue debugging and exploring this, but in the meantime, I'm also planning on adding a script we can enable for the ugly bandaid fix, which will reload nginx every so often. That will mostly sidestep the issue, but at some point hopefully we'll get to the bottom of the real issue.
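As a sketch of what that reload script might look like (the pid file path and interval below are placeholders, not what we'd actually ship): sending the nginx master process a SIGHUP makes it start fresh workers and gracefully retire the old ones, so in-flight requests aren't dropped.

```python
#!/usr/bin/env python3
"""Periodically SIGHUP the nginx master so it replaces its workers.

Rough sketch of the bandaid described above; the pid file path and the
reload interval are placeholders.
"""
import os
import signal
import time

NGINX_PID_FILE = "/var/run/nginx.pid"   # placeholder path
RELOAD_INTERVAL = 6 * 60 * 60           # placeholder: every 6 hours


def reload_nginx():
    with open(NGINX_PID_FILE) as f:
        master_pid = int(f.read().strip())
    # SIGHUP tells the master to start new workers; the old workers finish
    # their in-flight requests and then exit, so the reload is zero-downtime.
    os.kill(master_pid, signal.SIGHUP)


if __name__ == "__main__":
    while True:
        time.sleep(RELOAD_INTERVAL)
        reload_nginx()
```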

GUI commented 7 years ago

We still see some slow memory growth, so while it would be nice to get to the bottom of this eventually, our workaround of reloading nginx is integrated into API Umbrella and working fine. Closing this, since it isn't really a priority.