18F / api.data.gov

A hosted, shared-service that provides an API key, analytics, and proxy solution for government web services.
https://api.data.gov

Investigate memory leak #296

Closed · GUI closed this issue 7 years ago

GUI commented 8 years ago

Things have gone pretty smoothly with the rollout of the new stack, with the exception of a memory leak that still seems to be hanging around. I discussed it some in the rollout issue, but since that's complete and the memory issue is unfortunately still persisting, I wanted to create a separate story to track it.

The gist of the problem is that the nginx worker processes in the stack are slowly leaking memory in production (which obviously isn't good). The bigger issue is that I'm having a rather difficult time reproducing the problem in any type of controlled environment, which is making it exceedingly hard to debug. Since I can't replicate the conditions, I've run systemtap tools on the live production servers, but even those don't seem to show anything useful.
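If anyone wants to watch the growth themselves, a rough sketch like the following is enough to see the slow per-worker climb (it assumes a Linux /proc filesystem and that workers can be identified by the "nginx: worker process" title; the sampling interval is arbitrary):

```python
#!/usr/bin/env python3
"""Sample the resident set size (RSS) of each nginx worker over time.

Rough sketch only: assumes Linux /proc and that worker processes are
titled "nginx: worker process".
"""
import os
import time


def nginx_worker_rss():
    """Return {pid: rss_kib} for processes titled "nginx: worker process"."""
    rss = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/cmdline" % pid, "rb") as f:
                title = f.read().replace(b"\x00", b" ")
            if b"nginx: worker process" not in title:
                continue
            with open("/proc/%s/status" % pid) as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        rss[int(pid)] = int(line.split()[1])  # value is in kB
        except OSError:
            continue  # process exited between listing and reading
    return rss


if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), sorted(nginx_worker_rss().items()))
        time.sleep(60)
```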

We need to get to the bottom of this, but I will mention that the memory growth can be alleviated by an nginx SIGHUP reload (which shouldn't incur any downtime), so it is pretty easy to bandaid the issue in an ugly way by reloading nginx every so often. In addition, after a few days of leaking memory, the new stack is still consuming less memory than the old stack, so the leak isn't super-severe and our servers aren't in critical danger of running out of memory unexpectedly. So those were the main reasons I still felt comfortable pushing forward with the new stack in production and having this memory leak be a known issue.

In terms of reproducing this issue in a more controlled, non-production environment, I've tried all sorts of combinations of request and response types to seemingly no avail. It's possible I'm not striking on the right combination or my local tests are somehow flawed, but here's a list of things I've tried hammering the server with locally without seeing the memory growth (I've let most of these run for several hours with more traffic than we see on production):

Here are a couple more things I can think to try after writing out that list:

I'm going to continue debugging and exploring this, but in the meantime, I'm also planning on adding a script we can enable for the ugly bandaid fix, which will reload nginx every so often. That will mostly sidestep the issue, but at some point hopefully we'll get to the bottom of the real issue.
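As a sketch of what that reload script might look like (the pid file path and interval below are placeholders, not what we'd actually ship): sending the nginx master process a SIGHUP makes it start fresh workers and gracefully retire the old ones, so in-flight requests aren't dropped.

```python
#!/usr/bin/env python3
"""Periodically SIGHUP the nginx master so it replaces its workers.

Rough sketch of the bandaid described above; the pid file path and the
reload interval are placeholders.
"""
import os
import signal
import time

NGINX_PID_FILE = "/var/run/nginx.pid"   # placeholder path
RELOAD_INTERVAL = 6 * 60 * 60           # placeholder: every 6 hours


def reload_nginx():
    with open(NGINX_PID_FILE) as f:
        master_pid = int(f.read().strip())
    # SIGHUP tells the master to start new workers; the old workers finish
    # their in-flight requests and then exit, so the reload is zero-downtime.
    os.kill(master_pid, signal.SIGHUP)


if __name__ == "__main__":
    while True:
        time.sleep(RELOAD_INTERVAL)
        reload_nginx()
```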

GUI commented 7 years ago

We still see some slow memory growth, so while it would be nice to get to the bottom of this eventually, our workaround of reloading nginx is integrated into API Umbrella and working fine. Closing this, since it isn't really a priority.