18F / api.data.gov

A hosted, shared-service that provides an API key, analytics, and proxy solution for government web services.
https://api.data.gov

The Lua rollout plan #294

Closed GUI closed 8 years ago

GUI commented 8 years ago

We have a significant update to the API Umbrella platform we're going to be releasing: https://github.com/NREL/api-umbrella/pull/183. This issue is to coordinate how we're going to update the api.data.gov stack with this update.

GUI commented 8 years ago

I've completed the initial rollout for testing against the NREL APIs. So far, things have looked pretty good, with no big issues (and lots of nice benefits like lower memory use, CPU use, etc.). A few small things have cropped up, mostly around edge cases that the live traffic has helped pinpoint (e.g., geocoding results for analytics containing a city, but no state/region). I've been fixing those issues and keeping an eye on traffic, but otherwise there's nothing really impacting functionality.

We'll continue to monitor things, but I think we can reach out to agencies soon about planning the wider rollout.

GUI commented 8 years ago

As a quick update, I've been seeing some unexpected memory growth on the production system running the new stack. It's not happening super quickly, but it's something I've been looking into, so I wanted to make note of it for reference.

This memory growth didn't show up in the multi-day stress tests I ran, but I have a couple of theories as to what's going on with production now:

I'm hoping it's the first option, since that means we don't really have a memory leak, just a slowly-filling cache that does have an eventual cap. And there are some recent signs that maybe point towards that:

[Screenshot: memory usage graph, 2015-11-04, 8:13 PM]

I reloaded things at around 7 AM, so that's the big drop-off, but the rate of increase does appear to be leveling off now. It's still increasing some; based on some calculations, I had sort of expected that to stop by now, but we'll see how this looks tomorrow given another good chunk of hours. Prior to the 7 AM reload, I also made some tweaks to better tune the default sizes of our shared memory dicts inside nginx, so that might also be helping.
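To make the "slowly-filling cache with an eventual cap" idea concrete, here is a minimal Python sketch. It is not code from API Umbrella: `BoundedCache` and `simulate_growth` are hypothetical names, and the cap is expressed as an entry count rather than bytes, which is a simplification of how a fixed-size shared memory dict behaves. The point is only that a capped cache grows steadily and then plateaus, whereas a real leak keeps climbing.

```python
from collections import OrderedDict


class BoundedCache:
    """Tiny LRU cache with a hard cap on entry count (illustrative only)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._items = OrderedDict()

    def set(self, key, value):
        # Re-inserting an existing key marks it as most recently used.
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        # Once over the cap, evict the least recently used entry.
        if len(self._items) > self.max_entries:
            self._items.popitem(last=False)

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)
        return self._items[key]

    def __len__(self):
        return len(self._items)


def simulate_growth(cache, total_keys, sample_every):
    """Insert distinct keys and record the cache size at regular intervals."""
    sizes = []
    for i in range(total_keys):
        cache.set(f"key-{i}", i)
        if (i + 1) % sample_every == 0:
            sizes.append(len(cache))
    return sizes


if __name__ == "__main__":
    # Sizes climb until the cap of 300 is reached, then stay flat.
    print(simulate_growth(BoundedCache(300), 1000, 100))
```

If a memory graph keeps rising well past the point where every cache like this should have hit its cap, that points away from the cache theory and toward an actual leak.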

GUI commented 8 years ago

Status update on the memory growth: Despite things appearing to level off, the memory usage continued to grow. I think I've tracked it down to the geoip2 module. The memory growth was easily reproducible when making requests from many different IP addresses. I think this should be resolved by switching to the geoip module that's built into nginx and uses the legacy dataset. After some deeper digging and testing, this switch should also improve overall memory usage. More details in this commit message: https://github.com/NREL/api-umbrella/commit/19f22834bc032c2b948dd18a17ecf0c06ab5dfe2
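For anyone wanting to try a similar reproduction, one possible way to produce "many different IP addresses" for a load test is sketched below in Python. This is not necessarily how the reproduction was done here. `distinct_ips` and `forwarded_headers` are hypothetical helpers, and passing the address in an `X-Forwarded-For` header is an assumption: it only has an effect if the proxy under test is configured to trust that header from the test client.

```python
import ipaddress
import random


def distinct_ips(count, seed=0):
    """Return `count` distinct, publicly routable IPv4 addresses as strings."""
    rng = random.Random(seed)  # seeded so test runs are repeatable
    seen = set()
    while len(seen) < count:
        ip = ipaddress.IPv4Address(rng.getrandbits(32))
        # Skip private, loopback, multicast, and other reserved ranges,
        # since those would not geocode to anything interesting.
        if ip.is_global:
            seen.add(str(ip))
    return sorted(seen)


def forwarded_headers(ip):
    """Build request headers that present `ip` as the client address.

    Assumption: the proxy trusts X-Forwarded-For from the test client.
    """
    return {"X-Forwarded-For": ip}


if __name__ == "__main__":
    for ip in distinct_ips(5):
        print(forwarded_headers(ip))
```

Feeding these headers into any HTTP load tool while watching the worker processes' memory would show whether usage grows with the number of distinct addresses seen.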

So we'll continue to keep an eye on that, but otherwise I think the plan is to announce the wider rollout for the week of November 16 and do a slow rollout that week to each agency domain.

GUI commented 8 years ago

A couple of status updates on the technical stuff:

And in terms of the general rollout, we announced our plans to roll out the changes to agencies this week. We're rolling things out to agencies one at a time on the following schedule:

GUI commented 8 years ago

Quick update for today: Things seem to be progressing well (knock on wood). The only notable issue discovered during the rollout this week was a pretty minor one. There was a bug that caused requests not to be logged in the analytics database if the request came from an IP address that geocoded to a city name that contained an accent or special character (Tórshavn, Faroe Islands is an example). This didn't affect a huge number of requests, but it has now been fixed.
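The thread doesn't describe the root cause or the actual fix, so the following is only a generic Python sketch of the kind of regression check that guards against this class of bug. It assumes, purely for illustration, that a city name may arrive either as UTF-8 or as Latin-1 bytes and that log records are serialized as JSON. `to_text`, `encode_log_record`, and `decode_log_record` are hypothetical names, not part of API Umbrella.

```python
import json


def to_text(raw):
    """Normalize a city name to str, tolerating UTF-8 or Latin-1 input bytes."""
    if isinstance(raw, str):
        return raw
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: treat undecodable bytes as Latin-1.
        return raw.decode("latin-1")


def encode_log_record(record):
    """Serialize a log record to UTF-8 JSON bytes, keeping non-ASCII intact."""
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


def decode_log_record(payload):
    """Parse UTF-8 JSON bytes back into a dict."""
    return json.loads(payload.decode("utf-8"))


if __name__ == "__main__":
    for raw_city in ("Tórshavn".encode("utf-8"), "Tórshavn".encode("latin-1")):
        record = {"city": to_text(raw_city), "country": "FO"}
        round_tripped = decode_log_record(encode_log_record(record))
        print(round_tripped["city"])
```

A test like this, run with a handful of accented and non-Latin city names, would catch a logging path that silently drops records it can't encode.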

GUI commented 8 years ago

The transition is fully complete! :star2: :star2: