NYCPlanning / labs-geosearch-docker

Main repository for running the Planning Labs geosearch API powered by pelias

Research Geosearch Outages #74

Closed TylerMatteo closed 6 months ago

TylerMatteo commented 7 months ago

Use this Issue to research and identify the root cause of several outages we have been experiencing with our Geosearch API.

Geosearch (see docs site here) is a geocoding API that uses the open source geocoder Pelias and data from DCP's PAD dataset to provide fast address search to most of our frontend applications. When you look up an address in Zola, it is using Geosearch under the hood.

Check out the README for the primary Geosearch repo for a detailed explanation of how Geosearch is deployed. At a high level, Geosearch is deployed as a VM (aka "droplet") on Digital Ocean. That VM is put behind a Digital Ocean load balancer, primarily for the purpose of having a static IP. The VM hosts the several applications that make up a running instance of Pelias, run via Docker and Docker Compose.

Starting roughly last fall, we have been having outages with Geosearch where all network calls to the Geosearch API suddenly start to fail with a 400 status code and a response that includes an errors array containing "getaddrinfo EAI_AGAIN libpostal". libpostal is one of the services that make up Pelias and, as such, is one of the containers defined in the Compose YAML file. As we haven't had time to investigate in detail, our solution has been to just deploy a fresh instance of Geosearch to a new VM and point traffic there. However, we now want to investigate these outages so that we can implement a more sustainable fix.
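For anyone reproducing this, here is a minimal Python sketch of how the failure signature described above could be recognized in a response. EAI_AGAIN is the getaddrinfo code for a temporary name-resolution failure, so the message suggests the hostname libpostal could not be resolved from whichever service was calling it. The function names are hypothetical, and because the exact location of the errors array in the response body is an assumption, the sketch searches the whole parsed body for any "errors" array rather than relying on a fixed path.

```python
import json

# Error text quoted in this issue.
FAILURE_SIGNATURE = "getaddrinfo EAI_AGAIN libpostal"


def collect_errors(node):
    """Recursively gather entries from every "errors" array in a parsed JSON body."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "errors" and isinstance(value, list):
                found.extend(str(item) for item in value)
            else:
                found.extend(collect_errors(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_errors(item))
    return found


def is_libpostal_lookup_failure(status_code, body_text):
    """Return True when a response matches the outage described in this issue."""
    if status_code != 400:
        return False
    try:
        payload = json.loads(body_text)
    except ValueError:
        # Body was not JSON, so it cannot carry the errors array.
        return False
    return any(FAILURE_SIGNATURE in message for message in collect_errors(payload))
```

Pass it the status code and raw body of any request sent to the malfunctioning VM; it returns False for healthy responses and for failures with a different cause.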

The last outage occurred on 1/17. We resolved it by deploying a new instance of Geosearch. However, we left the previous, malfunctioning instance running so that we can investigate it. The engineer doing this research will be able to replicate the failing requests by sending API calls directly to the IP of that previous VM. Because this Issue will be public, I won't be including that IP here, but I can show the engineers doing this work how to get it directly from Digital Ocean. I can also show them how to SSH into that VM to conduct investigation.

I'll be available to help with this and talk through what Geosearch is and how it works more generally. My best guess so far as to what is going on here is that it may be related to the VM running out of memory for some reason, possibly because of logs being written to disk within one of the containers. That kind of scenario would explain why new VMs work fine until they have been live for a couple of months or so.
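One way to check the disk side of this hypothesis on the malfunctioning VM is to see how full the filesystem is and which files are largest. The Python sketch below is only a starting point: the helper names are hypothetical, and which directory to scan is an assumption (with Docker's default json-file logging driver, container logs normally live under /var/lib/docker/containers, but logs written inside a container's own filesystem would be elsewhere).

```python
import os
import shutil


def percent_used(path="/"):
    """Return how full the filesystem containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(full_path), full_path))
            except OSError:
                # File vanished or is unreadable (e.g. a broken symlink); skip it.
                continue
    sizes.sort(reverse=True)
    return sizes[:limit]
```

For example, running percent_used("/") and largest_files("/var/lib/docker") as root on the VM would show whether a few ever-growing log files account for most of the used space.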

Acceptance Criteria:
[] - Identify root cause of Geosearch outages
[] - Document findings by opening a thread in the Discussion section of the ae-private repo. This discussion should also put forward potential solutions to be discussed and turned into Issues.

pratishta commented 7 months ago

Discussion here: https://github.com/NYCPlanning/ae-private/discussions/9

TylerMatteo commented 6 months ago

I just deployed @pratishta's changes to prod and cleaned up any unnecessary Geosearch VMs. We'll keep an eye on the disk usage of the new VM but we can consider the research and implementation of this issue resolved.