AmericanRedCross / osm-stats

Track and analyze contributions to Missing Maps
http://missingmaps.org
BSD 3-Clause "New" or "Revised" License

API unresponsive #41

Closed: kamicut closed this issue 7 years ago

kamicut commented 7 years ago

The API is not responsive to certain requests. This could be an optimization issue at the database level or server level.

Strategies to fix:

cc @dalekunce @matthewhanson @ianschuler

matthewhanson commented 7 years ago

This appears to have been an intermittent problem. Earlier the API was hanging, and when I tried direct queries to the DB they took a very long time to respond. Now, the site is working again, and DB queries are instant.

We thought that perhaps there was a database issue due to the number of UPDATE queries we make; over time these can bloat the tables and require regular VACUUMing. Normally this is a DB maintenance task that runs automatically, and it could be that it ran and fixed the issue. We will look into the logs to see if anything ran over the last few hours.
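
A minimal sketch of how we could check whether autovacuum has been running, assuming direct psql access to the stats database (the connection settings below are placeholders):

```sh
# Show when each table was last vacuumed/analyzed (manually or by autovacuum)
# and how many dead tuples have accumulated. Connection details are placeholders.
psql -h localhost -U osmstats -d osmstats -c "
  SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze, n_dead_tup
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC;"
```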

Alternatively, the problem could have been due to a period of extreme activity, although I'm not aware that we've had any such major problem during big mapathons that shut down the whole site, so that seems unlikely.

It also could have been due to an intermittent AWS problem, which will be impossible to confirm unless they announce problems.

We will leave this ticket open while we do a post-mortem on the problem and see if there are any mitigation strategies that could be adopted.

cc @dalekunce @kamicut @ianschuler

matthewhanson commented 7 years ago

The hanging happened again, and continues to happen intermittently. A restart of the API container always fixes it, for a time. If we make 100 sequential calls to the hashtag endpoint it starts off fine, then blows up and starts hanging. See the plot below. If the container is restarted, another 100 calls show similar behavior.

@kamicut is investigating the API code.

$ for i in $(seq 1 100); do (time curl -s osmstats.redcross.org/hashtags) 2>&1 >/dev/null | grep real; done

[Plot: response times for 100 sequential curl calls to the /hashtags endpoint]

cc @dalekunce @ianschuler

matthewhanson commented 7 years ago

And I partially take back what I said above.

Last night this happened reliably. This morning I can make 100 calls and all are fine.

kamicut commented 7 years ago

@matthewhanson are we tracking inbound requests? At @dalekunce's suggestion we could look to see what API calls are being done and if there's any bot pattern.

matthewhanson commented 7 years ago

@kamicut I think that's a good idea; my guess is that it's tied to some specific request(s) being made. We are not logging inbound requests AFAIK.
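
A rough sketch of what that logging could tell us, assuming requests pass through a reverse proxy that writes a combined-format access log (the log path below is a placeholder):

```sh
# Count requests per client IP and per request path to spot bot-like patterns.
# The log path is a placeholder; adjust for wherever the access log actually lives.
LOG=/var/log/nginx/access.log

# Top client IPs
awk '{print $1}' "$LOG" | sort | uniq -c | sort -rn | head -20

# Most-requested API paths
awk '{print $7}' "$LOG" | sort | uniq -c | sort -rn | head -20
```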

cc @dalekunce

dalekunce commented 7 years ago

@kamicut @matthewhanson thanks for your work on this. Any ideas as to what is going on? It's still broken.

matthewhanson commented 7 years ago

@dalekunce I'm working on creating a new DB view that will hold the stats so they aren't calculated for each request. If the problem is due to that calculation taking too long and holding up the API, this could fix it. I'll post an update later.

matthewhanson commented 7 years ago

@dalekunce @kamicut @ianschuler

Here's what appears to be happening.

The API isn't down, just extremely slow, and the slowness varies depending on the number of requests. The API has been getting slower over time because of the endpoint that calculates stats by hashtag. When we started this, it was no big deal, but as the number of changesets grew and we opened it up to all hashtags, this DB query has become unacceptably slow.

VACUUM ANALYZE appeared to do little for our latencies (also note, the VACUUM ANALYZE schedule should be set; it's not running on all tables). Setting up a MATERIALIZED view containing the aggregates by hashtag that can be refreshed with new changesets will not work either: the initial query to build it takes too long, and evidently there's no way to selectively update rows in a MATERIALIZED view.
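
For reference, this is roughly the materialized-view approach that was ruled out; the table and column names are illustrative, not the actual schema:

```sh
# Sketch of the rejected approach: aggregate per-hashtag stats into a
# materialized view. Table/column names are illustrative only.
psql -d osmstats <<'SQL'
CREATE MATERIALIZED VIEW hashtag_stats AS
  SELECT h.hashtag,
         COUNT(c.id)           AS changesets,
         SUM(c.building_count) AS buildings,
         SUM(c.road_km)        AS road_km
  FROM hashtags h
  JOIN changesets_hashtags ch ON ch.hashtag_id = h.id
  JOIN changesets c           ON c.id = ch.changeset_id
  GROUP BY h.hashtag;

-- New changesets can only be picked up with a full refresh; individual rows
-- cannot be updated in place, which is why this approach doesn't scale for us.
REFRESH MATERIALIZED VIEW hashtag_stats;
SQL
```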

So, what needs to be done is to add stats to the hashtag table, like the user table has. When we update the metrics for a user with a new changeset, we also need to update the metrics for each of that changeset's hashtags. This happens in osm-stats-workers and offloads the expensive calculation from the API.
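
A minimal sketch of the per-changeset increment the workers would run; the column names and values are placeholders, not the actual osm-stats schema:

```sh
# Increment the aggregate stats for one hashtag when a new changeset arrives.
# Column names and values are placeholders.
HASHTAG="missingmaps"
NEW_BUILDINGS=12
NEW_ROAD_KM=3.4

psql -d osmstats <<SQL
UPDATE hashtags
SET changesets = changesets + 1,
    buildings  = buildings  + ${NEW_BUILDINGS},
    road_km    = road_km    + ${NEW_ROAD_KM}
WHERE hashtag = '${HASHTAG}';
SQL
```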

matthewhanson commented 7 years ago

Actually, I just checked on my running API tests and things are a lot more consistent now; I'm not seeing the slowdowns like I was before.

The initial query on the leaderboards still takes a while, but it appears to be coming up quicker than before, and after several refreshes I'm not seeing a timeout. I'll keep checking over the evening.

If it's somewhat stable now, it could have been due to the manual vacuum. We'll make sure we set up an appropriate vacuum schedule. However, over time the latency could get worse for the leaderboards, so I'd still recommend the changes above in osm-stats-workers.
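
One way to make that schedule stick would be per-table autovacuum settings so the busiest tables are vacuumed more aggressively; the table names and thresholds here are placeholders to tune against the real workload:

```sh
# Lower the autovacuum thresholds on the heavily-updated tables.
# Table names and scale factors are placeholders.
psql -d osmstats <<'SQL'
ALTER TABLE users    SET (autovacuum_vacuum_scale_factor = 0.05,
                          autovacuum_analyze_scale_factor = 0.05);
ALTER TABLE hashtags SET (autovacuum_vacuum_scale_factor = 0.05,
                          autovacuum_analyze_scale_factor = 0.05);
SQL
```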

cc @dalekunce @kamicut @ianschuler

matthewhanson commented 7 years ago

@dalekunce

We apologize for the extended downtime, especially this afternoon. I added some logging to the API, spent some time in the database analyzing queries to identify the most time-consuming pieces, and created indexes where appropriate. At some point during the starting, stopping, and recreating of the Docker containers, something broke and the API could no longer talk to the other containers (planet-stream, redis, forgettable). The leaderboards were brought down completely, and @kamicut and I ended up upgrading docker, docker-compose, and the configuration in order to get them all working happily together again.
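
For the record, the index work followed the usual pattern of checking a slow query's plan and then indexing the columns it filters or joins on; the query, table, and column names below are illustrative only:

```sh
# Inspect a slow query's plan, then add an index on the filtered column.
# Query, table, and column names are illustrative only.
psql -d osmstats <<'SQL'
EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM changesets WHERE user_id = 12345;

CREATE INDEX IF NOT EXISTS changesets_user_id_idx
  ON changesets (user_id);
SQL
```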

The leaderboards are now working again, and the new indexes appear to make the whole site faster as well. I don't use it regularly, so it would be good to get your experience with it, @dalekunce. Direct calls to the API seemed faster to me; getting all the missingmaps users only takes about 10 seconds. On the site, the leaderboards load in 10s or less, and a user page takes a few seconds.

cc @ianschuler

dalekunce commented 7 years ago

@matthewhanson @kamicut thanks for sorting out all the issues; who knew keeping years of OSM edits would become so big 😄. Upgrading things is a good next step for our future plans, and I'm glad you took advantage of the existing downtime to get things running correctly. Let's plan to update the real numbers in the coming weeks, either by rerunning the stream since 2014 or by updating using the Athena numbers.