rajadain closed this issue 2 years ago.
To reduce the load of NHD hi res tile requests, I wrote this script:
import mercantile
import requests

for zoom in range(19):
    # CONUS bbox via https://gist.github.com/jsundram/1251783
    for tile in mercantile.tiles(-124.7844079, 24.7433195, -66.9513812, 49.3457868, zoom):
        url = f'http://33.33.34.35/nhd_streams_hr_v1/{zoom}/{tile.x}/{tile.y}.png'
        # Requesting each tile causes the tiler to render and cache it
        requests.get(url)
        print(url)
I then configured the tiler VM to point to (and have access to) the Production tile cache bucket, by setting the `MMW_TILECACHE_BUCKET` environment variable and adding credentials for `mmw-prd` as the default in `/var/lib/mmw/.aws/credentials` inside the tiler. Then I ran the script, which ran for 1d 29m 11s before I canceled it, with the final tile pre-cached being http://33.33.34.35/nhd_streams_hr_v1/15/7385/11671.png.
This means that all of zoom levels 0-14 for the entire CONUS now have tiles pre-cached, plus a partial cache for zoom 15. In most cases, zooming in beyond level 15 doesn't result in as many stream reads, so burst balance usage should be reduced.
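For a rough sense of scale (a back-of-the-envelope estimate, not an exact tile count): a bounding box covering a fixed fraction $f$ of the Web Mercator plane holds about $f \cdot 4^{z}$ tiles at zoom $z$, so

$$\sum_{z=0}^{14} f \cdot 4^{z} \;=\; f \cdot \frac{4^{15} - 1}{3} \;\approx\; \tfrac{1}{3}\, f \cdot 4^{15},$$

i.e. all of levels 0-14 combined contain only about a third as many tiles as level 15 alone, which is consistent with the run getting canceled partway through zoom 15.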
The tile cache for this layer on production is currently ~30 GB in size:
It should be noted that even after this, we did see at least one instance of the Burst Balance going down to 0:
Looking at the logs, we see queries like this that take a long time (40+ seconds) to run: long-query.sql. This is for the Saco HUC-8:
The final output is:
| lengthkm | stream_order | slopesum |
| --- | --- | --- |
| 2.7824697770005598 | -9 | |
| 2777.4843327587737 | 1 | 174.8754808855135 |
| 1094.772595535565 | 2 | 33.29209307438515 |
| 760.0630001057596 | 3 | 9.209074630253456 |
| 284.7515363448878 | 4 | 1.5203409666684877 |
| 154.49228032277207 | 5 | 0.2766627177824275 |
| 83.42659346525942 | 6 | 0.07302188101247957 |
| 132.18007018219515 | 7 | 0.12741485218159052 |
Here's the query plan output of EXPLAIN ANALYZE:
Finalize GroupAggregate  (cost=353138.03..1375844.01 rows=11 width=18) (actual time=32163.046..55491.620 rows=8 loops=1)
  Group Key: nhdflowlinehr.stream_order
  ->  Gather Merge  (cost=353138.03..1375843.68 rows=22 width=18) (actual time=8306.590..55491.555 rows=23 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial GroupAggregate  (cost=352138.01..1374841.12 rows=11 width=18) (actual time=3116.023..35914.559 rows=8 loops=3)
              Group Key: nhdflowlinehr.stream_order
              ->  Sort  (cost=352138.01..352163.51 rows=10200 width=1027) (actual time=554.512..561.702 rows=3290 loops=3)
                    Sort Key: nhdflowlinehr.stream_order
                    Sort Method: external merge  Disk: 2888kB
                    Worker 0:  Sort Method: quicksort  Memory: 1644kB
                    Worker 1:  Sort Method: external merge  Disk: 3768kB
                    ->  Parallel Bitmap Heap Scan on nhdflowlinehr  (cost=874.15..346856.38 rows=10200 width=1027) (actual time=350.164..545.643 rows=3290 loops=3)
                          Filter: st_intersects(geom, '01060000...F2274640'::geometry)
Planning Time: 62.047 ms
JIT:
  Functions: 30
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 5.602 ms, Inlining 203.120 ms, Optimization 481.906 ms, Emission 324.416 ms, Total 1015.044 ms
Execution Time: 55494.523 ms
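One thing the plan points at is that the per-worker sorts are spilling to disk (`Sort Method: external merge`), with JIT adding about another second of overhead. A possible follow-up, sketched below, would be to raise `work_mem` for a single session and re-run the query from long-query.sql under EXPLAIN to see whether the sorts stay in memory; the 64MB value is just an arbitrary test setting, not a tuned recommendation.

```sql
-- Sketch: test whether a larger per-operation memory budget keeps the
-- sorts out of "external merge" territory. Session-scoped only; 64MB is
-- an arbitrary value to experiment with, not a recommendation.
SHOW work_mem;

SET work_mem = '64MB';
-- ...re-run the query from long-query.sql here under EXPLAIN (ANALYZE, BUFFERS)...

RESET work_mem;
```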
In this example, the inner query results in 9869 rows.
Seeing 100% CPU use in RDS:
Turning off Celery doesn't seem to help. Installed `pg_stat_statements` as recommended in https://aws.amazon.com/premiumsupport/knowledge-center/rds-aurora-postgresql-high-cpu/ to investigate queries.
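For reference, enabling it amounts to something like the sketch below; the prerequisite is that `shared_preload_libraries` includes `pg_stat_statements`, which on RDS is set through the DB parameter group and requires a reboot.

```sql
-- Sketch: enable the extension in the target database. Assumes
-- shared_preload_libraries = 'pg_stat_statements' is already set in the
-- RDS parameter group (that part needs a reboot).
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Optionally clear previously accumulated stats before measuring.
SELECT pg_stat_statements_reset();
```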
Other, non-analyze queries that take a long time are tile requests and text searches:
SELECT round(total_exec_time*1000)/1000 AS total_time,query
FROM pg_stat_statements
ORDER BY total_time DESC limit 3;
total_time | query
---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
272881959.314 | SELECT ST_AsBinary("geom") AS geom,"stream_order" FROM (SELECT geom, stream_order FROM nhdflowlinehr WHERE stream_order >= $1) as cdbq WHERE "geom" && ST_SetSRID($2::box3d, $3)
125818041.86 | SELECT id, code, name, rank, ST_X(center) AS x, ST_Y(center) AS y +
| FROM ((SELECT id, $1 AS code, name, $2 AS rank, +
| ST_Centroid(geom) as center +
| FROM boundary_huc08 +
| WHERE UPPER(name) LIKE UPPER($3) +
| LIMIT $4) UNION ALL (SELECT id, $5 AS code, name, $6 AS rank, +
| ST_Centroid(geom) as center +
| FROM boundary_huc10 +
| WHERE UPPER(name) LIKE UPPER($7) +
| LIMIT $8) UNION ALL (SELECT id, $9 AS code, name, $10 AS rank, +
| ST_Centroid(geom) as center +
| FROM boundary_huc12 +
| WHERE UPPER(name) LIKE UPPER($11) +
| LIMIT $12)) AS subquery +
| ORDER BY rank DESC, name
96767941.593 | WITH stream_intersection AS ( +
| SELECT ST_Length(ST_Transform( +
| ST_Intersection(geom, +
| ST_SetSRID(ST_GeomFromGeoJSON($1), +
| $2)), +
| $3)) AS lengthm, +
| stream_order, +
| slope +
| FROM nhdflowlinehr +
| WHERE ST_Intersects(geom, +
| ST_SetSRID(ST_GeomFromGeoJSON($4), $5))) +
| +
| SELECT SUM(lengthm) / $6 AS lengthkm, +
| stream_order, +
| SUM(lengthm * NULLIF(slope, $7)) / $8 AS slopesum +
| FROM stream_intersection +
| GROUP BY stream_order
(3 rows)
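One caveat with the above: `total_exec_time` is cumulative over all calls, so it can't distinguish a few very slow queries from many fast ones. A variation like the sketch below (same view; column names as in PostgreSQL 13+ are assumed) separates call count from per-call cost.

```sql
-- Sketch: break cumulative time into call count and per-call mean, to
-- separate "many cheap tile queries" from "a few expensive intersections".
SELECT calls,
       round(total_exec_time::numeric, 3) AS total_ms,
       round(mean_exec_time::numeric, 3)  AS mean_ms,
       query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 3;
```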
This does make it look like the stream intersections are the ones taking the longest time:
SELECT pid, age(clock_timestamp(), query_start), usename, query, state
FROM pg_stat_activity
ORDER BY query_start DESC;
Since CPU usage has been high for more than an hour now, I'm going to try and kill some of those queries.
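If they don't finish on their own, cancelling by pid from the pg_stat_activity output would look something like the sketch below; `pg_cancel_backend` is the gentler option, with `pg_terminate_backend` as a last resort. The 30-minute threshold is only an illustration.

```sql
-- Sketch: cancel (not terminate) active backends whose current query has
-- been running for a long time. The 30-minute cutoff is illustrative.
SELECT pid,
       now() - query_start AS runtime,
       pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '30 minutes';
```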
Ah those queries just finished on their own. And we see a corresponding drop in CPU use:
Anthony mentioned that upping the size or type of the RDS instance may help in the short term. We should keep that in mind if our optimizations do not help.
Since the deployment of 1.33, we're observing very high IOPS use of the RDS instance in production:
This is coincident with the site going down a lot:
And we see these messages in RDS:
Since this started after 1.33, it is highly likely it has something to do with High Res Streams, which are backed by a very large table introduced in that release. The high IOPS use may also be driven by a very high request rate, possibly from the tiler issuing a large number of requests.
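To put a number on "very large", something like the sketch below (using the `nhdflowlinehr` table name from the stream queries) would report the on-disk footprint, including indexes and TOAST.

```sql
-- Sketch: on-disk size of the High Res Streams table, including indexes
-- and TOAST, to put the IOPS pressure in context.
SELECT pg_size_pretty(pg_total_relation_size('nhdflowlinehr')) AS total_size,
       pg_size_pretty(pg_relation_size('nhdflowlinehr'))       AS heap_only;
```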
It is possible that #3424 may help. Or batch-generation of tiles. Or transferring the tiles already made for staging to production. Or it may be something else entirely.
Investigate, and if possible, fix.