rajadain closed this issue 2 years ago.
To reduce the load of NHD hi res tile requests, I wrote this script:
import mercantile
import requests

for zoom in range(19):
    # CONUS bbox via https://gist.github.com/jsundram/1251783
    for tile in mercantile.tiles(-124.7844079, 24.7433195, -66.9513812, 49.3457868, zoom):
        url = f'http://33.33.34.35/nhd_streams_hr_v1/{zoom}/{tile.x}/{tile.y}.png'
        # Requesting each tile causes the tiler to render and cache it
        requests.get(url)
        print(url)
I then configured the tiler VM to point to (and have access to) the Production tile cache bucket, by setting the `MMW_TILECACHE_BUCKET` environment variable and adding credentials for `mmw-prd` as the default in `/var/lib/mmw/.aws/credentials` inside the tiler. Then I ran the script, which ran for 1d 29m 11s before I canceled it, with the final tile pre-cached being http://33.33.34.35/nhd_streams_hr_v1/15/7385/11671.png.
This means that all of zoom levels 0-14 for the entire CONUS now have tiles pre-cached, plus a partial cache for zoom 15. In most cases, zooming in beyond level 15 doesn't result in as many stream reads, so burst balance usage should be reduced.
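For a rough sense of scale (a back-of-the-envelope estimate, not an exact tile count): a bounding box covering a fixed fraction $f$ of the Web Mercator plane holds about $f \cdot 4^{z}$ tiles at zoom $z$, so

$$\sum_{z=0}^{14} f \cdot 4^{z} \;=\; f \cdot \frac{4^{15} - 1}{3} \;\approx\; \tfrac{1}{3}\, f \cdot 4^{15},$$

i.e. all of levels 0-14 combined contain only about a third as many tiles as level 15 alone, which is consistent with the run getting canceled partway through zoom 15.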
The tile cache for this layer on production is currently ~30 GB in size:
It should be noted that even after this, we did see at least one instance of the Burst Balance going down to 0:
Looking at the logs, we see queries like this that take a long time (40+ seconds) to run: long-query.sql. This is for the Saco HUC-8:
The final output is:
| lengthkm | stream_order | slopesum |
| --- | --- | --- |
| 2.7824697770005598 | -9 | |
| 2777.4843327587737 | 1 | 174.8754808855135 |
| 1094.772595535565 | 2 | 33.29209307438515 |
| 760.0630001057596 | 3 | 9.209074630253456 |
| 284.7515363448878 | 4 | 1.5203409666684877 |
| 154.49228032277207 | 5 | 0.2766627177824275 |
| 83.42659346525942 | 6 | 0.07302188101247957 |
| 132.18007018219515 | 7 | 0.12741485218159052 |
Here's the query plan output of EXPLAIN ANALYZE:
Finalize GroupAggregate  (cost=353138.03..1375844.01 rows=11 width=18) (actual time=32163.046..55491.620 rows=8 loops=1)
  Group Key: nhdflowlinehr.stream_order
  ->  Gather Merge  (cost=353138.03..1375843.68 rows=22 width=18) (actual time=8306.590..55491.555 rows=23 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial GroupAggregate  (cost=352138.01..1374841.12 rows=11 width=18) (actual time=3116.023..35914.559 rows=8 loops=3)
              Group Key: nhdflowlinehr.stream_order
              ->  Sort  (cost=352138.01..352163.51 rows=10200 width=1027) (actual time=554.512..561.702 rows=3290 loops=3)
                    Sort Key: nhdflowlinehr.stream_order
                    Sort Method: external merge  Disk: 2888kB
                    Worker 0:  Sort Method: quicksort  Memory: 1644kB
                    Worker 1:  Sort Method: external merge  Disk: 3768kB
                    ->  Parallel Bitmap Heap Scan on nhdflowlinehr  (cost=874.15..346856.38 rows=10200 width=1027) (actual time=350.164..545.643 rows=3290 loops=3)
                          Filter: st_intersects(geom, '01060000...F2274640'::geometry)
Planning Time: 62.047 ms
JIT:
  Functions: 30
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 5.602 ms, Inlining 203.120 ms, Optimization 481.906 ms, Emission 324.416 ms, Total 1015.044 ms
Execution Time: 55494.523 ms
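One thing the plan points at is that the per-worker sorts are spilling to disk (`Sort Method: external merge`), with JIT adding about another second of overhead. A possible follow-up, sketched below, would be to raise `work_mem` for a single session and re-run the query from long-query.sql under EXPLAIN to see whether the sorts stay in memory; the 64MB value is just an arbitrary test setting, not a tuned recommendation.

```sql
-- Sketch: test whether a larger per-operation memory budget keeps the
-- sorts out of "external merge" territory. Session-scoped only; 64MB is
-- an arbitrary value to experiment with, not a recommendation.
SHOW work_mem;

SET work_mem = '64MB';
-- ...re-run the query from long-query.sql here under EXPLAIN (ANALYZE, BUFFERS)...

RESET work_mem;
```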
In this example, the inner query results in 9869 rows.
Seeing 100% CPU use in RDS:
Turning off Celery doesn't seem to help. Installed `pg_stat_statements` as recommended in https://aws.amazon.com/premiumsupport/knowledge-center/rds-aurora-postgresql-high-cpu/ to investigate queries.
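For reference, enabling it amounts to something like the sketch below; the prerequisite is that `shared_preload_libraries` includes `pg_stat_statements`, which on RDS is set through the DB parameter group and requires a reboot.

```sql
-- Sketch: enable the extension in the target database. Assumes
-- shared_preload_libraries = 'pg_stat_statements' is already set in the
-- RDS parameter group (that part needs a reboot).
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Optionally clear previously accumulated stats before measuring.
SELECT pg_stat_statements_reset();
```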
Other, non-analyze queries that take a long time are tile requests and text searches:
SELECT round(total_exec_time*1000)/1000 AS total_time,query
FROM pg_stat_statements
ORDER BY total_time DESC limit 3;
total_time | query
---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
272881959.314 | SELECT ST_AsBinary("geom") AS geom,"stream_order" FROM (SELECT geom, stream_order FROM nhdflowlinehr WHERE stream_order >= $1) as cdbq WHERE "geom" && ST_SetSRID($2::box3d, $3)
125818041.86 | SELECT id, code, name, rank, ST_X(center) AS x, ST_Y(center) AS y +
| FROM ((SELECT id, $1 AS code, name, $2 AS rank, +
| ST_Centroid(geom) as center +
| FROM boundary_huc08 +
| WHERE UPPER(name) LIKE UPPER($3) +
| LIMIT $4) UNION ALL (SELECT id, $5 AS code, name, $6 AS rank, +
| ST_Centroid(geom) as center +
| FROM boundary_huc10 +
| WHERE UPPER(name) LIKE UPPER($7) +
| LIMIT $8) UNION ALL (SELECT id, $9 AS code, name, $10 AS rank, +
| ST_Centroid(geom) as center +
| FROM boundary_huc12 +
| WHERE UPPER(name) LIKE UPPER($11) +
| LIMIT $12)) AS subquery +
| ORDER BY rank DESC, name
96767941.593 | WITH stream_intersection AS ( +
| SELECT ST_Length(ST_Transform( +
| ST_Intersection(geom, +
| ST_SetSRID(ST_GeomFromGeoJSON($1), +
| $2)), +
| $3)) AS lengthm, +
| stream_order, +
| slope +
| FROM nhdflowlinehr +
| WHERE ST_Intersects(geom, +
| ST_SetSRID(ST_GeomFromGeoJSON($4), $5))) +
| +
| SELECT SUM(lengthm) / $6 AS lengthkm, +
| stream_order, +
| SUM(lengthm * NULLIF(slope, $7)) / $8 AS slopesum +
| FROM stream_intersection +
| GROUP BY stream_order
(3 rows)
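One caveat with the above: `total_exec_time` is cumulative over all calls, so it can't distinguish a few very slow queries from many fast ones. A variation like the sketch below (same view; column names as in PostgreSQL 13+ are assumed) separates call count from per-call cost.

```sql
-- Sketch: break cumulative time into call count and per-call mean, to
-- separate "many cheap tile queries" from "a few expensive intersections".
SELECT calls,
       round(total_exec_time::numeric, 3) AS total_ms,
       round(mean_exec_time::numeric, 3)  AS mean_ms,
       query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 3;
```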
This does make it look like the stream intersections are the ones taking the longest time:
SELECT pid, age(clock_timestamp(), query_start), usename, query, state
FROM pg_stat_activity
ORDER BY query_start DESC;
Since CPU usage has been high for more than an hour now, I'm going to try and kill some of those queries.
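If they don't finish on their own, cancelling by pid from the pg_stat_activity output would look something like the sketch below; `pg_cancel_backend` is the gentler option, with `pg_terminate_backend` as a last resort. The 30-minute threshold is only an illustration.

```sql
-- Sketch: cancel (not terminate) active backends whose current query has
-- been running for a long time. The 30-minute cutoff is illustrative.
SELECT pid,
       now() - query_start AS runtime,
       pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '30 minutes';
```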
Ah those queries just finished on their own. And we see a corresponding drop in CPU use:
Anthony mentioned that upping the size or type of the RDS instance may help in the short term. We should keep that in mind if our optimizations do not help.
Since the deployment of 1.33, we're observing very high IOPS use of the RDS instance in production:
This is coincident with the site going down a lot:
And we see these messages in RDS:
Since this started after 1.33, it is highly likely it has something to do with High Res Streams, which are backed by a very large table introduced in that release. The high IOPS use may also be driven by a very high request rate, possibly from the tiler issuing a large number of requests.
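To put a number on "very large", something like the sketch below (using the `nhdflowlinehr` table name from the stream queries) would report the on-disk footprint, including indexes and TOAST.

```sql
-- Sketch: on-disk size of the High Res Streams table, including indexes
-- and TOAST, to put the IOPS pressure in context.
SELECT pg_size_pretty(pg_total_relation_size('nhdflowlinehr')) AS total_size,
       pg_size_pretty(pg_relation_size('nhdflowlinehr'))       AS heap_only;
```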
It is possible that #3424 may help. Or batch-generation of tiles. Or transferring the tiles already made for staging to production. Or it may be something else entirely.
Investigate, and if possible, fix.