Closed danrademacher closed 5 years ago
OK, we have a budget for this, CHEKPEDS:Crashmapper LegDistricts
in tracking. Could be fill in work before Thanksgiving. I think this will actually be relatively straightforward. (Famous last words!)
This step is easily done. Both tables are publicly visible:
https://chekpeds.carto.com/tables/nyc_senate https://chekpeds.carto.com/tables/nyc_assembly
identifier
which is effectively the name and id
which is text that is not used.identifier
fields as string, so I had to change those in CARTO's table UI.Like the nyc_city_council
areas, the Senate and Assembly do not fall into single Borough areas, so are not tagged with a borough
field.
Using CARTO's UI, I created two new columns on crashes_prod_all
-- the assembly
and senate
columns, both integer to match the identifier
field in those tables.
Populating these proved to be problematic, due to CARTO's platform limits: queries taking more than 5 seconds are terminated. As such the ETL script's UPDATE FROM WHERE
technique to do them all at once does not work, nor does a simple UPDATE WHERE INTERSECTS()
with a single district selected. I have to do something really goofy to do it in tiny chunks, like this doing it one tenth of one district at a time:
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 0
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 1
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 2
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 3
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 4
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 5
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 6
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 7
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 8
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 9
I wrote a program to do this in a loop, with a few seconds of sleep time in between. Pasting this into CARTO's UI one at a time is over 1300 paste operations, and is also error prone.
Results. Out of the 1.4M crash rows:
the_geom
Preliminary lists of likely places to change, thanks to searching in Github: https://github.com/GreenInfo-Network/nyc-crash-mapper/search?q=city_council&unscoped_q=city_council https://github.com/GreenInfo-Network/nyc-crash-mapper/search?q=nyc_city_council&unscoped_q=nyc_city_council
Fortunately this looked to really be it! Between the two searches, and a little copy-and-paste, I have the districts showing on the map, clickable to filter properly, and displaying with nice labels as we want.
Great!
Screenshots as of commit 21eaefa
The ETL script has functions update_neighborhood()
et al, which are all the same UPDATE SET WHERE
structure. For NYPD Precincts, Borough, Community, etc. these all run in about 1 second. This is despite the fact that some 100k-200k rows fall outside of any area and are thus null.
Assembly and Senate have the expected performance: After 5 seconds, a timeout.
This executes in 1 second:
UPDATE crashes_all_prod
SET city_council = a.identifier
FROM nyc_city_council a
WHERE crashes_all_prod.the_geom IS NOT NULL AND ST_Within(crashes_all_prod.the_geom, a.the_geom)
AND (crashes_all_prod.city_council IS NULL)
This times out after 5 seconds:
UPDATE crashes_all_prod
SET senate = a.identifier
FROM nyc_senate a
WHERE crashes_all_prod.the_geom IS NOT NULL AND ST_Within(crashes_all_prod.the_geom, a.the_geom)
AND (crashes_all_prod.senate IS NULL)
Using EXPLAIN
sheds some light on why it's slower, but not on why CARTO/PostgreSQL/PostGIS are choosing a super-slow strategy:
-- FAST query for nypd_precinct et al
Update on crashes_all_prod (cost=0.14..351969.43 rows=8563 width=386)
-> Nested Loop (cost=0.14..351969.43 rows=8563 width=386)
-> Seq Scan on crashes_all_prod (cost=0.00..99236.23 rows=174755 width=364)
Filter: ((the_geom IS NOT NULL) AND (nypd_precinct IS NULL))
-> Index Scan using cartodb_query_the_geom_idx on nyc_nypd_precinct a (cost=0.14..1.44 rows=1 width=1108)
Index Cond: (the_geom ~ crashes_all_prod.the_geom)
Filter: _st_contains(the_geom, crashes_all_prod.the_geom)
-- SLOW query for Assembly and Senate
Update on crashes_all_prod (cost=56.26..323460.12 rows=3663 width=386)
-> Nested Loop (cost=56.26..323460.12 rows=3663 width=386)
-> Seq Scan on nyc_assembly a (cost=0.00..29.65 rows=65 width=23883)
-> Bitmap Heap Scan on crashes_all_prod (cost=56.26..4975.15 rows=56 width=368)
Recheck Cond: ((a.the_geom ~ the_geom) AND (the_geom IS NOT NULL))
Filter: ((assembly IS NULL) AND _st_contains(a.the_geom, the_geom))
-> Bitmap Index Scan on etl_test_copy_the_geom_idx (cost=0.00..56.25 rows=1313 width=0)
Index Cond: ((a.the_geom ~ the_geom) AND (the_geom IS NOT NULL))
In the version that existing tables use, the scan is done for "geom not null but location is" which eliminates 99.9% of records in step 1. For the new tables, the query planner is looping over all polygons and looking for records which "location is not null and is inside here" then filtering "location is null" Why?
Checked over all tables, and no indexes exist beyond the perfectly ordinary: all tables have cartodb_id
the_geom
and the_geom_webmercator
and that's it. Precincts and Boroughs, etc. do not have any sort of indexing, nor does crashes_all_prod
have special indexing beyond these defaults. So it's not something as simple as an index.
Checked with the the_geom IS NOT NULL
clauses, and found that Senate and Assembly both have 5582 records where the_geom
is not null but the senate
and assembly
fields are null. Contrast this with NYPD Precinct where 10657 records have a non-null geometry and fell outside areas. So it's not sheer number of records.
Preliminary list of likely code changes, from Github search: https://github.com/GreenInfo-Network/nyc-crash-mapper-chart-view/search?q=city_council&unscoped_q=city_council https://github.com/GreenInfo-Network/nyc-crash-mapper-chart-view/search?q=city_council&unscoped_q=nyc_city_council
And not too bad:
Still have to iron out some wording issues in the "Search a assembly" text filter. It gets this right for "city council" and I seem to recall some hardcoded behaviors about how the word is generated from the layer name...
Here we go, as of commit c30e7b7
Checklist:
[x] Map view: areas toggle, hover tooltip, click, filter, display title
[x] Chart view: Trend, Compare, Rank, ... more?
[x] ETL script
The next day (16 hours later, some 20 hours after creating the tables and columns) the same thing s still happening. The query planner is giving very different and far less efficient results for these 2 new tables and columns, than it does for the previously-existing ones. There are no indexes shown in the pg_indexes
table beyond the 3 standard ones for every table. There may be an index out of date that a ANALYZE
could help, but we can't do those through the UI.
I have emailed support@carto.com since CHEKPEDS has a paid account, and we'll see if they have insight from their side.
Email as follows:
Hello, my name is Greg Allensworth and I'm working with CHEKPEDS. We have a paid account, and I'm hoping you can help me with optimizing a query. Specifically, the query itself is identical to other we have written but the query planner is giving me some strange results.
The following query executes in about 1 second, and we get similar times for some 5 other queries identical to this one for NYPD precincts, boroughs, neighborhoods, ...
UPDATE crashes_all_prod SET city_council = a.identifier FROM nyc_city_council a WHERE crashes_all_prod.the_geom IS NOT NULL AND ST_Within(crashes_all_prod.the_geom, a.the_geom) AND (crashes_all_prod.city_council IS NULL)
However, I have two new tables nyc_senate and nyc_assembly, and that same query on either of these will time out after 3 seconds:
UPDATE crashes_all_prod SET senate = a.identifier FROM nyc_senate a WHERE crashes_all_prod.the_geom IS NOT NULL AND ST_Within(crashes_all_prod.the_geom, a.the_geom) AND (crashes_all_prod.senate IS NULL)
General record counts etc. for the "nyc_senate" and "nyc_city_council" are not too dissimilar, and tallies for crashes_all_prod with geom NOT null and field IS null are similar. Also, there is little-to-no variation in the timing of the 5 other tables where we do this same query for precincts, neighborhoods, boroughs, etc. It is just these 2 new tables.
The EXPLAIN output is puzzling, showing that these two new tables get a very different query plan.
The fast query has very sensible planner output. The first pass is a scan over crashes_all_prod where geom is NOT null and precinct IS null; this excludes some 99% of records. It then uses the spatial index to relate point to polygon. Great.
-- FAST query for nypd_precinct et al Update on crashes_all_prod (cost=0.14..351969.43 rows=8563 width=386) -> Nested Loop (cost=0.14..351969.43 rows=8563 width=386) -> Seq Scan on crashes_all_prod (cost=0.00..99236.23 rows=174755 width=364) Filter: ((the_geom IS NOT NULL) AND (nypd_precinct IS NULL)) -> Index Scan using cartodb_query_the_geom_idx on nyc_nypd_precinct a (cost=0.14..1.44 rows=1 width=1108) Index Cond: (the_geom ~ crashes_all_prod.the_geom) Filter: _st_contains(the_geom, crashes_all_prod.the_geom)
This new query, though, very different output. It loops over polygons, does the geometry null in the first pass along with the bbox, and so on. Clearly less efficient.
-- SLOW query for Assembly and Senate Update on crashes_all_prod (cost=56.26..323460.12 rows=3663 width=386) -> Nested Loop (cost=56.26..323460.12 rows=3663 width=386) -> Seq Scan on nyc_assembly a (cost=0.00..29.65 rows=65 width=23883) -> Bitmap Heap Scan on crashes_all_prod (cost=56.26..4975.15 rows=56 width=368) Recheck Cond: ((a.the_geom ~ the_geom) AND (the_geom IS NOT NULL)) Filter: ((assembly IS NULL) AND _st_contains(a.the_geom, the_geom)) -> Bitmap Index Scan on etl_test_copy_the_geom_idx (cost=0.00..56.25 rows=1313 width=0) Index Cond: ((a.the_geom ~ the_geom) AND (the_geom IS NOT NULL))
Do you have any insight from your end, as to why the query planner would behave so differently for these two new tables? Their structure is identical to these other tables. I checked the pg_indexes table as well, and all tables have just CARTO's 3 standard indexes. Could it be because these tables are new, or that the "assembly" and "senate" columns in crashes_all_prod are new? Would a VACUUM or ANALYZE help?
Hmm. I just did a local checkout and run of this code and it seems to be working fine for me in terms of speed -- about the same as other geom type queries. Maybe whatever database weirdness was happening has resolved itself through some maintenance process we don't see?
But I did find a new-to-me bug that I see now is more generally about selected geog ID mismatches between map and chart app. Will file that separately.
New ETL script is in place. After 2 weeks of email exchanges with CARTO, they couldn't figure out why the query planner was going the way it was, thus why it wasn't able to run in <2 seconds like the others. So I gave up and instead have it run in 10 "slices" using AND cartodb_id % 10 = X
so Senate and Assembly are now 10 queries at 0.8 sec apiece instead of 1 query that times out at 10 seconds.
About 1 hour from now, the ETL should run and https://dashboard.heroku.com/apps/nyc-crash-mapper-etl/logs should indicate successful update of these new senate and assembly fields, in 20 reasonably-timed queries.
Follow on from as-yet unfunded Vision Zero work (eg #94), they are looking to add state senate and assembly districts as boundary types.
Looks like NYC open data has the boundaries we need: https://data.cityofnewyork.us/City-Government/State-Assembly-Districts/pf5b-73bw https://data.cityofnewyork.us/City-Government/State-Senate-Districts/h4i2-acfi
Current pattern is to load these into CARTO like this: https://chekpeds.carto.com/tables/nyc_city_council
and then looks like a few places where new boundaries would need to be set up: https://github.com/search?q=org%3AGreenInfo-Network+nyc_city_council&type=Code
A few details would be determining naming convention and any constraints around adding to the lists/UIs in various spots
Looks like as have integer fields for
assem_dist
andst_sen_dist
. Probably want to preprend these with "Assembly District " and "Senate District " respectively.Does this need to become part of the ETL script assignment process?