GreenInfo-Network / nyc-crash-mapper

React Redux web application for viewing, filtering, & comparing trends of automobile collision data published by the NYPD.
http://www.crashmapper.org
MIT License
16 stars 4 forks source link

Add new boundary layers for leg districts #96

Closed danrademacher closed 5 years ago

danrademacher commented 5 years ago

Follow on from as-yet unfunded Vision Zero work (eg #94), they are looking to add state senate and assembly districts as boundary types.

Looks like NYC open data has the boundaries we need: https://data.cityofnewyork.us/City-Government/State-Assembly-Districts/pf5b-73bw https://data.cityofnewyork.us/City-Government/State-Senate-Districts/h4i2-acfi

Current pattern is to load these into CARTO like this: https://chekpeds.carto.com/tables/nyc_city_council

and then looks like a few places where new boundaries would need to be set up: https://github.com/search?q=org%3AGreenInfo-Network+nyc_city_council&type=Code

A few details would be determining naming convention and any constraints around adding to the lists/UIs in various spots

Looks like as have integer fields for assem_dist and st_sen_dist. Probably want to preprend these with "Assembly District " and "Senate District " respectively.

Does this need to become part of the ETL script assignment process?

danrademacher commented 5 years ago

OK, we have a budget for this, CHEKPEDS:Crashmapper LegDistricts in tracking. Could be fill in work before Thanksgiving. I think this will actually be relatively straightforward. (Famous last words!)

gregallensworth commented 5 years ago

Part 1: Data Imported

This step is easily done. Both tables are publicly visible:

https://chekpeds.carto.com/tables/nyc_senate https://chekpeds.carto.com/tables/nyc_assembly

Like the nyc_city_council areas, the Senate and Assembly do not fall into single Borough areas, so are not tagged with a borough field.

Part 2: Initial Linking of Crashes to Areas

Using CARTO's UI, I created two new columns on crashes_prod_all -- the assembly and senate columns, both integer to match the identifier field in those tables.

Populating these proved to be problematic, due to CARTO's platform limits: queries taking more than 5 seconds are terminated. As such the ETL script's UPDATE FROM WHERE technique to do them all at once does not work, nor does a simple UPDATE WHERE INTERSECTS() with a single district selected. I have to do something really goofy to do it in tiny chunks, like this doing it one tenth of one district at a time:

UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 0
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 1
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 2
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 3
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 4
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 5
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 6
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 7
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 8
UPDATE crashes_all_prod SET assembly=24 WHERE assembly is null and ST_INTERSECTS(the_geom, (SELECT the_geom FROM nyc_assembly WHERE identifier=24)) AND cartodb_id % 10 = 9

I wrote a program to do this in a loop, with a few seconds of sleep time in between. Pasting this into CARTO's UI one at a time is over 1300 paste operations, and is also error prone.

Results. Out of the 1.4M crash rows:

gregallensworth commented 5 years ago

Client-Side: Map View

Preliminary lists of likely places to change, thanks to searching in Github: https://github.com/GreenInfo-Network/nyc-crash-mapper/search?q=city_council&unscoped_q=city_council https://github.com/GreenInfo-Network/nyc-crash-mapper/search?q=nyc_city_council&unscoped_q=nyc_city_council

Fortunately this looked to really be it! Between the two searches, and a little copy-and-paste, I have the districts showing on the map, clickable to filter properly, and displaying with nice labels as we want.

Great!

Screenshots as of commit 21eaefa

image

image

image

gregallensworth commented 5 years ago

Part 3: ETL Script

The ETL script has functions update_neighborhood() et al, which are all the same UPDATE SET WHERE structure. For NYPD Precincts, Borough, Community, etc. these all run in about 1 second. This is despite the fact that some 100k-200k rows fall outside of any area and are thus null.

Assembly and Senate have the expected performance: After 5 seconds, a timeout.

This executes in 1 second:

UPDATE crashes_all_prod
SET city_council = a.identifier
FROM nyc_city_council a
WHERE crashes_all_prod.the_geom IS NOT NULL AND ST_Within(crashes_all_prod.the_geom, a.the_geom)
AND (crashes_all_prod.city_council IS NULL)

This times out after 5 seconds:

UPDATE crashes_all_prod
SET senate = a.identifier
FROM nyc_senate a
WHERE crashes_all_prod.the_geom IS NOT NULL AND ST_Within(crashes_all_prod.the_geom, a.the_geom)
AND (crashes_all_prod.senate IS NULL)

Using EXPLAIN sheds some light on why it's slower, but not on why CARTO/PostgreSQL/PostGIS are choosing a super-slow strategy:

-- FAST query for nypd_precinct et al
Update on crashes_all_prod  (cost=0.14..351969.43 rows=8563 width=386)
  ->  Nested Loop  (cost=0.14..351969.43 rows=8563 width=386)
        ->  Seq Scan on crashes_all_prod  (cost=0.00..99236.23 rows=174755 width=364)
              Filter: ((the_geom IS NOT NULL) AND (nypd_precinct IS NULL))
        ->  Index Scan using cartodb_query_the_geom_idx on nyc_nypd_precinct a  (cost=0.14..1.44 rows=1 width=1108)
              Index Cond: (the_geom ~ crashes_all_prod.the_geom)
              Filter: _st_contains(the_geom, crashes_all_prod.the_geom)
-- SLOW query for Assembly and Senate
Update on crashes_all_prod  (cost=56.26..323460.12 rows=3663 width=386)
  ->  Nested Loop  (cost=56.26..323460.12 rows=3663 width=386)
        ->  Seq Scan on nyc_assembly a  (cost=0.00..29.65 rows=65 width=23883)
        ->  Bitmap Heap Scan on crashes_all_prod  (cost=56.26..4975.15 rows=56 width=368)
              Recheck Cond: ((a.the_geom ~ the_geom) AND (the_geom IS NOT NULL))
              Filter: ((assembly IS NULL) AND _st_contains(a.the_geom, the_geom))
              ->  Bitmap Index Scan on etl_test_copy_the_geom_idx  (cost=0.00..56.25 rows=1313 width=0)
                    Index Cond: ((a.the_geom ~ the_geom) AND (the_geom IS NOT NULL))

In the version that existing tables use, the scan is done for "geom not null but location is" which eliminates 99.9% of records in step 1. For the new tables, the query planner is looping over all polygons and looking for records which "location is not null and is inside here" then filtering "location is null" Why?

gregallensworth commented 5 years ago

Client-Side: Chart App

Preliminary list of likely code changes, from Github search: https://github.com/GreenInfo-Network/nyc-crash-mapper-chart-view/search?q=city_council&unscoped_q=city_council https://github.com/GreenInfo-Network/nyc-crash-mapper-chart-view/search?q=city_council&unscoped_q=nyc_city_council

And not too bad:

image

image

image

Still have to iron out some wording issues in the "Search a assembly" text filter. It gets this right for "city council" and I seem to recall some hardcoded behaviors about how the word is generated from the layer name...

gregallensworth commented 5 years ago

Chart View, Area names in search filter box

Here we go, as of commit c30e7b7

image

image

gregallensworth commented 5 years ago

Checklist:

gregallensworth commented 5 years ago

The next day (16 hours later, some 20 hours after creating the tables and columns) the same thing s still happening. The query planner is giving very different and far less efficient results for these 2 new tables and columns, than it does for the previously-existing ones. There are no indexes shown in the pg_indexes table beyond the 3 standard ones for every table. There may be an index out of date that a ANALYZE could help, but we can't do those through the UI.

I have emailed support@carto.com since CHEKPEDS has a paid account, and we'll see if they have insight from their side.

Email as follows:

Hello, my name is Greg Allensworth and I'm working with CHEKPEDS. We have a paid account, and I'm hoping you can help me with optimizing a query. Specifically, the query itself is identical to other we have written but the query planner is giving me some strange results.

The following query executes in about 1 second, and we get similar times for some 5 other queries identical to this one for NYPD precincts, boroughs, neighborhoods, ...

UPDATE crashes_all_prod SET city_council = a.identifier FROM nyc_city_council a WHERE crashes_all_prod.the_geom IS NOT NULL AND ST_Within(crashes_all_prod.the_geom, a.the_geom) AND (crashes_all_prod.city_council IS NULL)

However, I have two new tables nyc_senate and nyc_assembly, and that same query on either of these will time out after 3 seconds:

UPDATE crashes_all_prod SET senate = a.identifier FROM nyc_senate a WHERE crashes_all_prod.the_geom IS NOT NULL AND ST_Within(crashes_all_prod.the_geom, a.the_geom) AND (crashes_all_prod.senate IS NULL)

General record counts etc. for the "nyc_senate" and "nyc_city_council" are not too dissimilar, and tallies for crashes_all_prod with geom NOT null and field IS null are similar. Also, there is little-to-no variation in the timing of the 5 other tables where we do this same query for precincts, neighborhoods, boroughs, etc. It is just these 2 new tables.

The EXPLAIN output is puzzling, showing that these two new tables get a very different query plan.

The fast query has very sensible planner output. The first pass is a scan over crashes_all_prod where geom is NOT null and precinct IS null; this excludes some 99% of records. It then uses the spatial index to relate point to polygon. Great.

-- FAST query for nypd_precinct et al Update on crashes_all_prod (cost=0.14..351969.43 rows=8563 width=386) -> Nested Loop (cost=0.14..351969.43 rows=8563 width=386) -> Seq Scan on crashes_all_prod (cost=0.00..99236.23 rows=174755 width=364) Filter: ((the_geom IS NOT NULL) AND (nypd_precinct IS NULL)) -> Index Scan using cartodb_query_the_geom_idx on nyc_nypd_precinct a (cost=0.14..1.44 rows=1 width=1108) Index Cond: (the_geom ~ crashes_all_prod.the_geom) Filter: _st_contains(the_geom, crashes_all_prod.the_geom)

This new query, though, very different output. It loops over polygons, does the geometry null in the first pass along with the bbox, and so on. Clearly less efficient.

-- SLOW query for Assembly and Senate Update on crashes_all_prod (cost=56.26..323460.12 rows=3663 width=386) -> Nested Loop (cost=56.26..323460.12 rows=3663 width=386) -> Seq Scan on nyc_assembly a (cost=0.00..29.65 rows=65 width=23883) -> Bitmap Heap Scan on crashes_all_prod (cost=56.26..4975.15 rows=56 width=368) Recheck Cond: ((a.the_geom ~ the_geom) AND (the_geom IS NOT NULL)) Filter: ((assembly IS NULL) AND _st_contains(a.the_geom, the_geom)) -> Bitmap Index Scan on etl_test_copy_the_geom_idx (cost=0.00..56.25 rows=1313 width=0) Index Cond: ((a.the_geom ~ the_geom) AND (the_geom IS NOT NULL))

Do you have any insight from your end, as to why the query planner would behave so differently for these two new tables? Their structure is identical to these other tables. I checked the pg_indexes table as well, and all tables have just CARTO's 3 standard indexes. Could it be because these tables are new, or that the "assembly" and "senate" columns in crashes_all_prod are new? Would a VACUUM or ANALYZE help?

danrademacher commented 5 years ago

Hmm. I just did a local checkout and run of this code and it seems to be working fine for me in terms of speed -- about the same as other geom type queries. Maybe whatever database weirdness was happening has resolved itself through some maintenance process we don't see?

But I did find a new-to-me bug that I see now is more generally about selected geog ID mismatches between map and chart app. Will file that separately.

gregallensworth commented 5 years ago

New ETL script is in place. After 2 weeks of email exchanges with CARTO, they couldn't figure out why the query planner was going the way it was, thus why it wasn't able to run in <2 seconds like the others. So I gave up and instead have it run in 10 "slices" using AND cartodb_id % 10 = X so Senate and Assembly are now 10 queries at 0.8 sec apiece instead of 1 query that times out at 10 seconds.

About 1 hour from now, the ETL should run and https://dashboard.heroku.com/apps/nyc-crash-mapper-etl/logs should indicate successful update of these new senate and assembly fields, in 20 reasonably-timed queries.