GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.
3 stars 0 forks source link

Redistricting Assembly and City Council districts #38

Closed danrademacher closed 1 year ago

danrademacher commented 1 year ago

┆Issue is synchronized with this Asana task

danrademacher commented 1 year ago

Assembly

Here's the current data on carto, https://chekpeds.carto.com/tables/nyc_assembly/map, clearly clipped to shoreline: image

Looks like the data on Carto has a field with "Assembly District " concatenated onto the integer, so we'd need to do that when we make the new data.

City Council

Hmm, the current City Council data on Carto is a bit weird: https://chekpeds.carto.com/tables/nyc_city_council

Pretty weird ID here: image

New data:

danrademacher commented 1 year ago

Actually, I think those other fields don't matter so long as identifier exists as a bare integer: https://github.com/GreenInfo-Network/nyc-crash-mapper/blob/26b97a276d2f207469f428739fac80416ea6bdd6/src/constants/sql_queries.js#L27-L34 https://github.com/GreenInfo-Network/nyc-crash-mapper/blob/26b97a276d2f207469f428739fac80416ea6bdd6/src/constants/sql_queries.js#L67-L75

then the labels happen here: https://github.com/GreenInfo-Network/nyc-crash-mapper/blob/26b97a276d2f207469f428739fac80416ea6bdd6/src/constants/app_config.js#L55-L62

danrademacher commented 1 year ago

So we should be ready for this hardest checkbox:

Since that could be a long running process and we don't want to feel under pressure to get it done while we're changing live data, here's my proposal:

  1. Leave the new boundary tables in their current state with _2023 appended to the names
  2. In https://chekpeds.carto.com/tables/crashes_all_prod, create new assembly_2023 and city_council_2023 fields
  3. Target those fields for the reassignment script
  4. Test in a feature branch in frontend applications using the new _2023 geom tables and assembly_2023 and city_council_2023 fields
  5. Once we are satisfied, we merge in the changes and deploy, then rename the old tables and fields to have _2018 suffixes

that will leave things in a clear state, but also make it easy to roll back if there's some problem with the assignments.

At some later date when we're satisfied (or even next year if we want to leave it for a while), we could delete the old _2018 tables and fields.

gregallensworth commented 1 year ago

DRAFT work in progress as I suss things out

New tables: https://chekpeds.carto.com/tables/nyc_assembly_2023/public https://chekpeds.carto.com/tables/nyc_city_council_2023/public

I recall too that the field name in crashes is hardcoded to the name of the area, e.g. the city_council visualization/selection implicitly uses the city_council field from crashes, so using the field name city_council_2023 means changing that name globally throughout. So there will likely be steps I don't know yet...

Then give everything a test, and see what we think. We know that some counts already do not match due to NYC retroactively changing past data, so not sure how we'll QC this. Maybe just seeing that the same crashes still have assembly is null (assumes districts have changed internally, but not expanded; though I do see a few new islands in the visible map) and that a few have changed assembly != assembly_2023 ?

gregallensworth commented 1 year ago

Invalid geometries in the city council districts were causing tallies to match. If we ever do this again, be sure to validate/fix the geometries.

Citywide view work, then I could bring up City Council District and see them. But upon clicking some of the districts no crash points would display. Upon clearing the specific area selection, the districts would appear back on the map BUT would show visible gaps where no crashes were present, but previously were.

-- 17 were bad
SELECT COUNT(*) FROM nyc_city_council_2023 WHERE NOT ST_ISVALID(the_geom)

-- fixed
UPDATE nyc_city_council_2023 SET the_geom = ST_MAKEVALID(the_geom) WHERE NOT ST_ISVALID(the_geom)

-- fixed! and working
SELECT COUNT(*) FROM nyc_city_council_2023 WHERE NOT ST_ISVALID(the_geom)

The nyc_assembly_2023 table did not have invalidities, and seemed to work fine all along.

Diagnosis for this bug having this behavior only AFTER selecting a specific area: Changing from Citywide to a district simply adds the boundaries, but will not change the crash filter to include a spatial join until a specific area is selected. De-selecting that specific area to show boundaries again, keeps the spatial join which was not working.

gregallensworth commented 1 year ago

Code changes ultimately were confined to a single file

Ultimately, the only code changes required for the frontend to use the new districts, were the table definitions: src/constants/app_config.js cartoTables

Crashes do have fields for assembly city_council borough et al, but in fact filterByIdentifierWhereClause() is only used for intersections and boroughs. All other cases use a runtime spatial join, and there's a note that boroughs are an exception due to timing out.

So the frontend changes are super simple: a two-line patch. Arguably, the two new fields on crashes_all_prod for assembly_2023 and city_council_2023 fields is optional, since nothing is using them... except downloads for offline analysis.

gregallensworth commented 1 year ago

I have started running the updates

Turns out the initialpolygonlinkage script really was made for exactly this, and I left good documentation. Doing it in blocks of 2000 crashes at a time, this is 102000 queries

gregallensworth commented 1 year ago
-- overall counts
-- about 205K crashes never fit into any Assembly nor City Council district
-- and that has not changed much

SELECT COUNT(*) FROM crashes_all_prod WHERE city_council_2023 IS NULL
207,512

SELECT COUNT(*) FROM crashes_all_prod WHERE city_council IS NULL
211,578

SELECT COUNT(*) FROM crashes_all_prod WHERE assembly_2023 IS NULL
205,954

SELECT COUNT(*) FROM crashes_all_prod WHERE assembly IS NULL
205,540
danrademacher commented 1 year ago

Ok, based on the above, I think we're ready to deploy these changes, though I just realized that we might have to make changes in https://github.com/GreenInfo-Network/nyc-crash-mapper-chart-view as well for the new tables. I am honestly not sure if it duplicates the spatial queries from the map application is https://github.com/GreenInfo-Network/nyc-crash-mapper, which would mean code changes in both places, or not

gregallensworth commented 1 year ago

Digging into the 450ish crashes that don't fit. It seems to be edge effects, as expected. There are slight differences in the polygon datasets, as you can see below, which really do cause the crash-points to be excluded in 2023 when they were included in 2015. Judging by the Measure Line tool in QGIS, the difference between the datasets is typically under 1 meter, but this is enough when the crash was exactly on an edge.

-- exported these mismatches as GeoJSON
SELECT cartodb_id, ST_X(the_geom) AS lng, ST_Y(the_geom) AS lat FROM crashes_all_prod WHERE city_council_2023 IS NULL AND city_council IS NOT NULL
SELECT cartodb_id, ST_X(the_geom) AS lng, ST_Y(the_geom) AS lat FROM crashes_all_prod WHERE assembly_2023 IS NULL AND assembly IS NOT NULL

Overview

image

Assembly 2023

image

Assembly 2015

image

City Council 2023

image

City Council 2015

image

A Measurement Example

image

Proposal

I could do a second pass to update any crashes where city_council_2023 IS NULL and city_council IS NOT NULL but this time with a buffer of 3 meters (10 feet) on the crashes. This should mop up the majority of these 450 outliers.

gregallensworth commented 1 year ago

Several hours later, running scripts to buffer polygons by 0.00002 degrees (about 2 meters) and having CARTO repeatedly time out, or falsely report 0 hits, I gave up on the old methods. I sat down with QGIS and made up lists of cartodb_id and City Council / Assembly identifier, then made up a list of SQL statements to run.

Eventually, we now have only 1 record where assembly_2023 and city_council_2023 are null but assembly and city_council are null. And that's because CARTO will time out updating this one specific record.

-- this one row will not let me update it; CARTO SQL API and web UI time out
UPDATE crashes_all_prod SET city_council_2023 = 32 WHERE cartodb_id=1209236
UPDATE crashes_all_prod SET assembly_2023 = 23 WHERE cartodb_id=1209236

Update: it eventually let me update 1209236 a half-hour later

So, ~not counting this 1 record~, the new assignments are done and the fields & tables could be swapped into place. See the "then some day, swap them" checklist item above.

gregallensworth commented 1 year ago

All set! I have renamed the columns and the fields, and the mapper and chart view are using the new districts.

danrademacher commented 1 year ago

With #39 resolved this is done

danrademacher commented 1 year ago

Also for the record, I suggested we do Senate and she said she doesn't care about those. Also was reminded that 2022 NYS redistricting was a mess so possibly senate reverted to old lines.