GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.
3 stars 0 forks source link

Crashes Missing from Crashmapper - CB107 #34

Closed sync-by-unito[bot] closed 2 years ago

sync-by-unito[bot] commented 3 years ago

Forwarded by Christine:

Greetings -

I have been reviewing crash data for Manhattan CB7 and notice that a large number of crashes aren't showing up when I filter for Community Board 107.

Of the crashes I downloaded from the Socrata NYC open data portal (https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/data https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/data) filtered for CB7 by Lat/Long https://data.cityofnewyork.us/NYC-BigApps/UWS-Latitudes-and-Longitudes/ujp2-3ypa and then manually deleting crashes that are outside the district, 20,766 of 25,312 are in Crashmapper and 4,546 are not. This includes 3 fatalities that are not in Crashmapper (COLLISION_ID 3959009, 3562386, and 4354569).

Here is my data in Google Sheets https://docs.google.com/spreadsheets/d/1T8I6XXOfrRo_xG6kCW-XTtUOAz4gc06bmOU8Xh1QEgI/edit#gid=503430291 with tabs for the data I downloaded from Crashmapper (one download with all categories except no injuries, and another with only no injuries), and from Socrata. Note that I used vlookup in column B of the "UWS_Latitudes_and_Longitudes" tab to see if the COLLISION_ID number in column Z of that tab appeared in column C (socrata_id) of the "Crashmapper CB7 Data" tab. (The data for "Crashmapper CB7 Data" was from an export of Crashmapper filtered for CB7 https://crashmapper.org/#/?cfat=false&cinj=false&endDate=2020-09&geo=community_board&identifier=107&lat=40.78535069903749&lng=-73.95360946655273&lngLats=%255B%255D&mfat=false&minj=false&noInjFat=true&pfat=false&pinj=false&startDate=2020-09&vbicycle=true&vbusvan=true&vcar=true&vmotorcycle=true&vother=true&vscooter=true&vsuv=true&vtruck=true&zoom=13.) I then changed any crash that showed a match to "In Crashmapper" and the others to "Not in Crashmapper." As the pivot table shows:

Year In Crashmapper Not in Crashmapper 2012 1411 280 2013 2592 546 2014 2523 562 2015 2725 540 2016 1724 1140 2017 2862 441 2018 2749 388 2019 2341 318 2020 1015 149 2021 824 182 Grand Total 20766 4546

Would you know why these crashes aren't appearing?

┆Issue is synchronized with this Asana task ┆Attachments: image.png | image.png | image.png ┆Due Date: 2021-11-30

sync-by-unito[bot] commented 3 years ago

➤ Dan Rademacher commented:

Initial findings here: https://docs.google.com/spreadsheets/d/1xY8BD7dWZIF53N8uv5T1Rr-_tJesQOHK5d12BczQ22M/edit#gid=340586909 ( https://docs.google.com/spreadsheets/d/1xY8BD7dWZIF53N8uv5T1Rr-_tJesQOHK5d12BczQ22M/edit#gid=340586909 )

Querying by Socrata_ID reveals only 88 crashes truly missing from data. Out of 4546 in client's sheet vs 4458 returned from in() query in CARTO with full list of IDs.

Of the 4458 that we have, only 840 are missing latlng. Based on this, I would expect that 928 could be missing from any location assignments, since they can't be assigned to areas if we have no coordinates.

But then if we just map the output of the client sheet vs the CARTO output, it looks like we have a lot more in the client sheet: image

vs CARTO export from the ID query: image

sync-by-unito[bot] commented 3 years ago

➤ Dan Rademacher commented:

And the three fatalities he mentions by ID are all in the data: SELECT * from crashes_all_prod where socrata_id in (3959009, 3562386, 4354569)

But one of them is missing geometry. image

I wonder if the real issue is some problem with the export he got, combined with some staleness in our data vs Socrata

sync-by-unito[bot] commented 3 years ago

➤ Dan Rademacher commented:

Also, none of the three they say are missing are assigned to CB107. One is 104, one is 164, and the other is null since it has no geometry.

danrademacher commented 3 years ago
More findings: Measure Results % of total missing
Total reported missing 4546  
Present, w/xy 3618  
Present, w/o xy 928  
Present with matching XY 3313 73%
Present, missing or different xy 1233 27%
Missing 88 2%
Unique XY Crashmapper 550  
Unique XY Socrata 1172  

Still not sure what the underlying cause is, but it feels like we need a script that uses Socrata ID to go back and update XY, and then rerun boundary intersections

danrademacher commented 2 years ago

Further review of coordinates strongly suggests this is an issue of later improvements in geocoding. image

The community board outlines and labels show here, and the red clustered markers are from Crashmapper, while the purple markers are the client-supplied export from Socrata. It looks like CM data is heavily clustered along the edges of 107 and 164. And all of those along the edge are classed as 164.

Looking at the balance among the CARTO data I pulled using the Socrata IDs from the client, virtually all of them appear to be in 164:

Community Board CARTO
107 18
164 2405

So it does seem like we'll need a fix_xyprocess to go along with our fix_tallies, but doing that on the whole database seems daunting.

danrademacher commented 2 years ago

Here's a map where the green diamonds are ones where CARTO matches SOCRATA though the CB assignment is ambiguous because they are on the border. The red ones are ones where updating XY will solve the issue: image

So:

  1. We need a new script to update XYs for existing records
  2. We will likely still have a "short" on CB107 for crashes that are on the borderline.

For example, the client reported 3 fatalities missing from CB107. These were noteworthy enough that folks might seek them out in the data. As noted above, all three are in CARTO. One is missing geometry but the other two have coordinated AND those coordinates match what is in SOCRATA. So nothing on item 1 will fix these.

As shown here, they are right on the edge and both got assigned to neighboring Community Boards. The large pink dot is CARTO and the small purple is client-supplied Socrata, and the label is the assigned CB for the point in CARTO: image

danrademacher commented 2 years ago

Christine asked whether we could count the crashes in both CBs. For now, let's stay focused on the XY updates.

For the record, I don't see a simple way to do multi-assign crashes since we preassign Community Board with a simple st_within query: https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/a3fdc2ca0cff89217e964d37362ae88e31b156ac/main.py#L498

Somehow assigning to multiples, we'd need to:

  1. Rewrite that query to capture up to 4 CBs (in a "Four Corners" scenario)
  2. Change the data structure to store an array or otherwise allow for tagging in multiple CBs
  3. Change the frontend on Map and Charts to query that new array data type.
  4. Almost certainly then do the same thing for the other 7 boundary types
danrademacher commented 2 years ago

This is interesting, drawing on a sample of just things around CB107:

Year Mismatched Count
2021 58
2020 15
2019 13
2018 24
2017 89
2016 1032
2015 0
2014 1
2013 0
2012 1
gregallensworth commented 2 years ago

A quick spot check of the 3 mentioned in the issue, and confirmed that their geometry matches the latest from SODA:

select
socrata_id, date_val,
ST_X(the_geom), ST_Y(the_geom), community_board
from crashes_all_prod
WHERE socrata_id IN (3959009, 3562386, 4354569)

https://data.cityofnewyork.us/resource/h9gi-nx95.json?$where=collision_id%20IN%20(3959009,%203562386,%204354569)

The old Community Board assignments for those three are still accurate: collision_id CB lng lat
3959009 164 -73.97874 40.772415
4354569 104 -73.9821 40.76889
3562386 107 -73.97504 40.79021
gregallensworth commented 2 years ago

As to CB 107, short of knowing the methodology used in your prior test, hopefully these stats will be helpful?

SELECT year, COUNT(*) FROM crashes_all_prod WHERE community_board = 107 GROUP BY year ORDER BY year
year crashes in CB 107
2012 2168
2013 2608
2014 2537
2015 2753
2016 2675
2017 2974
2018 2796
2019 2379
2020 1038
2021 938
total 23506
danrademacher commented 2 years ago

Notes:

  1. The null_geom records took a week to run, fixed 38,000 records
  2. The one to find geom differences, took 4 days to compile the list of lat-longs for comparison, then apply the 38,000 differences ran overnight. (not the same ones, the similarity in number is coincidence)
danrademacher commented 2 years ago

Just spoke to client. she's happy with this level of improvement for the CB107 "Everyone has their own versions of data. If we're closer to matching that's good enough"