Crashes Missing from Crashmapper - CB107

sync-by-unito[bot] commented 3 years ago

Forwarded by Christine:

Greetings -

I have been reviewing crash data for Manhattan CB7 and notice that a large number of crashes aren't showing up when I filter for Community Board 107.

Of the crashes I downloaded from the Socrata NYC open data portal (https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/data https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/data) filtered for CB7 by Lat/Long https://data.cityofnewyork.us/NYC-BigApps/UWS-Latitudes-and-Longitudes/ujp2-3ypa and then manually deleting crashes that are outside the district, 20,766 of 25,312 are in Crashmapper and 4,546 are not. This includes 3 fatalities that are not in Crashmapper (COLLISION_ID 3959009, 3562386, and 4354569).

Here is my data in Google Sheets https://docs.google.com/spreadsheets/d/1T8I6XXOfrRo_xG6kCW-XTtUOAz4gc06bmOU8Xh1QEgI/edit#gid=503430291 with tabs for the data I downloaded from Crashmapper (one download with all categories except no injuries, and another with only no injuries), and from Socrata. Note that I used vlookup in column B of the "UWS_Latitudes_and_Longitudes" tab to see if the COLLISION_ID number in column Z of that tab appeared in column C (socrata_id) of the "Crashmapper CB7 Data" tab. (The data for "Crashmapper CB7 Data" was from an export of Crashmapper filtered for CB7 https://crashmapper.org/#/?cfat=false&cinj=false&endDate=2020-09&geo=community_board&identifier=107&lat=40.78535069903749&lng=-73.95360946655273&lngLats=%255B%255D&mfat=false&minj=false&noInjFat=true&pfat=false&pinj=false&startDate=2020-09&vbicycle=true&vbusvan=true&vcar=true&vmotorcycle=true&vother=true&vscooter=true&vsuv=true&vtruck=true&zoom=13.) I then changed any crash that showed a match to "In Crashmapper" and the others to "Not in Crashmapper." As the pivot table shows:

Year In Crashmapper Not in Crashmapper 2012 1411 280 2013 2592 546 2014 2523 562 2015 2725 540 2016 1724 1140 2017 2862 441 2018 2749 388 2019 2341 318 2020 1015 149 2021 824 182 Grand Total 20766 4546

Would you know why these crashes aren't appearing?

┆Issue is synchronized with this Asana task ┆Attachments: image.png | image.png | image.png ┆Due Date: 2021-11-30

sync-by-unito[bot] commented 3 years ago

➤ Dan Rademacher commented:

Initial findings here: https://docs.google.com/spreadsheets/d/1xY8BD7dWZIF53N8uv5T1Rr-_tJesQOHK5d12BczQ22M/edit#gid=340586909 ( https://docs.google.com/spreadsheets/d/1xY8BD7dWZIF53N8uv5T1Rr-_tJesQOHK5d12BczQ22M/edit#gid=340586909 )

Querying by Socrata_ID reveals only 88 crashes truly missing from data. Out of 4546 in client's sheet vs 4458 returned from in() query in CARTO with full list of IDs.

Of the 4458 that we have, only 840 are missing latlng. Based on this, I would expect that 928 could be missing from any location assignments, since they can't be assigned to areas if we have no coordinates.

But then if we just map the output of the client sheet vs the CARTO output, it looks like we have a lot more in the client sheet:

vs CARTO export from the ID query:

sync-by-unito[bot] commented 3 years ago

➤ Dan Rademacher commented:

And the three fatalities he mentions by ID are all in the data: SELECT * from crashes_all_prod where socrata_id in (3959009, 3562386, 4354569)

But one of them is missing geometry.

I wonder if the real issue is some problem with the export he got, combined with some staleness in our data vs Socrata

sync-by-unito[bot] commented 3 years ago

➤ Dan Rademacher commented:

Also, none of the three they say are missing are assigned to CB107. One is 104, one is 164, and the other is null since it has no geometry.

danrademacher commented 3 years ago

More findings: Measure	Results	% of total missing
Total reported missing	4546
Present, w/xy	3618
Present, w/o xy	928
Present with matching XY	3313	73%
Present, missing or different xy	1233	27%
Missing	88	2%
Unique XY Crashmapper	550
Unique XY Socrata	1172

Still not sure what the underlying cause is, but it feels like we need a script that uses Socrata ID to go back and update XY, and then rerun boundary intersections

danrademacher commented 2 years ago

Further review of coordinates strongly suggests this is an issue of later improvements in geocoding.

The community board outlines and labels show here, and the red clustered markers are from Crashmapper, while the purple markers are the client-supplied export from Socrata. It looks like CM data is heavily clustered along the edges of 107 and 164. And all of those along the edge are classed as 164.

Looking at the balance among the CARTO data I pulled using the Socrata IDs from the client, virtually all of them appear to be in 164:

Community Board	CARTO
107	18
164	2405

So it does seem like we'll need a fix_xyprocess to go along with our fix_tallies, but doing that on the whole database seems daunting.

danrademacher commented 2 years ago

Here's a map where the green diamonds are ones where CARTO matches SOCRATA though the CB assignment is ambiguous because they are on the border. The red ones are ones where updating XY will solve the issue:

So:

We need a new script to update XYs for existing records
We will likely still have a "short" on CB107 for crashes that are on the borderline.

For example, the client reported 3 fatalities missing from CB107. These were noteworthy enough that folks might seek them out in the data. As noted above, all three are in CARTO. One is missing geometry but the other two have coordinated AND those coordinates match what is in SOCRATA. So nothing on item 1 will fix these.

As shown here, they are right on the edge and both got assigned to neighboring Community Boards. The large pink dot is CARTO and the small purple is client-supplied Socrata, and the label is the assigned CB for the point in CARTO:

danrademacher commented 2 years ago

Christine asked whether we could count the crashes in both CBs. For now, let's stay focused on the XY updates.

For the record, I don't see a simple way to do multi-assign crashes since we preassign Community Board with a simple st_within query: https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/a3fdc2ca0cff89217e964d37362ae88e31b156ac/main.py#L498

Somehow assigning to multiples, we'd need to:

Rewrite that query to capture up to 4 CBs (in a "Four Corners" scenario)
Change the data structure to store an array or otherwise allow for tagging in multiple CBs
Change the frontend on Map and Charts to query that new array data type.
Almost certainly then do the same thing for the other 7 boundary types

danrademacher commented 2 years ago

This is interesting, drawing on a sample of just things around CB107:

Year	Mismatched Count
2021	58
2020	15
2019	13
2018	24
2017	89
2016	1032
2015	0
2014	1
2013	0
2012	1

gregallensworth commented 2 years ago

A quick spot check of the 3 mentioned in the issue, and confirmed that their geometry matches the latest from SODA:

select
socrata_id, date_val,
ST_X(the_geom), ST_Y(the_geom), community_board
from crashes_all_prod
WHERE socrata_id IN (3959009, 3562386, 4354569)

https://data.cityofnewyork.us/resource/h9gi-nx95.json?$where=collision_id%20IN%20(3959009,%203562386,%204354569)

The old Community Board assignments for those three are still accurate:	collision_id	CB	lng
3959009	164	-73.97874	40.772415
4354569	104	-73.9821	40.76889
3562386	107	-73.97504	40.79021

gregallensworth commented 2 years ago

As to CB 107, short of knowing the methodology used in your prior test, hopefully these stats will be helpful?

SELECT year, COUNT(*) FROM crashes_all_prod WHERE community_board = 107 GROUP BY year ORDER BY year

year	crashes in CB 107
2012	2168
2013	2608
2014	2537
2015	2753
2016	2675
2017	2974
2018	2796
2019	2379
2020	1038
2021	938
total	23506

danrademacher commented 2 years ago

Notes:

The null_geom records took a week to run, fixed 38,000 records
The one to find geom differences, took 4 days to compile the list of lat-longs for comparison, then apply the 38,000 differences ran overnight. (not the same ones, the similarity in number is coincidence)

danrademacher commented 2 years ago

Just spoke to client. she's happy with this level of improvement for the CB107 "Everyone has their own versions of data. If we're closer to matching that's good enough"

GreenInfo-Network / nyc-crash-mapper-etl-script

Crashes Missing from Crashmapper - CB107 #34