Closed sync-by-unito[bot] closed 2 years ago
➤ Dan Rademacher commented:
Initial findings here: https://docs.google.com/spreadsheets/d/1xY8BD7dWZIF53N8uv5T1Rr-_tJesQOHK5d12BczQ22M/edit#gid=340586909
Querying by Socrata_ID reveals that only 88 crashes are truly missing from the data: 4546 in the client's sheet vs 4458 returned from an in() query in CARTO with the full list of IDs.
Of the 4458 that we have, only 840 are missing latlng. Based on this, I would expect that 928 (840 + 88) could be missing from any location assignments, since they can't be assigned to areas if we have no coordinates.
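For reference, the comparison above is basically a set difference on Socrata IDs plus a count of rows without coordinates. A rough sketch, assuming both exports are loaded as lists of dicts; the function name, the lat/lng field names, and the sample rows are made up for illustration and are not the real data:

```python
def reconcile(client_ids, carto_rows):
    """Split the client's IDs into missing-from-CARTO and present-but-no-coordinates."""
    wanted = set(client_ids)
    # Keep only CARTO rows the client asked about, keyed by Socrata ID
    found = {row["socrata_id"]: row for row in carto_rows if row["socrata_id"] in wanted}
    # IDs in the client's sheet that CARTO did not return at all
    missing = sorted(wanted - set(found))
    # IDs CARTO returned, but with no usable coordinates
    no_latlng = sorted(
        sid for sid, row in found.items()
        if row.get("lat") is None or row.get("lng") is None
    )
    return {
        "missing": missing,
        "no_latlng": no_latlng,
        # Neither group can be assigned to a boundary
        "unassignable": len(missing) + len(no_latlng),
    }


# Made-up rows for illustration only
client_ids = [101, 102, 103, 104]
carto_rows = [
    {"socrata_id": 101, "lat": 40.79, "lng": -73.97},
    {"socrata_id": 102, "lat": None, "lng": None},
    {"socrata_id": 103, "lat": 40.77, "lng": -73.98},
]
result = reconcile(client_ids, carto_rows)
print(result)
```

With the real exports, the "missing" and "no_latlng" counts would correspond to the 88 and 840 figures above.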
But then if we just map the output of the client sheet vs the CARTO output, it looks like we have a lot more in the client sheet:
vs CARTO export from the ID query:
➤ Dan Rademacher commented:
And the three fatalities he mentions by ID are all in the data:
SELECT * from crashes_all_prod where socrata_id in (3959009, 3562386, 4354569)
But one of them is missing geometry.
I wonder if the real issue is some problem with the export he got, combined with some staleness in our data vs Socrata
➤ Dan Rademacher commented:
Also, none of the three they say are missing are assigned to CB107. One is 104, one is 164, and the other is null since it has no geometry.
More findings:

Measure | Results | % of total missing |
---|---|---|
Total reported missing | 4546 | |
Present, w/xy | 3618 | |
Present, w/o xy | 928 | |
Present with matching XY | 3313 | 73% |
Present, missing or different xy | 1233 | 27% |
Missing | 88 | 2% |
Unique XY Crashmapper | 550 | |
Unique XY Socrata | 1172 | |
Still not sure what the underlying cause is, but it feels like we need a script that uses the Socrata ID to go back and update XY, and then reruns the boundary intersections.
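A rough sketch of what that kind of XY-refresh script could look like, not the actual ETL code. It assumes the current CARTO rows and the latest Socrata rows are already loaded as lists of dicts with socrata_id, lat and lng keys; the function names, the tolerance, and the sample rows are all hypothetical. The table and geometry column names (crashes_all_prod, the_geom) are taken from the queries in this thread, and the generated SQL uses standard PostGIS ST_SetSRID/ST_MakePoint:

```python
def plan_xy_updates(carto_rows, socrata_rows, tolerance=1e-6):
    """Return (socrata_id, lng, lat) for CARTO rows whose XY is missing or differs from Socrata."""
    latest = {row["socrata_id"]: row for row in socrata_rows}
    updates = []
    for row in carto_rows:
        src = latest.get(row["socrata_id"])
        # Skip rows where Socrata has nothing better to offer
        if src is None or src.get("lat") is None or src.get("lng") is None:
            continue
        stale = (
            row.get("lat") is None
            or row.get("lng") is None
            or abs(row["lat"] - src["lat"]) > tolerance
            or abs(row["lng"] - src["lng"]) > tolerance
        )
        if stale:
            updates.append((row["socrata_id"], src["lng"], src["lat"]))
    return updates


def to_update_sql(socrata_id, lng, lat, table="crashes_all_prod"):
    """Build one UPDATE statement (illustration only; real code should use bound parameters)."""
    return (
        f"UPDATE {table} "
        f"SET the_geom = ST_SetSRID(ST_MakePoint({lng}, {lat}), 4326) "
        f"WHERE socrata_id = {int(socrata_id)};"
    )


# Made-up rows for illustration only
carto_rows = [
    {"socrata_id": 1, "lat": 40.79, "lng": -73.97},  # already matches
    {"socrata_id": 2, "lat": None, "lng": None},     # missing geometry
    {"socrata_id": 3, "lat": 40.77, "lng": -73.98},  # stale coordinates
    {"socrata_id": 4, "lat": 40.70, "lng": -73.90},  # not in the Socrata sample
]
socrata_rows = [
    {"socrata_id": 1, "lat": 40.79, "lng": -73.97},
    {"socrata_id": 2, "lat": 40.78, "lng": -73.96},
    {"socrata_id": 3, "lat": 40.775, "lng": -73.981},
]
updates = plan_xy_updates(carto_rows, socrata_rows)
statements = [to_update_sql(*u) for u in updates]
```

Only the rows that come back from the planning step would need their boundary intersections rerun, which keeps the job much smaller than reprocessing the whole table.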
Further review of coordinates strongly suggests this is an issue of later improvements in geocoding.
The community board outlines and labels show here, and the red clustered markers are from Crashmapper, while the purple markers are the client-supplied export from Socrata. It looks like CM data is heavily clustered along the edges of 107 and 164. And all of those along the edge are classed as 164.
Looking at the balance among the CARTO data I pulled using the Socrata IDs from the client, virtually all of them appear to be in 164:
Community Board | CARTO |
---|---|
107 | 18 |
164 | 2405 |
So it does seem like we'll need a fix_xy process to go along with our fix_tallies, but doing that on the whole database seems daunting.
Here's a map where the green diamonds are ones where CARTO matches SOCRATA though the CB assignment is ambiguous because they are on the border. The red ones are ones where updating XY will solve the issue:
So:
For example, the client reported 3 fatalities missing from CB107. These were noteworthy enough that folks might seek them out in the data. As noted above, all three are in CARTO. One is missing geometry, but the other two have coordinates AND those coordinates match what is in Socrata. So nothing on item 1 will fix these.
As shown here, they are right on the edge and both got assigned to neighboring Community Boards. The large pink dot is CARTO and the small purple is client-supplied Socrata, and the label is the assigned CB for the point in CARTO:
Christine asked whether we could count the crashes in both CBs. For now, let's stay focused on the XY updates.
For the record, I don't see a simple way to multi-assign crashes, since we preassign Community Board with a simple st_within query:
https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/a3fdc2ca0cff89217e964d37362ae88e31b156ac/main.py#L498
Somehow assigning to multiples, we'd need to:
This is interesting, drawing on a sample of just things around CB107:
Year | Mismatched Count |
---|---|
2021 | 58 |
2020 | 15 |
2019 | 13 |
2018 | 24 |
2017 | 89 |
2016 | 1032 |
2015 | 0 |
2014 | 1 |
2013 | 0 |
2012 | 1 |
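A tally like the one above can be produced by grouping the flagged rows by crash year. A minimal sketch, assuming each mismatched row carries a date_val that starts with an ISO date (YYYY-MM-DD); the function name and the sample rows are made up:

```python
from collections import Counter
from datetime import date


def mismatch_counts_by_year(mismatched_rows):
    """Count mismatched crashes per year, newest year first."""
    counts = Counter(
        # Only the first 10 characters are parsed, so a trailing time part is ignored
        date.fromisoformat(row["date_val"][:10]).year
        for row in mismatched_rows
    )
    return sorted(counts.items(), reverse=True)


# Made-up rows for illustration only
sample = [
    {"socrata_id": 1, "date_val": "2016-05-02"},
    {"socrata_id": 2, "date_val": "2016-11-19T00:00:00"},
    {"socrata_id": 3, "date_val": "2021-01-07"},
]
yearly = mismatch_counts_by_year(sample)
```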
Did a quick spot check of the 3 mentioned in the issue and confirmed that their geometry matches the latest from SODA:
select
socrata_id, date_val,
ST_X(the_geom), ST_Y(the_geom), community_board
from crashes_all_prod
WHERE socrata_id IN (3959009, 3562386, 4354569)
The old Community Board assignments for those three are still accurate:

collision_id | CB | lng | lat |
---|---|---|---|
3959009 | 164 | -73.97874 | 40.772415 |
4354569 | 104 | -73.9821 | 40.76889 |
3562386 | 107 | -73.97504 | 40.79021 |
As to CB 107, short of knowing the methodology used in your prior test, hopefully these stats will be helpful?
SELECT year, COUNT(*) FROM crashes_all_prod WHERE community_board = 107 GROUP BY year ORDER BY year
year | crashes in CB 107 |
---|---|
2012 | 2168 |
2013 | 2608 |
2014 | 2537 |
2015 | 2753 |
2016 | 2675 |
2017 | 2974 |
2018 | 2796 |
2019 | 2379 |
2020 | 1038 |
2021 | 938 |
total | 23506 |
Notes:
Just spoke to the client. She's happy with this level of improvement for CB107: "Everyone has their own versions of data. If we're closer to matching, that's good enough."
Forwarded by Christine:
┆Issue is synchronized with this Asana task ┆Attachments: image.png | image.png | image.png ┆Due Date: 2021-11-30