GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.
3 stars 0 forks source link

Create new FixCoords script, or expand FixTallies #35

Closed danrademacher closed 2 years ago

danrademacher commented 2 years ago

Over in #34 I diagnosed an issue where lots of crashes in Socrata now have different coordinates than what we have in CARTO. We suspect that NYC might sometimes regeocode crashes with poor location info as they improve their geocoder, but ultimately, we're not sure why this happens.

As a way to patch the problem, we need to develop a script that updates XY coordinates from Socrata into CARTO.

We already have a couple of methods we have used to deal with changes in tallies of injured/killeD:

  1. Every day as part of main.py we look back 90 days and update tallies. That's here: https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/master/main.py#L814. This runs daily so we can be sure it's correct to current schema and approaches, etc.
  2. We also "FixTallies" script that checks more broadly for changes in injured/killed data and updates CARTO. That's here: https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/tree/master/fixtallies. This runs manually and we haven't touched it in 2 years. Not sure when we last ran it, so it may or may not work. As I recall, we used this to catch up on wrong injured/killed tallies so we could then do the 90 day update in 1.

Right now, we could have wrong coordinates going all the way back to the beginning, so it seems like we'll need some version of 2, and then include the XY update in our main.py going forward as well.

This spreadsheet tab has a list of known mismatched records, where COLLISION_ID is socrata_id.

This is interesting, drawing on a sample of just things around CB107, showing that most of the problem is in 2016:

Year  Mismatched Count
2021 58
2020 15
2019 13
2018 24
2017 89
2016 1032
2015 0
2014 1
2013 0
2012 1

Bills to CHEKPEDS:Crashmapper Ongoing.

┆Issue is synchronized with this Asana task

gregallensworth commented 2 years ago

If the 1233 rows listed in the spreadsheet, I see that 1145 of them are in CARTO

This means 88 crashes are absent from CARTO.

Of those 1145 in CARTO, 841 have the_geom IS NULL and 304 have a not-null geometry.

I would do the three phases in this sequence, so we don't re-check coordinates that we inserted/updated only a moment earlier in the previous phase.

gregallensworth commented 2 years ago

Those with differing geometry from Socrata

I have added a new OFF METERS column, with a Haversine calculation. I'm glad I did, because in my initial random test I grabbed a few that all had <1 meter error, and was about to dismiss this as a cause. But a few are several kilometers!

An extreme example is socrata_id 3463725, row 1139 in this spreadsheet - the geocode changed by nearly 12 miles!

So in the second part of the script, looking for changed latlong, a potential optimization could be updating only if either coordinate has changed by more than 0.0005. If not, then we would be shifting it by as little as 1 meters... and potentially updating the same record every day due to implicit rounding error as the coordinates are saved.

After the big run of filling in nulls back to 2016, there are 1,029,032 records dated 2016-01-01 and later, which have a not-null geometry, and therefore would be candidates to check for an update.

SELECT COUNT(*) FROM crashes_all_prod WHERE socrata_id IS NOT NULL AND date_val >= '2016-01-01T00:00:00Z' AND the_geom IS NOT NULL

A new ETL script to go over CARTO records where the_geom is not null and corresponding SODA records, find differing geoms, and update

Update: Some statistics about the diffs:

gregallensworth commented 2 years ago

Those with null geometry in CARTO

The ones which had no geometry at the time they were imported, was (if I recall) a known phenomenon, and now we know that NYC may go back and geocode them. That's good to know.

Right now 226,003 rows (11.6%) have null geom. These account for 841 of these "mismatches" (73%) as well.

Of these, 666 also have no socrata_id and are dates 2011 and 2012. Let's ignore those, since we won't have data updates for them anyway (no socrata ID).

Additionally, there are 49 rows with the_geom = 0,0 While a very small number, these should be included in the cleanup attempt.

SELECT * FROM crashes_all_prod WHERE the_geom IS NULL AND socrata_id IS NOT NULL

Yes, really, no latlong at Socrata

It's not just old data. Socrata ID 4477734 is from November 15,. only a couple of days ago.

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$where=collision_id=4477734

image

Also, no socrata_id before 2013

UPDATE: after updating the nulls

I created the new script fix_null_geom_in_carto.py and ran it over the course of a week. There are now 86,784 records with null geometry when counting back to 2016-01-01, and 188,336 overall. So this is a reduction of some 38,000 records, but still not 100% of what there is.

I would point out that this took a week of monitoring, resuming after an API error or network error, re-running to catch new-new updates during the week it ran, ... Definitely not a "can we run this once a week?" sort of deal.

gregallensworth commented 2 years ago

Those absent from CARTO

The most likely cause, is that these crashes were not entered until after the looking-back period. The looking-back is defined as 2 months.

FETCH_HOWMANY_MONTHS = 2  # when looking for new records in SODA, look back how many months?

I would need to sit down and narrow down the time window(s) for these missing ones, then run the "backlog" script to pick them up.

UPDATE: A full backlog run from 2016-01 through 2021-09 brings us up to 1231 out of the 1233 records, now present in CARTO. Still missing 2.

The 2 still absent missing are Socrata ID = 3790191 and 4043582

They are from 2014 and 2012 respectively, so were not included in previous backlog runs to 2015.

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$where=collision_id=3790191

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$where=collision_id=4043582

gregallensworth commented 2 years ago

Refreshing my memory too:

The fixtallies script was not an automated CARTO-and-Socrata checking tool. It accepted two CSV files of data dumps from the two, so went into it looking for a known universe of incorrect tallies as determined by visual & mental examination.

danrademacher commented 2 years ago

One open loop here is whether there's any way to make the latlng updates part of our scheduled script updates in main.py that run every day. Even though those would only run on the 90-day "lookback" it seems like this would be a good improvement to make to stay a bit more aligned with Socrata.

gregallensworth commented 2 years ago

any way to make the latlng updates part of our scheduled script updates in main.py that run every day those would only run on the 90-day "lookback"

Yes I will make it do that.

danrademacher commented 2 years ago

Hmm, OK, I thought this section of main.py went back and updated the killed and injured tallies and maybe we could do something similar for xy: https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/master/main.py#L871-L927

But I gotta admit that code is at the edge of my understanding, so if a separate script is better, that's fine.