Closed danrademacher closed 5 years ago
From client:
Here are two data sets
NYPD_Motor_Vehicle_Collisions-FATALITIES.csv fatalities citywide from January 1st, 2018 to November 13, 2018
NYPD_Motor_Vehicle_Collisions-INJURIES.csv Injuries citywide from January 1st 2018. to November 13, 2018
Here is the the source in case you want to make subsets to pinpoint the problems. https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95/data
I amk not certain we even have a problem. Going to ask her now that the missing data is restored.
OK, I repeated past QA processed and results are here:L https://docs.google.com/spreadsheets/d/1q3yi39yAEYPT-vc2NMWUCasDUUI2ASbe0azD6q_ZCEY/edit#gid=187881358
We are missing more than 3K crashes. I have isolated nearly all to May and June of this year.
@gregallensworth I reminded myself that https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/tree/master/backlog exists. Is that something we could rerun on 2018 YTD and maybe pick up these missing crashes? Or is it supposedly already running and we have a new source of missing crashes?
If the former, let's try it. If the latter, I want to report back to Christine and assess level of effort vs resources.
I've run the backlog.py
script, it only discovered several missing crashes, not a significant chunk of what you're reporting. I also checked over the main ETL script's find_updated_killcounts()
which looks for later-updated records and ran it patched to look back to February; it too found only several variances, not a significant proportion of what you are reporting. I then ran some modified examinations of the ETL script for the period of 2018-05 thru 2018-06 and came up with nothing new.
During my examination though, I did find some crashes which had a date_entered
in November 2018 but a date
of April 2016. While not directly relevant to May 2018 right now, it is disturbing to note that come 2020 if we were to run SODA/CARTO variances for 2018, we may find new "missing" records where NYC waited 2 years to enter them.
To attack the current variance, though, I'll need to move on to the tedious step downloading a dump of every single crash in 2018-05 from CARTO and from SODA, then comparing them, to see what I can track down.
I fetched raw lists of crashes from SODA and CARTO (see fixtallies documentation for queries used) in CSV format, then removed and reorganized columns in the SODA CSVs to match those of CARTO CSVs. I then added sums for both the 3 injury-by-person and fatality-by-person counts.
CrashExports-201805_201806.xlsx
201805 SODA CARTO DIFF
Crashes 20720 20721 1 - OK
Total Inj 5647 5642 5 - OK
Total Kil 13 13 0 - OK
Cyc Inj 462 281 181 - BAD
Mot Inj 4275 2892 1383 - BAD
Ped Inj 877 591 1368 - BAD
Cyc Kil 1 0 1 - OK
Mot Kil 5 2 3 - CHECK, may be "conversions"
Ped Kil 7 6 1 - OK
201806 SODA CARTO DIFF
Crashes 20672 20670 2 - OK
Total Inj 5590 5591 0 - OK
Total Kil 21 21 0 - OK
Cyc Inj 528 377 151 - BAD
Mot Inj 4228 3128 1100 - BAD
Ped Inj 800 592 208 - BAD
Cyc Kil 2 2 0 - OK
Mot Kil 9 8 1 - OK
Ped Kil 10 10 0 - OK
What strikes me is that
It looks as if there has been a significant change in the data, that the detailed injures (whether the injured persons were motorist, pedestrian, or cyclist) have been filled in retroactively for a whole bunch of records where only the total persons had been filled in. A possible scenario here is that NYPD logged only "3 persons injured" with no details, and later research discovered that this was 2 pedestrians and a cyclist.
Thumbing passively through June 2018, as a general rule I am seeing a lot of persons injured but with 0/0/0 for the cyc/ped/mot counts (no actual tallies as that would take time, but a general impression by sorting the rows and skipping to those where persons injured >0 and the other 3 are all 0). As such, the persons_injured
is significantly different from the actual sum of the 3 injury "roles".
Crash ID 3912235 in June 2018 2 persons injured, but 0/0/0 cyc/mot/ped later revised to 2 motorist
Crash ID 3913343 in June 2018 3 persons injured, but 0/0/0 cyc/mot/ped later revised to 3 motorist
Crash ID 3912547 in June 2018 2 persons injured, but 0/0/0 cyc/mot/ped later revised to 1 motorist (still not 2 total!)
Crash ID 3912770 in June 2018 3 persons injured, 0/0/0 details later revised to 3 motorist
Crash ID 3913183 in June 2018 3 persons, 0/0/0 detail later revised to 3 pedestrian injuries
And so on and so on. That was just passive thumbing through the first few hundred rows... Taken 1-2 variants at a time over maybe 5,000 injurious crashes, this would certainly account for what we are seeing.
I fired up the fixtallies
script, and fed it full exports of CARTO and SODA dated 2015-01-01 through 2018-10-01 The difference is 3057 crashes in which the tallies are different in CARTO than they now appear in SODA. After running this, the numbers for May + June 2018 match perfectly.
Fatalities for this 2-month period: 34, exactly matching SODA
Injuries for this 2-month period: 11,170 exactly matching SODA.
Persons Cyc Mot Ped Sum
SODA 11237 990 8503 1677 11170
CARTO 11237 990 8504 1676 11170
In fact the only diff at all is one record updated last night in NYC/SODA, in which an injured person was reclassified from Pedestrian to Motorist.
Christine says we're missing data. #15 should resolve a bunch of that, but she also sent this CSV and wants some data QA: https://docs.google.com/spreadsheets/d/1T9-ENN7uuhHabm3kYhl7XulO8th5qYgbZA2eaSNrOnE/edit#gid=1551094594
I responded asking her what the query was that generated this doc. I can't divine it from the data itself -- seems like random days grouped by borough.