Yet more QA on missing data

danrademacher commented 6 years ago

Christine says we're missing data. #15 should resolve a bunch of that, but she also sent this CSV and wants some data QA: https://docs.google.com/spreadsheets/d/1T9-ENN7uuhHabm3kYhl7XulO8th5qYgbZA2eaSNrOnE/edit#gid=1551094594

I responded asking her what the query was that generated this doc. I can't divine it from the data itself -- seems like random days grouped by borough.

danrademacher commented 6 years ago

From client:

Here are two data sets

NYPD_Motor_Vehicle_Collisions-FATALITIES.csv fatalities citywide from January 1st, 2018 to November 13, 2018
NYPD_Motor_Vehicle_Collisions-INJURIES.csv Injuries citywide from January 1st 2018. to November 13, 2018

20181114_crashmapperQA.zip

Here is the the source in case you want to make subsets to pinpoint the problems. https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95/data

I amk not certain we even have a problem. Going to ask her now that the missing data is restored.

danrademacher commented 5 years ago

OK, I repeated past QA processed and results are here:L https://docs.google.com/spreadsheets/d/1q3yi39yAEYPT-vc2NMWUCasDUUI2ASbe0azD6q_ZCEY/edit#gid=187881358

We are missing more than 3K crashes. I have isolated nearly all to May and June of this year.

@gregallensworth I reminded myself that https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/tree/master/backlog exists. Is that something we could rerun on 2018 YTD and maybe pick up these missing crashes? Or is it supposedly already running and we have a new source of missing crashes?

If the former, let's try it. If the latter, I want to report back to Christine and assess level of effort vs resources.

gregallensworth commented 5 years ago

I've run the backlog.py script, it only discovered several missing crashes, not a significant chunk of what you're reporting. I also checked over the main ETL script's find_updated_killcounts() which looks for later-updated records and ran it patched to look back to February; it too found only several variances, not a significant proportion of what you are reporting. I then ran some modified examinations of the ETL script for the period of 2018-05 thru 2018-06 and came up with nothing new.

During my examination though, I did find some crashes which had a date_entered in November 2018 but a date of April 2016. While not directly relevant to May 2018 right now, it is disturbing to note that come 2020 if we were to run SODA/CARTO variances for 2018, we may find new "missing" records where NYC waited 2 years to enter them.

To attack the current variance, though, I'll need to move on to the tedious step downloading a dump of every single crash in 2018-05 from CARTO and from SODA, then comparing them, to see what I can track down.

gregallensworth commented 5 years ago

ETL Results, Raw Comparison

I fetched raw lists of crashes from SODA and CARTO (see fixtallies documentation for queries used) in CSV format, then removed and reorganized columns in the SODA CSVs to match those of CARTO CSVs. I then added sums for both the 3 injury-by-person and fatality-by-person counts.

CrashExports-201805_201806.xlsx

201805        SODA       CARTO     DIFF
Crashes      20720       20721     1 - OK
Total Inj     5647        5642     5 - OK
Total Kil       13          13     0 - OK
Cyc Inj        462         281     181 - BAD
Mot Inj       4275        2892     1383 - BAD
Ped Inj        877         591     1368 - BAD
Cyc Kil          1           0     1 - OK
Mot Kil          5           2     3 - CHECK, may be "conversions"
Ped Kil          7           6     1 - OK

201806        SODA       CARTO     DIFF
Crashes      20672       20670     2 - OK
Total Inj     5590        5591     0 - OK
Total Kil       21          21     0 - OK
Cyc Inj        528         377     151 - BAD
Mot Inj       4228        3128     1100 - BAD
Ped Inj        800         592     208 - BAD
Cyc Kil          2           2     0 - OK
Mot Kil          9           8     1 - OK
Ped Kil         10          10     0 - OK

What strikes me is that

Kill counts show only small variance, consistent with an injury later proving to be fatal ("conversion" as I call it) and the record being retroactively updated
The crash count is spot on, as are the total persons killed and injured
Injury counts are hugely under-reported despite the number of crashes being on target

It looks as if there has been a significant change in the data, that the detailed injures (whether the injured persons were motorist, pedestrian, or cyclist) have been filled in retroactively for a whole bunch of records where only the total persons had been filled in. A possible scenario here is that NYPD logged only "3 persons injured" with no details, and later research discovered that this was 2 pedestrians and a cyclist.

gregallensworth commented 5 years ago

Details review

Thumbing passively through June 2018, as a general rule I am seeing a lot of persons injured but with 0/0/0 for the cyc/ped/mot counts (no actual tallies as that would take time, but a general impression by sorting the rows and skipping to those where persons injured >0 and the other 3 are all 0). As such, the persons_injured is significantly different from the actual sum of the 3 injury "roles".

Crash ID 3912235 in June 2018 2 persons injured, but 0/0/0 cyc/mot/ped later revised to 2 motorist

Crash ID 3913343 in June 2018 3 persons injured, but 0/0/0 cyc/mot/ped later revised to 3 motorist

Crash ID 3912547 in June 2018 2 persons injured, but 0/0/0 cyc/mot/ped later revised to 1 motorist (still not 2 total!)

Crash ID 3912770 in June 2018 3 persons injured, 0/0/0 details later revised to 3 motorist

Crash ID 3913183 in June 2018 3 persons, 0/0/0 detail later revised to 3 pedestrian injuries

And so on and so on. That was just passive thumbing through the first few hundred rows... Taken 1-2 variants at a time over maybe 5,000 injurious crashes, this would certainly account for what we are seeing.

A dashboard/view based on the total persons fields, would seem to over-report relative to a more detailed "by role" view, as it includes even people whose role in the crash was unknown.
While the CrashMapper would seem to under-report, by tallying only injuries/fatalities where their role was known.

gregallensworth commented 5 years ago

Fixtallies to Fix Tallies

I fired up the fixtallies script, and fed it full exports of CARTO and SODA dated 2015-01-01 through 2018-10-01 The difference is 3057 crashes in which the tallies are different in CARTO than they now appear in SODA. After running this, the numbers for May + June 2018 match perfectly.

Fatalities for this 2-month period: 34, exactly matching SODA

Injuries for this 2-month period: 11,170 exactly matching SODA.

      Persons   Cyc     Mot   Ped     Sum
SODA    11237   990   8503    1677    11170
CARTO   11237   990    8504   1676    11170

In fact the only diff at all is one record updated last night in NYC/SODA, in which an injured person was reclassified from Pedestrian to Motorist.

GreenInfo-Network / nyc-crash-mapper-etl-script