GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.
3 stars 0 forks source link

Crash counts mismatch between CARTO & SODA #12

Closed gregallensworth closed 6 years ago

gregallensworth commented 6 years ago

Dan has put together a spreadsheet investigating the crash counts from Socarata and from CARTO, and come up with some figures indicating that in many cases the injury count in CARTO (and CrashMapper) is higher than indicated by the Socrata systems.

https://docs.google.com/spreadsheets/d/14JNTMS4gRszWSVXY0kOGXkYCWWp440Las_PE40nAAT0/edit#gid=248679534

Notably the bottom which lists the most-incorrect months.

Year    Month   Injured_NYC Injured_CARTO   Killed_NYC  Killed_CARTO
2017    2   3907        4061        13      13              CARTO extra 100 injuries
2016    12  4884        5580        24      26              CARTO extra 700 injuries
2016    8   5509        11318       29      27              CARTO nearly 2X the injuries
gregallensworth commented 6 years ago

Exporting from CARTO

Use a query like this, then export as CSV:

SELECT
socrata_id,
date_val::date,
number_of_persons_injured,
number_of_cyclist_injured,
number_of_motorist_injured,
number_of_pedestrian_injured,
number_of_persons_killed,
number_of_cyclist_killed,
number_of_motorist_killed,
number_of_pedestrian_killed,
borough
FROM crashes_all_prod
WHERE
socrata_id IS NOT NULL
AND date_val::date >= '2017-02-01' AND date_val::date < '2017-03-01'
ORDER BY socrata_id

Exporting from SODA

Use an URL such as this

https://data.cityofnewyork.us/resource/qiz3-axqb.csv?$where=date%3E=%272017-02-01%27%20AND%20date%3C%272017-03-01%27&$order=unique_key%20ASC&$limit=50000

Postprocessing

The two CSVs were saved as XLSX worksheets, then the fields resequenced so the two spreadsheets could be compared side by side.

SODA-201702.xlsx CARTO-201702.xlsx

gregallensworth commented 6 years ago

Quick Stats comparison for 201702

NOTE that my counts here are not aggregating by borough, nor at this time filtering by borough being non-blank. As such, these sums will differ from the borough-aggregates listed in Dan's worksheet.

CARTO 201702.xlsx = 16077 crash records, 3971 total injuries (Persons Injured field)
SODA 201702.xlsx = 15701 crash records, 3907 total injuries (Persons Injured field)

Diff: CARTO has 376 more crashes (2% more) and 64 more injuries (2% more)

NOTE that the attached file includes data for 2017-03-01, and this should not be included in the comparisons.

gregallensworth commented 6 years ago

Crashes Duplicated in CARTO / Crashes Missing in SODA

Attached below is a script I wrote to compare the crash lists, and its output.

The CARTO listing for 2017-02 has 398 duplicated crashes, adding up to 85 injuries and 0 fatalities.

The CARTO listing contains 569 crashes which are not present in SODA, adding up to 139 injuries and 0 fatalities.

The noted variance is 202 injuries and 945 crashes. The variances seen above would account for 967 crashes, 224 injuries, and 0 fatalities. NOTE that this quick sum may not add up exactly, as some of the duplicates may be present in SODA or absent from SODA, affecting the final outcome of resolving these duplicates. Relevant point is that of the 5% error we're seeing, about 4 points may be accounted by these two sources noted above.

compare_crash_lists.py.txt

Crashes_MissingFromSODA.xlsx DuplicateCrashesInCARTO.xlsx

Crashes Not Yet In CARTO

Also noted are 22 crashes present in SODA which were not picked up into CARTO. This can be be due to the data appearing too late after the ETL script run; it only looks back 1 month, data backlogged 3 months later would be missed, or the record being otherwise malformed. The variance here would be -22 crashes and -5 injuries and thus is not significant compared to the sources noted above.

Crashes_MissingFromCARTO.xlsx

Crashes Without Socrata IDs

It should be noted that there exist 105076 records where the socrata_id field is blank. That's about 8% of all records. These are all in the date range of 2011 and early 2012 (see issue #9) so are not relevant to the variances being examined here. But it is worth noting that if SODA has data from this era, it could differ significantly from what was loaded, and there's little-to-nothing that could be done about it.

danrademacher commented 6 years ago

Seems like a process to dedupe the content in CARTO would be reasonable. Even if that’s only a partial resolution, not counting things twice or more is just a good idea

gregallensworth commented 6 years ago

Duplicate crashes cleaned up

Per issue #9 the duplicated socrata_id records were visually examined and one record from each pair was deleted. These were wholly within 2017-02-26, and were to the tune of 85 injuries over 400 crashes: about 10% of the crash variance and 40% of injury variance

As such, their cleanup should significantly improve the variance for 2017-02, which was the second-most-variant month noted.

New Download and Quick Stats

A new export of CARTO for 2017-02 is attached. 2017-02 is now well within acceptable variances, of well under 1%.

CARTO: 15679 crashes, 3886 injuries, 12 fatalities
SODA:  15701 crashes, 3907 injuries, 13 fatalities

0.2% variance in crash count
0.5% variance in injury count

SODA-201702.xlsx CARTO-201702.xlsx

Flaw in Previous Notes

I had mentioned in a previous note, records missing from one API but present in the other. This was an error in my export file I had accidentally included 2017-03-01 in my download from CARTO, which included some 400 crashes which were of course absent from the SODA export.

With the above export, there are:

gregallensworth commented 6 years ago

Virtually all of the variance in 2016-02 was explained and solved by the duplicate records. However, those duplicates were dated entirely within 2017-02-26, so are not causes of variance elsewhere. Others months need separate investigation.

New month: 2016-12

Export files were generated for date >= 2016-12-01 AND date < 2017-01-01 and are attached.

CARTO-201612.xlsx SODA-201612.xlsx

Quick stats are:

CARTO: 19235 crashes, 4884 injuries, 24 fatalities
SODA: 19236 crashes, 4885 injuries, 24 fatalities

The difference here is 1 single crash (with 1 single injury) present in SODA but not in CARTO, likely due to the usual "backlogged too late" effect otherwise noted.

In this case, I suggest that the variance noted in Dan's analysis, is due to a flaw in methodology. For example the Socrata Data API being used by the ETL may not be the same as the pre-digested spreadsheets which Dan was using, or aggregation by borough may not have been done accurately. (the ETL uses polygon containment, which gets a lot more hits than SODA's own field)

gregallensworth commented 6 years ago

Most Variant Month: 2016-08

The most-variant month listed in Dan's analysis is 2016-08, with CARTO showing 95% more injuries than derived from the Socrata-provided spreadsheets. Export files were generated from SODA and CARTO, and are attached. Quick stats are:

CARTO: 19690 crashes, 5510 injuries, 28 fatalities
SODA: 19690 crashes, 5509 injuries, 29 fatalities

SODA-201608.xlsx CARTO-201608.xlsx

This is a variation of 1 injury and 1 fatality, out of the agreed-upon 19690 crash records. This would be considered well within acceptable variances.

I have no infrmation as to how Dan got nearly double the injuries in his analysis. The simplest explanation would be a typographical error when making the query, likely in the date range?

2016-08 Redux

I then re-ran this using a CARTO query such as Dan was using, to aggregate by year-month and also by borough.

SELECT * FROM (
    SELECT
        year || '-' || LPAD(month::text, 2, '0') as year_month,
        borough,
        COUNT(c.cartodb_id) as total_crashes,
        SUM(number_of_persons_injured) AS number_of_persons_injured,
        SUM(number_of_persons_killed) AS number_of_persons_killed
    FROM crashes_all_prod c
    GROUP BY year_month, borough
    ORDER BY year_month asc, borough asc
) boroughbymonth
WHERE year_month = '2016-08'

Results are:

borough     number_of_persons_injured   number_of_persons_killed    total_crashes

none            1545                        11                          5276
Bronx            705                         2                          2184
Brooklyn        1267                         4                          4140
Manhattan        587                         3                          3094
Queens          1235                         7                          4209
Staten Island    171                         1                           788

SUM             5510                        28                         19691

This gives almost precisely the same results. The variation is 1 crash from the other CARTO query (using the date object instead of year-month) and a variance from SODA of 1 crash and 1 fatality.

This indicates that variance caused by the date filtering and methodology differences are trivial, and corroborates that the 11,318 tally in the original analysis was erroneous due to other methodology.

danrademacher commented 6 years ago

hmm, but in the app: image

http://vis.crashmapper.org/?cfat=true&cinj=true&geo=city_council&mfat=true&minj=true&p1end=2018-02&p1start=2017-02&p2end=2016-09&p2start=2015-09&pfat=true&pinj=true&primary&reference=citywide&view=trend

danrademacher commented 6 years ago
{
"total_crashes": 19691,
"cyclist_injured": 1234,
"cyclist_killed": 3,
"motorist_injured": 8467,
"motorist_killed": 7,
"pedestrian_injured": 1617,
"pedestrian_killed": 17,
"persons_injured": 11318,
"persons_killed": 27,
"year_month": "2016-08"
},

query: https://chekpeds.carto.com/api/v2/sql?q=SELECT+COUNT(c.cartodb_id)+as+total_crashes,+SUM(c.number_of_cyclist_injured)+as+cyclist_injured,+SUM(c.number_of_cyclist_killed)+as+cyclist_killed,+SUM(c.number_of_motorist_injured)+as+motorist_injured,+SUM(c.number_of_motorist_killed)+as+motorist_killed,+SUM(c.number_of_pedestrian_injured)+as+pedestrian_injured,+SUM(c.number_of_pedestrian_killed)+as+pedestrian_killed,+SUM(c.number_of_pedestrian_injured+%2B+c.number_of_cyclist_injured+%2B+c.number_of_motorist_injured)+as+persons_injured,+SUM(c.number_of_pedestrian_killed+%2B+c.number_of_cyclist_killed+%2B+c.number_of_motorist_killed)+as+persons_killed,+year+%7C%7C+%27-%27+%7C%7C+LPAD(month::text,+2,+%270%27)+as+year_month+FROM+crashes_all_prod+c+GROUP+BY+year,+month+ORDER+BY+year+asc,+month+asc

danrademacher commented 6 years ago

The above two comments suggest that the spikes I was seeing in my analysis is present in the application but not in the data.

I'm puzzled for sure.

danrademacher commented 6 years ago

Another round of checking using queries in this sheet suggests that the root of the problem is that the subtotals for pedestrians_injured,cyclists_injured, and motorists_injured in CARTO are not correct, even though persons_injured is very close.

Mode Injured_NYC_by_mode Injured_CARTO_by_mode
pedestrian 883  1571
cyclist 678 1176
motorist 4244 8201

Based on un-aggregated crash data pulled with this query

Compared to crash export from https://data.cityofnewyork.us/Public-Safety/2016-08/rt6i-7uc9/data.

Here's what the map app shows: image

I wonder if the best approach here would be to simply blow away and reimport the three most problematic months rather than trying to divine what went wrong when they were imported.

The most variant months are:

Year Month
2016 8
2016 9
2017 1
gregallensworth commented 6 years ago

I agree with Dan that "pedestrians_injured, cyclists_injured, and motorists_injured in CARTO are not correct, even though persons_injured is very close" My analyses above were solely with the persons_injured field and not the sum of the three subtypes of injured party.

2016-08

This step, is to examine both the SODA and CARTO data for 2016-08 to determine the difference between the persons_injured field and the sum of the 3 injury fields (motorist, pedestrian, cyclist) to determine the nature of this mismatch. This is the most variant month noted.

If the data are badly mismatched in SODA, then reloading would achieve nothing except reloading records with the same bad math (in which 1 motorist + 2 bicyclists + 0 peds = 5 persons). But if the SODA data are nicely matched with acceptable internal variance, then perhaps reloading could make sense.

SODA-injurycrashes-201608.xlsx CARTO-injurycrashes-201608.xlsx data-fetch-methodology.txt

Results thus far:

SODA largest single-crash variances:

1 instance of 1 injury but 4 motorist (diff = 3)
1 instance of 9 injuries but 7 motorist (diff = 2)
Large majority are 0 diff or +1 to -1 diff.

CARTO internal variances:

Few records have an internal variance of 0, most are +1 but +_16 +18 +21 are noted

Some specifics:

3492924     SODA 5 motorist = 5 total, in CARTO as 15 motorist = 5 total
3492692     SODA 3 motorist = 3 total, in CARTO as 6 motorist = 3 total
3492614     SODA 2 motorists = 1 total (internal variance), in CARTO as 3 motorist (did they sum 2+1?)
3498009     SODA 7 motorists = 7 total, in CARTO as 28 motorist = 7 total

Summary of Findings

Something may have been significantly borked in the ETL script back then, in regards to loading the by-vehicle injury counts, and specifically the motorist injury counts. Reloading the data will be a real bear, at 4000 crashes per month, so I will need to do some creative invention trying to make that happen.

Other months are likely exhibit this same sort of variance, not only the 3 described at https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/issues/12#issuecomment-368392100 and should be examined for these sort of issues.

gregallensworth commented 6 years ago

2017-02

The goal here is to determine whether the variances noted in 2016-08 above are similar to those affecting other months, e.g. 2017-02.

SODA-injurycrashes-201702.xlsx CARTO-injurycrashes-201702.xlsx

CARTO dataset, internal variances overview

Total variance:

3976 summed versus 3886 persons injured = 90 = 3.1%

Most variances are +1 to -1

Most extreme variances within CARTO:

3610697     CARTO 0 specified = 6 total; SAME as in in SODA
3607913     SODA 4 motorist = 4 total, in CARTO 8 motorist = 4 total
3608213     SODA 4 motorist = 4 total, in CARTO 8 motorist = 4 total
3608375     SODA 6 motorist = 6 total, in CARTO 12 motorist = 6 total

SODA dataset, internal variances overview

Total variance:

3965 summed versus 3909 injured = -56 = 1.4%

Most variances -1 (sum 1 higher than injured persons)

Most extreme variances in SODA:

3610697     0 specified = 6 total
3619210     2 motorist = 0 total

Summary

Again, the per-vehicle injury subcounts are loaded incorrectly, particularly the motorist field. Unlike 2016-08 the grand total variances are smaller overall, but the same phenomenon is present.

gregallensworth commented 6 years ago

Q4 2017

The next question of the ETL bugs noted above, is whether this issue is ongoing, or previously solved. This affects whether we may just patch the existing data, or must alter the ETL script to fix a longer-standing issue.

In CARTO a query was done over the 3-month period of Q4 2017, to show the most extreme internal variances. If the ETL issues have been corrected, then only several variances should exist and in the range of +1 and -1, with -18 (a 6 becoming 24) not happening.

SELECT
    socrata_id,
    date_val,
    sum_of_persons_injured,
    number_of_persons_injured,
    ABS(sum_of_persons_injured - number_of_persons_injured) AS variance
FROM (
    SELECT
        socrata_id,
        date_val::date,
        number_of_persons_injured,
        number_of_cyclist_injured + number_of_motorist_injured + number_of_pedestrian_injured AS sum_of_persons_injured,
        number_of_cyclist_injured,
        number_of_motorist_injured,
        number_of_pedestrian_injured
    FROM crashes_all_prod
    WHERE
    socrata_id IS NOT NULL
    AND date_val::date >= '2017-10-01' AND date_val::date < '2017-12-01'
    ORDER BY socrata_id
) monthcrashes
WHERE sum_of_persons_injured > 0 OR number_of_persons_injured > 0
ORDER BY variance DESC

Within Q4 2017, only 8 non-0 variances were found which is a great improvement compared to these months previously investigated. And, all of these variances are confirmed to also be present in SODA output. indicating that the ETL is doing what it should with the data given.

3801293         3 VS 1 total    confirmed in SODA
3767170         6 VS 3 total    confirmed in SODA
3761686         6 VS 4 total    confirmed in SODA
3767863         2 VS 0 total    confirmed in SODA
3777714         0 VS 1 total    confirmed in SODA
3794374         2 VS 1 total    confirmed in SODA
3790959         0 VS 1 total    confirmed in SODA
3764710         1 VS 0 total    confirmed in SODA

Going back as far as March 2017, the data all look fine. It is February and January 2017 which start to exhibit wider internal variances, and going further back this appears consistent enough to warrant addressing the issue prior to 2017-06.

Next course of action, to correct it. Create a script to:

gregallensworth commented 6 years ago

Much better.

image

At this point we would re-run the variance processes which Dan did originally at the start of this Issue, and see if they fall within acceptable ranges, then evaluate whether further work is necessary to capture a few missing crashes.

gregallensworth commented 6 years ago

Doing it over, eye on fatalities

Dan did some analyses and seems to be seeing variances in fatality counts, which weren't the focus of the prior updates. Specifically, the date range of 2017-03 through 2017-12 seems to be tallying up crashes properly and injuries properly, but it showing lower fatality counts than would be expected from the SODA data.

As before, SODA is showing some small internal variances between the persons killed field (TOTAL below) and the sum of peds, motorists, & cyclists killed (KSUM below). The CARTO and SODA tallies are given below.

MONTH     TOTAL  KSUM     CARTO
2017-03    17      17        12
2017-04    23      23        17
2017-05    12      12        10
2017-06    21      21        16
2017-07    32      32        23
2017-08    22      23        14
2017-09    26      26        20
2017-10    25      27        17
2017-11    24      25        16
2017-12    16      16        13

Exported data for analysis

One month at a time, with the "number of persons killed" compared to the "killed, by vehicles type" subcounts. As expected, only minor internal discrepancies exist e.g. a sum of 26 compared to 27 persons.

SODA-AllCrashes-201710.xlsx SODA-AllCrashes-201711.xlsx SODA-AllCrashes-201712.xlsx

CARTO-AllCrashes-201703.xlsx CARTO-AllCrashes-201704.xlsx CARTO-AllCrashes-201705.xlsx CARTO-AllCrashes-201706.xlsx CARTO-AllCrashes-201707.xlsx CARTO-AllCrashes-201708.xlsx CARTO-AllCrashes-201709.xlsx CARTO-AllCrashes-201710.xlsx CARTO-AllCrashes-201711.xlsx CARTO-AllCrashes-201712.xlsx

SODA-AllCrashes-201703.xlsx SODA-AllCrashes-201704.xlsx SODA-AllCrashes-201705.xlsx SODA-AllCrashes-201706.xlsx SODA-AllCrashes-201707.xlsx SODA-AllCrashes-201708.xlsx SODA-AllCrashes-201709.xlsx

Quick Overview Analysis

As always, note that these are raw counts and not aggregated by borough, so counts here may not match Dan's tallies. The two sites' counts for a given month should however, match very closely, notwithstanding the rare fatality accident that was not logged, or a possible injury which "converted" into a fatality at SODA after it was logged at CARTO.

The internal variance is so small, azs to not be notable. In both CARTO and SODA, the internal variance is usually 0 and sometimes 1, over the whole month. As such, what we're seeing here is not due to the incorrect summing of sub-counts as was seen with injuries.

As such, let's focus solely on the Persons Killed field. This is a count of fatalities and fatal crashes by month, in the two sources. Both are based solely on the Persons Killed value.

                SODA            CARTO
2017-03         17f / 17c       12f / 12c
2017-04         23f / 22c       17f / 17c
2017-05         12f / 12c       10f / 10c
2017-06         21f / 20c       16f / 15c
2017-07         32f / 31c       23f / 22c
2017-08         23f / 23c       13f / 13c
2017-09         26f / 24c       20f / 18c
2017-10         24f / 23c       17f / 10c    incl 8-fatality crash?
2017-11         24f / 23c       16f / 15c
2017-12         16f / 16c       13f / 13c

This does confirm that a significant difference exists in the Persons Killed counts between the two datasets. Fortunately, a dataset of 200 crashes in each of the two datasets should be reasonably easy to examine and compare, without resorting to heroic measures, in order to characterize these differences.

A possible cause of variance here, could be the "conversion" of an injury to a fatality, potentially days after the record was scraped from SODA into CARTO. If a record is scraped into CARTO with 3 injuries and 0 fatalities, and then is later changed in SODA to 2 injuries and 1 fatality, the record at CARTO would not be updated. If this were to happen, a tiny and barely-noticeable negative variance in injuries would result, but a significantly more noticeable positive variance in fatalities would result.

gregallensworth commented 6 years ago

Fatality counts: Specific crash-by-crash comparisons

Fatality-related crashes make up a very small portion of all crashes. Within a month, the variance in the list of fatality-related crashes between the two data sources tends to be 2-5 crashes. This lends itself wel to visual crash-by-crash comparison.

SODA data were pulled today (and thus being the latest/current data), and compared to the CARTO copy of the crash record (potentially several months old, having been the newest data at the time). The goal here is to determine whether the crash records were scraped improperly into CARTO, or perhaps altered later in SODA with the latter having a predicted effect of -1 injury and +1 fatality as described above.

201703

SODA shows 5 crashes with 1 fatality apiece, accounting for the variance of 17 fatalities (SODA) vs 5 (CARTO).

             SODA      CARTO
3624204      2i/1f     3i/0f   1 injury-to-fatality conversion
3626043      0i/1f     1i/0f   1 injury-to-fatality conversion
3628393      0i/1f     1i/0f   1 injury-to-fatality conversion
3630562      0i/1f     1i/0f   1 injury-to-fatality conversion
3635027      0i/1f     1i/0f   1 injury-to-fatality conversion

201704

SODA shows 23 fatalities over 22 crashes (1 apiece, except 3652675 which had 2), and CARTO shows 17 fatalities over 17 crashes.

           SODA      CARTO
3645566    0i/1f     1i/0f     1 inj-to-fat conversion
3649309    5i/1f     6i/0f     1 inj-to-fat conversion
3649733    0i/1f     1i/0f     1 inj-to-fat conversion
3655558    3i/1f     4i/0f     1 inj-to-fat conversion
3655948    0i/1f     1i/0f     1 inj-to-fat conversion

Results

Within these first 2 months sampled of the 10 months being examined, 100% of the fatality variances are due to SODA content having been changed after it was loaded, apparently due to an injury later proving to be fatal.

danrademacher commented 6 years ago

Hmm, here's an interesting discovery: https://dev.socrata.com/docs/system-fields.html

SODA has hidden system fields! Including updated_at, which allow queries like this: https://data.cityofnewyork.us/resource/qiz3-axqb.json?$where=:updated_at > '2018-01-20'

But they also include this caution:

Data providers use many different methods to update datasets. In some cases, they use tools like DataSync or the Publisher API to update datasets, and we can tell which records within the dataset have actually been modified, and only update them accordingly. When data providers perform a full replace of the dataset using the Publisher API, all of its records will be updated within short period of time, in which case a query based on :updated_at will show that all of the records have changed.

If NYC's updated_at dates are reasonable, this might be a much better approach to data syncing.

They also have created_at, which I admit is a bit kerflummoxing if it works, since it means all this messing about with crash dates was a red herring. OTOH, if these previously unknown fields solved our problems, then that would be a fine thing.

gregallensworth commented 6 years ago

A quick checkup on the injury-to-fatality conversions. This hypothesis is further corroborated upon examination of the update_at field, indicating that the record was indeed modified over 3 months after the fact as well as having been entered over two months after the fact.

Crash # 3649309
Date: 2017-02-03

Logged as 6 injury & 0 fatality
Now reads 5 injury & 1 fatality

Created at: 2017-04-14
Updated At: 2017-05-11

This confirms a few theories and also gives some new information:

gregallensworth commented 6 years ago

I have opened issue #13 to discuss the possibility of a new ETL job or offline task, to rectify altered records.

I have reopened #6 to discuss that the "one month window" may be insufficient to catch all new crashes.

danrademacher commented 6 years ago

and I verified: image

https://docs.google.com/spreadsheets/d/1Xxe4FlnbOMOTlC3iAdZz4oliEX0a9Scu0X8JJ49Pbpc/edit#gid=248679534