Closed gregallensworth closed 6 years ago
Use a query like this, then export as CSV:
SELECT
socrata_id,
date_val::date,
number_of_persons_injured,
number_of_cyclist_injured,
number_of_motorist_injured,
number_of_pedestrian_injured,
number_of_persons_killed,
number_of_cyclist_killed,
number_of_motorist_killed,
number_of_pedestrian_killed,
borough
FROM crashes_all_prod
WHERE
socrata_id IS NOT NULL
AND date_val::date >= '2017-02-01' AND date_val::date < '2017-03-01'
ORDER BY socrata_id
Use an URL such as this
https://data.cityofnewyork.us/resource/qiz3-axqb.csv?$where=date%3E=%272017-02-01%27%20AND%20date%3C%272017-03-01%27&$order=unique_key%20ASC&$limit=50000
The two CSVs were saved as XLSX worksheets, then the fields resequenced so the two spreadsheets could be compared side by side.
Quick Stats comparison for 201702
NOTE that my counts here are not aggregating by borough, nor at this time filtering by borough being non-blank. As such, these sums will differ from the borough-aggregates listed in Dan's worksheet.
CARTO 201702.xlsx = 16077 crash records, 3971 total injuries (Persons Injured field)
SODA 201702.xlsx = 15701 crash records, 3907 total injuries (Persons Injured field)
Diff: CARTO has 376 more crashes (2% more) and 64 more injuries (2% more)
NOTE that the attached file includes data for 2017-03-01, and this should not be included in the comparisons.
Attached below is a script I wrote to compare the crash lists, and its output.
The CARTO listing for 2017-02 has 398 duplicated crashes, adding up to 85 injuries and 0 fatalities.
The CARTO listing contains 569 crashes which are not present in SODA, adding up to 139 injuries and 0 fatalities.
The noted variance is 202 injuries and 945 crashes. The variances seen above would account for 967 crashes, 224 injuries, and 0 fatalities. NOTE that this quick sum may not add up exactly, as some of the duplicates may be present in SODA or absent from SODA, affecting the final outcome of resolving these duplicates. Relevant point is that of the 5% error we're seeing, about 4 points may be accounted by these two sources noted above.
Crashes_MissingFromSODA.xlsx DuplicateCrashesInCARTO.xlsx
Also noted are 22 crashes present in SODA which were not picked up into CARTO. This can be be due to the data appearing too late after the ETL script run; it only looks back 1 month, data backlogged 3 months later would be missed, or the record being otherwise malformed. The variance here would be -22 crashes and -5 injuries and thus is not significant compared to the sources noted above.
It should be noted that there exist 105076 records where the socrata_id field is blank. That's about 8% of all records. These are all in the date range of 2011 and early 2012 (see issue #9) so are not relevant to the variances being examined here. But it is worth noting that if SODA has data from this era, it could differ significantly from what was loaded, and there's little-to-nothing that could be done about it.
Seems like a process to dedupe the content in CARTO would be reasonable. Even if that’s only a partial resolution, not counting things twice or more is just a good idea
Duplicate crashes cleaned up
Per issue #9 the duplicated socrata_id records were visually examined and one record from each pair was deleted. These were wholly within 2017-02-26, and were to the tune of 85 injuries over 400 crashes: about 10% of the crash variance and 40% of injury variance
As such, their cleanup should significantly improve the variance for 2017-02, which was the second-most-variant month noted.
New Download and Quick Stats
A new export of CARTO for 2017-02 is attached. 2017-02 is now well within acceptable variances, of well under 1%.
CARTO: 15679 crashes, 3886 injuries, 12 fatalities
SODA: 15701 crashes, 3907 injuries, 13 fatalities
0.2% variance in crash count
0.5% variance in injury count
SODA-201702.xlsx CARTO-201702.xlsx
Flaw in Previous Notes
I had mentioned in a previous note, records missing from one API but present in the other. This was an error in my export file I had accidentally included 2017-03-01 in my download from CARTO, which included some 400 crashes which were of course absent from the SODA export.
With the above export, there are:
Virtually all of the variance in 2016-02 was explained and solved by the duplicate records. However, those duplicates were dated entirely within 2017-02-26, so are not causes of variance elsewhere. Others months need separate investigation.
Export files were generated for date >= 2016-12-01 AND date < 2017-01-01
and are attached.
CARTO-201612.xlsx SODA-201612.xlsx
Quick stats are:
CARTO: 19235 crashes, 4884 injuries, 24 fatalities
SODA: 19236 crashes, 4885 injuries, 24 fatalities
The difference here is 1 single crash (with 1 single injury) present in SODA but not in CARTO, likely due to the usual "backlogged too late" effect otherwise noted.
In this case, I suggest that the variance noted in Dan's analysis, is due to a flaw in methodology. For example the Socrata Data API being used by the ETL may not be the same as the pre-digested spreadsheets which Dan was using, or aggregation by borough may not have been done accurately. (the ETL uses polygon containment, which gets a lot more hits than SODA's own field)
The most-variant month listed in Dan's analysis is 2016-08, with CARTO showing 95% more injuries than derived from the Socrata-provided spreadsheets. Export files were generated from SODA and CARTO, and are attached. Quick stats are:
CARTO: 19690 crashes, 5510 injuries, 28 fatalities
SODA: 19690 crashes, 5509 injuries, 29 fatalities
SODA-201608.xlsx CARTO-201608.xlsx
This is a variation of 1 injury and 1 fatality, out of the agreed-upon 19690 crash records. This would be considered well within acceptable variances.
I have no infrmation as to how Dan got nearly double the injuries in his analysis. The simplest explanation would be a typographical error when making the query, likely in the date range?
I then re-ran this using a CARTO query such as Dan was using, to aggregate by year-month and also by borough.
SELECT * FROM (
SELECT
year || '-' || LPAD(month::text, 2, '0') as year_month,
borough,
COUNT(c.cartodb_id) as total_crashes,
SUM(number_of_persons_injured) AS number_of_persons_injured,
SUM(number_of_persons_killed) AS number_of_persons_killed
FROM crashes_all_prod c
GROUP BY year_month, borough
ORDER BY year_month asc, borough asc
) boroughbymonth
WHERE year_month = '2016-08'
Results are:
borough number_of_persons_injured number_of_persons_killed total_crashes
none 1545 11 5276
Bronx 705 2 2184
Brooklyn 1267 4 4140
Manhattan 587 3 3094
Queens 1235 7 4209
Staten Island 171 1 788
SUM 5510 28 19691
This gives almost precisely the same results. The variation is 1 crash from the other CARTO query (using the date object instead of year-month) and a variance from SODA of 1 crash and 1 fatality.
This indicates that variance caused by the date filtering and methodology differences are trivial, and corroborates that the 11,318 tally in the original analysis was erroneous due to other methodology.
{
"total_crashes": 19691,
"cyclist_injured": 1234,
"cyclist_killed": 3,
"motorist_injured": 8467,
"motorist_killed": 7,
"pedestrian_injured": 1617,
"pedestrian_killed": 17,
"persons_injured": 11318,
"persons_killed": 27,
"year_month": "2016-08"
},
The above two comments suggest that the spikes I was seeing in my analysis is present in the application but not in the data.
I'm puzzled for sure.
Another round of checking using queries in this sheet suggests that the root of the problem is that the subtotals for pedestrians_injured
,cyclists_injured
, and motorists_injured
in CARTO are not correct, even though persons_injured
is very close.
Mode | Injured_NYC_by_mode | Injured_CARTO_by_mode |
---|---|---|
pedestrian | 883 | 1571 |
cyclist | 678 | 1176 |
motorist | 4244 | 8201 |
Based on un-aggregated crash data pulled with this query
Compared to crash export from https://data.cityofnewyork.us/Public-Safety/2016-08/rt6i-7uc9/data.
Here's what the map app shows:
I wonder if the best approach here would be to simply blow away and reimport the three most problematic months rather than trying to divine what went wrong when they were imported.
The most variant months are:
Year | Month |
---|---|
2016 | 8 |
2016 | 9 |
2017 | 1 |
I agree with Dan that "pedestrians_injured, cyclists_injured, and motorists_injured
in CARTO are not correct, even though persons_injured
is very close" My analyses above were solely with the persons_injured
field and not the sum of the three subtypes of injured party.
This step, is to examine both the SODA and CARTO data for 2016-08 to determine the difference between the persons_injured
field and the sum of the 3 injury fields (motorist, pedestrian, cyclist) to determine the nature of this mismatch. This is the most variant month noted.
If the data are badly mismatched in SODA, then reloading would achieve nothing except reloading records with the same bad math (in which 1 motorist + 2 bicyclists + 0 peds = 5 persons). But if the SODA data are nicely matched with acceptable internal variance, then perhaps reloading could make sense.
SODA-injurycrashes-201608.xlsx CARTO-injurycrashes-201608.xlsx data-fetch-methodology.txt
Results thus far:
persons_injured
and the sum of the 3 injury subtypes (11318 to 5510).SODA largest single-crash variances:
1 instance of 1 injury but 4 motorist (diff = 3)
1 instance of 9 injuries but 7 motorist (diff = 2)
Large majority are 0 diff or +1 to -1 diff.
CARTO internal variances:
Few records have an internal variance of 0, most are +1 but +_16 +18 +21 are noted
Some specifics:
3492924 SODA 5 motorist = 5 total, in CARTO as 15 motorist = 5 total
3492692 SODA 3 motorist = 3 total, in CARTO as 6 motorist = 3 total
3492614 SODA 2 motorists = 1 total (internal variance), in CARTO as 3 motorist (did they sum 2+1?)
3498009 SODA 7 motorists = 7 total, in CARTO as 28 motorist = 7 total
Something may have been significantly borked in the ETL script back then, in regards to loading the by-vehicle injury counts, and specifically the motorist injury counts. Reloading the data will be a real bear, at 4000 crashes per month, so I will need to do some creative invention trying to make that happen.
Other months are likely exhibit this same sort of variance, not only the 3 described at https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/issues/12#issuecomment-368392100 and should be examined for these sort of issues.
The goal here is to determine whether the variances noted in 2016-08 above are similar to those affecting other months, e.g. 2017-02.
SODA-injurycrashes-201702.xlsx CARTO-injurycrashes-201702.xlsx
CARTO dataset, internal variances overview
Total variance:
3976 summed versus 3886 persons injured = 90 = 3.1%
Most variances are +1 to -1
Most extreme variances within CARTO:
3610697 CARTO 0 specified = 6 total; SAME as in in SODA
3607913 SODA 4 motorist = 4 total, in CARTO 8 motorist = 4 total
3608213 SODA 4 motorist = 4 total, in CARTO 8 motorist = 4 total
3608375 SODA 6 motorist = 6 total, in CARTO 12 motorist = 6 total
SODA dataset, internal variances overview
Total variance:
3965 summed versus 3909 injured = -56 = 1.4%
Most variances -1 (sum 1 higher than injured persons)
Most extreme variances in SODA:
3610697 0 specified = 6 total
3619210 2 motorist = 0 total
Summary
Again, the per-vehicle injury subcounts are loaded incorrectly, particularly the motorist field. Unlike 2016-08 the grand total variances are smaller overall, but the same phenomenon is present.
The next question of the ETL bugs noted above, is whether this issue is ongoing, or previously solved. This affects whether we may just patch the existing data, or must alter the ETL script to fix a longer-standing issue.
In CARTO a query was done over the 3-month period of Q4 2017, to show the most extreme internal variances. If the ETL issues have been corrected, then only several variances should exist and in the range of +1 and -1, with -18 (a 6 becoming 24) not happening.
SELECT
socrata_id,
date_val,
sum_of_persons_injured,
number_of_persons_injured,
ABS(sum_of_persons_injured - number_of_persons_injured) AS variance
FROM (
SELECT
socrata_id,
date_val::date,
number_of_persons_injured,
number_of_cyclist_injured + number_of_motorist_injured + number_of_pedestrian_injured AS sum_of_persons_injured,
number_of_cyclist_injured,
number_of_motorist_injured,
number_of_pedestrian_injured
FROM crashes_all_prod
WHERE
socrata_id IS NOT NULL
AND date_val::date >= '2017-10-01' AND date_val::date < '2017-12-01'
ORDER BY socrata_id
) monthcrashes
WHERE sum_of_persons_injured > 0 OR number_of_persons_injured > 0
ORDER BY variance DESC
Within Q4 2017, only 8 non-0 variances were found which is a great improvement compared to these months previously investigated. And, all of these variances are confirmed to also be present in SODA output. indicating that the ETL is doing what it should with the data given.
3801293 3 VS 1 total confirmed in SODA
3767170 6 VS 3 total confirmed in SODA
3761686 6 VS 4 total confirmed in SODA
3767863 2 VS 0 total confirmed in SODA
3777714 0 VS 1 total confirmed in SODA
3794374 2 VS 1 total confirmed in SODA
3790959 0 VS 1 total confirmed in SODA
3764710 1 VS 0 total confirmed in SODA
Going back as far as March 2017, the data all look fine. It is February and January 2017 which start to exhibit wider internal variances, and going further back this appears consistent enough to warrant addressing the issue prior to 2017-06.
Next course of action, to correct it. Create a script to:
Much better.
At this point we would re-run the variance processes which Dan did originally at the start of this Issue, and see if they fall within acceptable ranges, then evaluate whether further work is necessary to capture a few missing crashes.
Dan did some analyses and seems to be seeing variances in fatality counts, which weren't the focus of the prior updates. Specifically, the date range of 2017-03 through 2017-12 seems to be tallying up crashes properly and injuries properly, but it showing lower fatality counts than would be expected from the SODA data.
As before, SODA is showing some small internal variances between the persons killed field (TOTAL below) and the sum of peds, motorists, & cyclists killed (KSUM below). The CARTO and SODA tallies are given below.
MONTH TOTAL KSUM CARTO
2017-03 17 17 12
2017-04 23 23 17
2017-05 12 12 10
2017-06 21 21 16
2017-07 32 32 23
2017-08 22 23 14
2017-09 26 26 20
2017-10 25 27 17
2017-11 24 25 16
2017-12 16 16 13
Exported data for analysis
One month at a time, with the "number of persons killed" compared to the "killed, by vehicles type" subcounts. As expected, only minor internal discrepancies exist e.g. a sum of 26 compared to 27 persons.
SODA-AllCrashes-201710.xlsx SODA-AllCrashes-201711.xlsx SODA-AllCrashes-201712.xlsx
CARTO-AllCrashes-201703.xlsx CARTO-AllCrashes-201704.xlsx CARTO-AllCrashes-201705.xlsx CARTO-AllCrashes-201706.xlsx CARTO-AllCrashes-201707.xlsx CARTO-AllCrashes-201708.xlsx CARTO-AllCrashes-201709.xlsx CARTO-AllCrashes-201710.xlsx CARTO-AllCrashes-201711.xlsx CARTO-AllCrashes-201712.xlsx
SODA-AllCrashes-201703.xlsx SODA-AllCrashes-201704.xlsx SODA-AllCrashes-201705.xlsx SODA-AllCrashes-201706.xlsx SODA-AllCrashes-201707.xlsx SODA-AllCrashes-201708.xlsx SODA-AllCrashes-201709.xlsx
Quick Overview Analysis
As always, note that these are raw counts and not aggregated by borough, so counts here may not match Dan's tallies. The two sites' counts for a given month should however, match very closely, notwithstanding the rare fatality accident that was not logged, or a possible injury which "converted" into a fatality at SODA after it was logged at CARTO.
The internal variance is so small, azs to not be notable. In both CARTO and SODA, the internal variance is usually 0 and sometimes 1, over the whole month. As such, what we're seeing here is not due to the incorrect summing of sub-counts as was seen with injuries.
As such, let's focus solely on the Persons Killed field. This is a count of fatalities and fatal crashes by month, in the two sources. Both are based solely on the Persons Killed value.
SODA CARTO
2017-03 17f / 17c 12f / 12c
2017-04 23f / 22c 17f / 17c
2017-05 12f / 12c 10f / 10c
2017-06 21f / 20c 16f / 15c
2017-07 32f / 31c 23f / 22c
2017-08 23f / 23c 13f / 13c
2017-09 26f / 24c 20f / 18c
2017-10 24f / 23c 17f / 10c incl 8-fatality crash?
2017-11 24f / 23c 16f / 15c
2017-12 16f / 16c 13f / 13c
This does confirm that a significant difference exists in the Persons Killed counts between the two datasets. Fortunately, a dataset of 200 crashes in each of the two datasets should be reasonably easy to examine and compare, without resorting to heroic measures, in order to characterize these differences.
A possible cause of variance here, could be the "conversion" of an injury to a fatality, potentially days after the record was scraped from SODA into CARTO. If a record is scraped into CARTO with 3 injuries and 0 fatalities, and then is later changed in SODA to 2 injuries and 1 fatality, the record at CARTO would not be updated. If this were to happen, a tiny and barely-noticeable negative variance in injuries would result, but a significantly more noticeable positive variance in fatalities would result.
Fatality-related crashes make up a very small portion of all crashes. Within a month, the variance in the list of fatality-related crashes between the two data sources tends to be 2-5 crashes. This lends itself wel to visual crash-by-crash comparison.
SODA data were pulled today (and thus being the latest/current data), and compared to the CARTO copy of the crash record (potentially several months old, having been the newest data at the time). The goal here is to determine whether the crash records were scraped improperly into CARTO, or perhaps altered later in SODA with the latter having a predicted effect of -1 injury and +1 fatality as described above.
201703
SODA shows 5 crashes with 1 fatality apiece, accounting for the variance of 17 fatalities (SODA) vs 5 (CARTO).
SODA CARTO
3624204 2i/1f 3i/0f 1 injury-to-fatality conversion
3626043 0i/1f 1i/0f 1 injury-to-fatality conversion
3628393 0i/1f 1i/0f 1 injury-to-fatality conversion
3630562 0i/1f 1i/0f 1 injury-to-fatality conversion
3635027 0i/1f 1i/0f 1 injury-to-fatality conversion
201704
SODA shows 23 fatalities over 22 crashes (1 apiece, except 3652675 which had 2), and CARTO shows 17 fatalities over 17 crashes.
SODA CARTO
3645566 0i/1f 1i/0f 1 inj-to-fat conversion
3649309 5i/1f 6i/0f 1 inj-to-fat conversion
3649733 0i/1f 1i/0f 1 inj-to-fat conversion
3655558 3i/1f 4i/0f 1 inj-to-fat conversion
3655948 0i/1f 1i/0f 1 inj-to-fat conversion
Results
Within these first 2 months sampled of the 10 months being examined, 100% of the fatality variances are due to SODA content having been changed after it was loaded, apparently due to an injury later proving to be fatal.
Hmm, here's an interesting discovery: https://dev.socrata.com/docs/system-fields.html
SODA has hidden system fields! Including updated_at
, which allow queries like this:
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$where=:updated_at > '2018-01-20'
But they also include this caution:
Data providers use many different methods to update datasets. In some cases, they use tools like DataSync or the Publisher API to update datasets, and we can tell which records within the dataset have actually been modified, and only update them accordingly. When data providers perform a full replace of the dataset using the Publisher API, all of its records will be updated within short period of time, in which case a query based on :updated_at will show that all of the records have changed.
If NYC's updated_at
dates are reasonable, this might be a much better approach to data syncing.
They also have created_at
, which I admit is a bit kerflummoxing if it works, since it means all this messing about with crash dates was a red herring. OTOH, if these previously unknown fields solved our problems, then that would be a fine thing.
A quick checkup on the injury-to-fatality conversions. This hypothesis is further corroborated upon examination of the update_at field, indicating that the record was indeed modified over 3 months after the fact as well as having been entered over two months after the fact.
Crash # 3649309
Date: 2017-02-03
Logged as 6 injury & 0 fatality
Now reads 5 injury & 1 fatality
Created at: 2017-04-14
Updated At: 2017-05-11
This confirms a few theories and also gives some new information:
I have opened issue #13 to discuss the possibility of a new ETL job or offline task, to rectify altered records.
I have reopened #6 to discuss that the "one month window" may be insufficient to catch all new crashes.
Dan has put together a spreadsheet investigating the crash counts from Socarata and from CARTO, and come up with some figures indicating that in many cases the injury count in CARTO (and CrashMapper) is higher than indicated by the Socrata systems.
https://docs.google.com/spreadsheets/d/14JNTMS4gRszWSVXY0kOGXkYCWWp440Las_PE40nAAT0/edit#gid=248679534
Notably the bottom which lists the most-incorrect months.