Data vetting and checkup: non-uniquesocrata_id and missing socrata_id

gregallensworth commented 6 years ago

As long as I've been put in charge of crossing the Ts and dotting the Is, I checked up that the socrata_id field is in fact unique. However, it appears not to be.

At this point, the socrata_id is not required to be unique, only to be present. It is used only by the ETL script (#6) to determine that a SODA record has already been seen, and not to import it. Under these conditions, the non-unique or null values are not problematic. But they should be investigated to determine whether there exists some deeper issue in ETL, or just a QC issue during initial data loading which is not likely to recur.

Query:

SELECT * FROM (
  SELECT socrata_id, COUNT(*) AS howmany FROM crashes_all_prod GROUP BY socrata_id
  ) chex
WHERE howmany > 1
ORDER BY howmany DESC

Results:

105076 records with a null value in socrata_id

398 socrata_ids which are associated to 2 records.

No records with a non-null socrata_id had more than 2 such instances. (repeats existed, but no double repeats, triple repeats, ...)

gregallensworth commented 6 years ago

The list of socrata_id values which are not unique, is attached. (too big to paste)

nonunique socrata_id.xlsx

gregallensworth commented 6 years ago

Duplicates

A random set of 5 of them:

3621708

SELECT * FROM crashes_all_prod WHERE socrata_id = 3621708

Resulting carto_db IDs: 1972989 and 1713000

Records had same date, street names, injury + fatality counts, precinct, burough, neighborhood.

Lower-numbered record contributing_factor field had different quoting: it included "quotes" around "Turning Improperly".

Lower-numbered record vehicle_type is the contributing_factor string, complete with weird quoting.

Data import error?

3622158

SELECT * FROM crashes_all_prod WHERE socrata_id = 3622158

Resulting carto_db IDs: 1973263 and 1713264

Same as above: records are identical except for lower-numbered record quoting on contributing_factor and the vehicle_type being a contributing_factor string.

3622166

SELECT * FROM crashes_all_prod WHERE socrata_id = 3622166

Resulting carto_db IDs: 1973157 and 1713163

Same as above: records are identical except for lower-numbered record quoting on contributing_factor and the vehicle_type being a contributing_factor string.

3622496

SELECT * FROM crashes_all_prod WHERE socrata_id = 3622496

Resulting carto_db IDs: 1973109 and 1713114

Same as above: records are identical except for vehicle_type being a contributing_factor string. (contributing_factor didn't have any spaces in any values, so quoting not applicable)

3622911

SELECT * FROM crashes_all_prod WHERE socrata_id = 3622911

Resulting carto_db IDs: 1973046 and 1713055

Same as above: records are identical except for lower-numbered record quoting on contributing_factor and the vehicle_type being a contributing_factor string.

Conclusion

These duplicates appear to be accidents during some early runs of an ETL process. The vehicle_type field was loading incorrectly, and quoting behavior had changed in regards to the contributing_factor field.

These do count as crashes during a date period and at a geographic location, so are effectively double-counting these crashes. This is 398 crashes out of 1.2 million, far less inaccuracy than would be expected from simply crash participants not reporting, so this one factor may not be worth solving.
If we do want to solve it, I would need to put in some effort into figuring out which record of each pair should be kept. a) I presume that the one with the bogus vehicle_type is the worse-quality record? b) That's only a minute or two apiece to paste into CARTO and inspect them, but there are 398 of them.
The inclusion of invalid values and varying values in the vehicle_type field may have implications for categorizations or filtering? I am seeing all sorts of values in there: contractions and letter codes, contributing factor text, half-spelled words, ... If the vehicle_type is not in fact used for anything, then perhaps this could be ignored? If we do want to pursue this, let's file a new issue.

gregallensworth commented 6 years ago

Per issue #12 I have been asked to go ahead and clean these up. The 398 crashes and 85 injuries represented in these records, are all concentrated on 2017-02-26 and therefore throw off the statistics significantly for this month.

This query will list the duplicate records, and the sorting effectively "pairs" the two copies next to each other for easier review. Go through the pairs, and figure out if we can easily delete 398 of these 796 records... or need a more complicated process to cherry-pick the best data from each.

SELECT * FROM crashes_all_prod WHERE socrata_id IN (
    SELECT socrata_id FROM (
      SELECT socrata_id, COUNT(*) AS howmany FROM crashes_all_prod GROUP BY socrata_id
      ) chex
    WHERE howmany > 1
) ORDER BY socrata_id, cartodb_id

This one gives fewer fields, for easier browsing:

SELECT
socrata_id, cartodb_id, the_geom, neighborhood, contributing_factor, vehicle_type
FROM crashes_all_prod WHERE socrata_id IN (
    SELECT socrata_id FROM (
      SELECT socrata_id, COUNT(*) AS howmany FROM crashes_all_prod GROUP BY socrata_id
      ) chex
    WHERE howmany > 1
) ORDER BY socrata_id, cartodb_id

gregallensworth commented 6 years ago

Going through these visually as pairs, it's going swiftly, and does corroborate the suspicion that there were changes happening to the ETL script which were loading records incorrectly, followed by loading the correctly later... but without clearing out the bad records first.

The pairs follow the same pattern:

A pair of records, with one in the 1713000 and the other in the 1973000 range
The 1713000 one has weird quotes on the contributing_factor field and also, has this same field value in the vehicle_type field
The 1973000 one has vehicle data in the vehicle field, and proper quoting on the contributing field
Thus, the lower-numbered ones have been the ones to drop and the higher-numbered have been the ones to keep

gregallensworth commented 6 years ago

Duplicate socrata__id records have been cleaned up, by manually examining and deleting, as described above. This should improve the variance for 2017-02 somewhat.

gregallensworth commented 6 years ago

Null Socrata IDs

I examined the null records. They are all within the date range of 2011-08-01 through 2012-06-01

It seems likely that their socrata_id field is null as they were not from SODA at all, but some other API, spreadsheet, geodatabase, ... Chris had mentioned that older records were loaded through different means than the more modern data.

If SODA has data for this date range at all, expect significant variances from what is loaded into CARTO.

gregallensworth commented 6 years ago

Closing. These two anomalies have been corrected, or else explained satisfactorily.

GreenInfo-Network / nyc-crash-mapper-etl-script