New Vehicle Type crosswalk and allocations

danrademacher commented 4 years ago

Warning that this is not simple! But I hope I've lined it up in a way that's reasonably well thought out.

This issue is the first step in a series of changes we'll be making to allow for filtering and inspection of crash data based on "vehicle type". This issue is confined to the ETL needs. We'll have separate issues for the actual filter and display requirements.

The data as given from NYC open data (as shown here on their Socrata site) is very messy, including almost a thousand variants, including a lot of misspellings and partials. The API source data includes up to 5 "vehicle type" fields per crash, to account for multivehicle crashes, though CHEKPEDS has reviewed the data and 99%+ crashes are accounted for by the first three

We currently store the data in a JSON array in a single field, vehicle_type. The goal here is to store cleaned up domain values for vehicle type and also do some new allocation of injuries and fatalities to account for some special cases, like bikes. This issue describes this as fully as I can, and then we can see if anything is missing:

From 900+ variants to 8 values.

First, we need to translate those messy values into a set of canonical values:

E-BIKE-SCOOT
MOTORCYCLE-MOPED
SUV
CAR
OTHER
TRUCK
BICYCLE
BUS-VAN

Using this crosswalk provided by CHEKPEDS: https://chekpeds.carto.com/tables/vehicletype_crosswalk_prod/table

Where we look up nyc_vehicletype and write into a new field the matching value from crashmapper_vehicletype, or fall back to OTHER if nothing is found. The goal is to keep this crosswalk as a table in CARTO so that as new weird values turn up in the source data, we can update the crosswalk using CARTO's existing web UI and then translate those new values into one of our eight domain values. We really need only distinct types here. We won't use crashmapper_vehicletype to count the number of total vehicles involved, so a CAR,CAR,CAR would be the same as CAR

(Note that I assume any changes to the 8 domain values would require code changes in filtering elements we implement in future tasks in other repos.)

UI will be a separate task, but this first step should enable a filter like this:

Pending the next section of this issue, I'm pausing on a determination whether the translated vehicle types should be stored as an array or in some other fashion. I do think we should retain the existing vehicle_type array as is, to make this translation nondestructive.

Portioning out the injured and the killed

The above translation would be sufficient to allow for basic filtering of crashes by involved vehicles -- "Show me crashes involving an SUV or CAR or TRUCK" like the push button Crash Type filters we have on the current application.

There's a further wrinkle in that we'll want to also know what proportion of injuries or fatalities were attributable to each vehicle type, but without double or triple-counting. If we simply add up all fatalities, say, where CAR and SUV are involved, a crash of an SUV and car with a single fatality could create 1+1, when what we want to do is say that they were each equally responsible for a single fatality, so 0.5+0.5. With the further challenge that a BICYCLE is the cause of a fatality only when no other vehicles are involved.

So I think what we would need to do here is:

get the count of the number of distinct crashmapper_vehicletypes involved
if that's >1, check for bike, if bike=true, then subtract 1 from total vehicle types
divide the number of fatalities and injuries by the number from 2.

But then it gets even more gnarly, in that we also need to retain whether the injured or killed were walking, biking, or driving.

Though this makes for a large schema, seems like a clear option would be to create a set of 48 (!) fields that directly allocate to the 8 canonical types:

e-bike-scoot boolean?
- e-bike-scoot_pedInjury
- e-bike-scoot_bikeInjury
- e-bike-scoot_motoristInjury
- e-bike-scoot_pedFatality
- e-bike-scoot_bikeFatality
- e-bike-scoot_motoristFatality

... and so on for all 8 crashmapper_vehicletypes

Then we could always query the same fields for the same mode-related injury/death data, and rows without that data simply wouldn't contribute to the total.

Visualizing that data will be a challenge, but first we have to get the data (and I do have a plan for visualization within current application structure).

I suspect we should work with a small subset of data, like the past month or two, before we try to run this against the whole dataset, which is going to its own challenge.

gregallensworth commented 4 years ago

A little early R&D of some bits and pieces.

-- creation of the target boolean columns

ALTER TABLE crashes_all_prod ADD COLUMN hasvehicle_scooter    BOOLEAN NOT NULL DEFAULT false;
ALTER TABLE crashes_all_prod ADD COLUMN hasvehicle_suv        BOOLEAN NOT NULL DEFAULT false;
ALTER TABLE crashes_all_prod ADD COLUMN hasvehicle_car        BOOLEAN NOT NULL DEFAULT false;
ALTER TABLE crashes_all_prod ADD COLUMN hasvehicle_other      BOOLEAN NOT NULL DEFAULT false;
ALTER TABLE crashes_all_prod ADD COLUMN hasvehicle_truck      BOOLEAN NOT NULL DEFAULT false;
ALTER TABLE crashes_all_prod ADD COLUMN hasvehicle_motorcycle BOOLEAN NOT NULL DEFAULT false;
ALTER TABLE crashes_all_prod ADD COLUMN hasvehicle_bicycle    BOOLEAN NOT NULL DEFAULT false;
ALTER TABLE crashes_all_prod ADD COLUMN hasvehicle_busvan     BOOLEAN NOT NULL DEFAULT false;

-- aggregate all NYC vehicle names fitting a given CrashMapper vehicle name
-- and find all crashes where the vehicle_type[] contains any of those values
-- running time is under 2 seconds

SELECT * FROM crashes_all_prod
WHERE
vehicle_type && (SELECT ARRAY_AGG(nyc_vehicletype) FROM vehicletype_crosswalk_prod WHERE crashmapper_vehicletype = 'MOTORCYCLE-MOPED')

-- that array aggregation could be turned into an UPDATE query, like this...

UPDATE crashes_all_prod
SET hasvehicle_scooter=true
WHERE
vehicle_type && (SELECT ARRAY_AGG(nyc_vehicletype) FROM vehicletype_crosswalk_prod WHERE crashmapper_vehicletype = 'E-BIKE-SCOOT')

--- then read back the results

SELECT hasvehicle_scooter, COUNT(*) FROM crashes_all_prod GROUP BY hasvehicle_scooter

gregallensworth commented 4 years ago

The full set of updates to define the boolean fields, would be as follows.

Problem is, CARTO is using a 5 second query timeout. The first one for hasvehicle_scooter runs in 3 seconds, while the hasvehicle_suv query times out at 5 seconds. I see in comments that we previously had a timeout of 15 seconds but Dan says that there was some confusion in January about chekpeds' grant status, same as each year...

UPDATE crashes_all_prod
SET hasvehicle_scooter=true
WHERE vehicle_type && (SELECT ARRAY_AGG(nyc_vehicletype) FROM vehicletype_crosswalk_prod WHERE crashmapper_vehicletype = 'E-BIKE-SCOOT')

UPDATE crashes_all_prod
SET hasvehicle_suv=true
WHERE vehicle_type && (SELECT ARRAY_AGG(nyc_vehicletype) FROM vehicletype_crosswalk_prod WHERE crashmapper_vehicletype = 'SUV')

UPDATE crashes_all_prod
SET hasvehicle_car=true
WHERE vehicle_type && (SELECT ARRAY_AGG(nyc_vehicletype) FROM vehicletype_crosswalk_prod WHERE crashmapper_vehicletype = 'CAR')

UPDATE crashes_all_prod
SET hasvehicle_other=true
WHERE vehicle_type && (SELECT ARRAY_AGG(nyc_vehicletype) FROM vehicletype_crosswalk_prod WHERE crashmapper_vehicletype = 'OTHER')

UPDATE crashes_all_prod
SET hasvehicle_truck=true
WHERE vehicle_type && (SELECT ARRAY_AGG(nyc_vehicletype) FROM vehicletype_crosswalk_prod WHERE crashmapper_vehicletype = 'TRUCK')

UPDATE crashes_all_prod
SET hasvehicle_motorcycle=true
WHERE vehicle_type && (SELECT ARRAY_AGG(nyc_vehicletype) FROM vehicletype_crosswalk_prod WHERE crashmapper_vehicletype = 'MOTORCYCLE-MOPED')

UPDATE crashes_all_prod
SET hasvehicle_bicycle=true
WHERE vehicle_type && (SELECT ARRAY_AGG(nyc_vehicletype) FROM vehicletype_crosswalk_prod WHERE crashmapper_vehicletype = 'BICYCLE')

UPDATE crashes_all_prod
SET hasvehicle_busvan=true
WHERE vehicle_type && (SELECT ARRAY_AGG(nyc_vehicletype) FROM vehicletype_crosswalk_prod WHERE crashmapper_vehicletype = 'BUS-VAN')

gregallensworth commented 4 years ago

Confirmed that the API used by the ETL script also has a 5-second timeout. Also confirmed in my code comments, that the timeout was previously 15 seconds. This doesn't give me a lot of confidence in the data scripts continuing to work as the data volume increases at the end of January (New Years saw a significant lowdown in the volume of new crash data at Socrata).

To make this work, the new update_hasvehicle() functions in the ETL script break up each of the 8 vehicle types into 100 slices so as to make the queries run in under 5 seconds. (and on their first run when all records are false and are examined, the runs are 4+ seconds!)

However, the Batch Query API seems better suited to these later updates, of borough and precinct and neighborhood, of has-car and has-SUV, and so on. These do not need to be completed in any specific sequence, and delays of minutes would be acceptable. The hasvehicle_truck query runs in about 30-45 seconds for example, a lot faster than 100 3-second pieces. As such, I think it best to refactor these pieces of the ETL script to use the Batch Query API.

gregallensworth commented 4 years ago

All set:

The crashes_prod_all table now has these 8 new columns:

hasvehicle_bicycle
hasvehicle_busvan
hasvehicle_car
hasvehicle_motorcycle
hasvehicle_other
hasvehicle_scooter
hasvehicle_suv
hasvehicle_truck

The ETL script has been updated to update these.

The ETL script has also been refactored to use CARTO's Batch Query API, so the longer-running asynchronous updates such as neighborhood and hasvehicle, no longer need to be broken into several hundred small steps to run at 5 seconds apiece.

danrademacher commented 4 years ago

As noted in https://github.com/GreenInfo-Network/nyc-crash-mapper/issues/104, some 23k crashes don’t fit any of these filters.

I think best way to resolve this would be for hasvehicle_other to be TRUE if either there’s a Match with values explicitly labeled OTHER or no match at all. That way Other becomes a catch all for anything uncategorized.

gregallensworth commented 4 years ago

Noted that there are 23,041 crashes which have all 8 vehicle types set to false. These records would implicitly be left out of any vehicle filter, even one with all 8 vehicle types selected.

SELECT COUNT(*)
FROM crashes_all_prod
WHERE hasvehicle_car is false
AND hasvehicle_truck is false
AND hasvehicle_motorcycle is false
AND hasvehicle_bicycle is false
AND hasvehicle_suv is false
AND hasvehicle_busvan is false
AND hasvehicle_scooter is false
AND hasvehicle_other is false
AND vehicle_type::text != '{}'

In virtually all of these cases, the vehicle_type really is empty. So it's correct that this fits none.

In about 700 cases, the data are badly formatted, and multiple vehicle types appear in a single field. That is, instead of listing TAXI and BICYCLE in field 1 and field 2, the single value "TAXI,BICYCLE" is in the data, which of course does not match these domain values.

-- an array of two vehicle types
{"FIRE TRUCK","PASSENGER VEHICLE"}

-- one field contained this "vehicle type" which happens to contain a comma
{"FIRE TRUCK,PASSENGER VEHICLE"}

Handling this, could mean further patches to the ETL script, likely in format_string_for_postgres_array() where the vehicle_type_code field is handled and formatted. Of course, this presumes that there aren't any real values that should contain commas such that splitting would be troublesome, e.g. "Bicycle, electric" would have a very different meaning if split into two separate vehicle types.

gregallensworth commented 4 years ago

Future Incoming Comma-Joined Vehicle Types

[x] Adjust ETL script so as to split on comma so "TAXI,TAXI" would become an array of two taxis. Thus, forthcoming records with multiple types in one field will be saved properly and tagged accordingly.

Existing Comma-Joined Vehicle Types

[ ] Find the set of all existing vehicle_type strings which have these broken entries, and see about rewriting their vehicle_type field to correct the data, and then re-tag them with their corrected vehicle_type content.

Something like this should work to find them:

SELECT
DISTINCT vehicle_type::text
FROM crashes_all_prod
WHERE hasvehicle_car is false
AND hasvehicle_truck is false
AND hasvehicle_motorcycle is false
AND hasvehicle_bicycle is false
AND hasvehicle_suv is false
AND hasvehicle_busvan is false
AND hasvehicle_scooter is false
AND hasvehicle_other is false
AND vehicle_type::text != '{}'

Then like this to fix them:

UPDATE crashes_all_prod SET
vehicle_type='{"BICYCLE","PASSENGER VEHICLE"}',
hasvehicle_car=NULL, hasvehicle_truck=NULL, hasvehicle_motorcycle=NULL, hasvehicle_bicycle=NULL, hasvehicle_suv=NULL, hasvehicle_busvan=NULL, hasvehicle_scooter=NULL, hasvehicle_other=NULL
WHERE vehicle_type::text = '{"BICYCLE,PASSENGER VEHICLE"}';

gregallensworth commented 4 years ago

Cleanup done.

Only 4 records have no for all 8 vehicle types, but do have vehicle_type content. They have some really wonky data values so it's unsurprising that they don't match. Those would be candidates for a new vehicle-type review tool per #23

The other 22,400 records truly have no vehicle_type content, so their all-no is correct.

gregallensworth commented 4 years ago

Per discussion: These 22,500 records with nothing specified, should probably not tag them as hasvehicle_other since that is a stated category for other types of vehicles (ice cream trucks, a wheelbarrow, whatever) and is a different statement than "no vehicle types specified"

However, in https://github.com/GreenInfo-Network/nyc-crash-mapper-chart-view/issues/108 the button's behavior could be changed to "Other / Unspecified" which would check hasvehicle_other = true OR no to the other 7 This would effectively "fill in the blank" of unspecified vehicles, but would leave the underlying data intact for later evaluation or treatment.

GreenInfo-Network / nyc-crash-mapper-etl-script