GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.
3 stars 0 forks source link

Vehicle Type apportionment #24

Closed danrademacher closed 4 years ago

danrademacher commented 4 years ago

Over in #22 we assigned booleans for the 8 canonical Vehicle Types.

Another part of that task, which is properly a separate issue, is this:

Portioning out the injured and the killed

22 is be sufficient to allow for basic filtering of crashes by involved vehicles -- "Show me crashes involving an SUV or CAR or TRUCK" like the push button Crash Type filters we have on the current application.

There's a further wrinkle in that we'll want to also know what proportion of injuries or fatalities were attributable to each vehicle type, but without double or triple-counting. If we simply add up all fatalities, say, where CAR and SUV are involved, a crash of an SUV and car with a single fatality could create 1+1, when what we want to do is say that they were each equally responsible for a single fatality, so 0.5+0.5. With the further challenge that a BICYCLE is the cause of a fatality only when no other vehicles are involved.

So I think what we would need to do here is:

  1. get the count of the number of distinct crashmapper_vehicletypes involved
  2. if that's >1, check for bike, if bike=true, then subtract 1 from total vehicle types
  3. divide the number of fatalities and injuries by the number from 2.

But then it gets even more gnarly, in that we also need to retain whether the injured or killed were walking, biking, or driving.

Though this makes for a large schema, seems like a clear option would be to create a set of 48 (!) fields that directly allocate to the 8 canonical types:

... and so on for all 8 crashmapper_vehicletypes

Then we could always query the same fields for the same mode-related injury/death data, and rows without that data simply wouldn't contribute to the total.

Visualizing that data will be a challenge, but first we have to get the data (and I do have a plan for visualization within current application structure).

danrademacher commented 4 years ago

OK, "progress." Insanely, this SQL gets is the data we think we want, just for 2019, to test in this Google Sheet. The Sheet includes some minor QA around making sure the fractional data add up to the un-split totals for peds, cyclists, motorists killed or injured. At first, they did not, by as much as 10%, and that turned out to be because a lot of crashes have no value for any vehicle type, so I went back and lumped those in with _other and then renamed those _other_unspecified, and now everything lines up precisely.

Which is different from saying this is accurate to real life. I am pretty confident it is internally consistent, which is the first hurdle. Now, off to the client for her consideration.

If this survives scrutiny, I am sure we'll have to find ways to make the SQL more sane, since I wrote it! But it seemed impossible to specify outside of iteratively testing each part of the query until it came out about right. Now we can get the allocations Christine's been looking for all with SQL, preview in Sheets change it if we need it, then figure out if that leads to changes upstream.

Note that unlike other things I have been previewing in Sheets, this is not a live call to the CARTO API. This is simply too long to encode in a URL. Again, yet another reason the query will have to change, but still I think worth having this giant hairball of a query to show the draft results before we start trying overhauling ETL or the user facing applications.

-- final working query!

SELECT 
--FOURTH Query aggregates into monthly buckets
    year_month,
    SUM(d.cyclist_injured) as cyclist_injured,
    SUM(d.cyclist_killed) as cyclist_killed,
    SUM(d.motorist_injured) as motorist_injured,
    SUM(d.motorist_killed) as motorist_killed,
    SUM(d.pedestrian_injured) as pedestrian_injured,
    SUM(d.pedestrian_killed) as pedestrian_killed,
    SUM(d.persons_injured) as persons_injured,
    SUM(d.persons_killed) as persons_killed,
    SUM(d.cyclist_injured_bybike) as cyclist_injured_bybike,
    SUM(d.cyclist_killed_bybike) as cyclist_killed_bybike,
    SUM(d.motorist_injured_bybike) as motorist_injured_bybike,
    SUM(d.motorist_killed_bybike) as motorist_killed_bybike,
    SUM(d.pedestrian_injured_bybike) as pedestrian_injured_bybike,
    SUM(d.pedestrian_killed_bybike) as pedestrian_killed_bybike,
    SUM(d.persons_injured_bybike) as persons_injured_bybike,
    SUM(d.persons_killed_bybike) as persons_killed_bybike,
    SUM(d.cyclist_injured_byscooter) as cyclist_injured_byscooter,
    SUM(d.cyclist_killed_byscooter) as cyclist_killed_byscooter,
    SUM(d.motorist_injured_byscooter) as motorist_injured_byscooter,
    SUM(d.motorist_killed_byscooter) as motorist_killed_byscooter,
    SUM(d.pedestrian_injured_byscooter) as pedestrian_injured_byscooter,
    SUM(d.pedestrian_killed_byscooter) as pedestrian_killed_byscooter,
    SUM(d.persons_injured_byscooter) as persons_injured_byscooter,
    SUM(d.persons_killed_byscooter) as persons_killed_byscooter,
    SUM(d.cyclist_injured_bymotorcycle) as cyclist_injured_bymotorcycle,
    SUM(d.cyclist_killed_bymotorcycle) as cyclist_killed_bymotorcycle,
    SUM(d.motorist_injured_bymotorcycle) as motorist_injured_bymotorcycle,
    SUM(d.motorist_killed_bymotorcycle) as motorist_killed_bymotorcycle,
    SUM(d.pedestrian_injured_bymotorcycle) as pedestrian_injured_bymotorcycle,
    SUM(d.pedestrian_killed_bymotorcycle) as pedestrian_killed_bymotorcycle,
    SUM(d.persons_injured_bymotorcycle) as persons_injured_bymotorcycle,
    SUM(d.persons_killed_bymotorcycle) as persons_killed_bymotorcycle,
    SUM(d.cyclist_injured_bybusvan) as cyclist_injured_bybusvan,
    SUM(d.cyclist_killed_bybusvan) as cyclist_killed_bybusvan,
    SUM(d.motorist_injured_bybusvan) as motorist_injured_bybusvan,
    SUM(d.motorist_killed_bybusvan) as motorist_killed_bybusvan,
    SUM(d.pedestrian_injured_bybusvan) as pedestrian_injured_bybusvan,
    SUM(d.pedestrian_killed_bybusvan) as pedestrian_killed_bybusvan,
    SUM(d.persons_injured_bybusvan) as persons_injured_bybusvan,
    SUM(d.persons_killed_bybusvan) as persons_killed_bybusvan,
    SUM(d.cyclist_injured_bycar) as cyclist_injured_bycar,
    SUM(d.cyclist_killed_bycar) as cyclist_killed_bycar,
    SUM(d.motorist_injured_bycar) as motorist_injured_bycar,
    SUM(d.motorist_killed_bycar) as motorist_killed_bycar,
    SUM(d.pedestrian_injured_bycar) as pedestrian_injured_bycar,
    SUM(d.pedestrian_killed_bycar) as pedestrian_killed_bycar,
    SUM(d.persons_injured_bycar) as persons_injured_bycar,
    SUM(d.persons_killed_bycar) as persons_killed_bycar,
    SUM(d.cyclist_injured_bysuv) as cyclist_injured_bysuv,
    SUM(d.cyclist_killed_bysuv) as cyclist_killed_bysuv,
    SUM(d.motorist_injured_bysuv) as motorist_injured_bysuv,
    SUM(d.motorist_killed_bysuv) as motorist_killed_bysuv,
    SUM(d.pedestrian_injured_bysuv) as pedestrian_injured_bysuv,
    SUM(d.pedestrian_killed_bysuv) as pedestrian_killed_bysuv,
    SUM(d.persons_injured_bysuv) as persons_injured_bysuv,
    SUM(d.persons_killed_bysuv) as persons_killed_bysuv,
    SUM(d.cyclist_injured_bytruck) as cyclist_injured_bytruck,
    SUM(d.cyclist_killed_bytruck) as cyclist_killed_bytruck,
    SUM(d.motorist_injured_bytruck) as motorist_injured_bytruck,
    SUM(d.motorist_killed_bytruck) as motorist_killed_bytruck,
    SUM(d.pedestrian_injured_bytruck) as pedestrian_injured_bytruck,
    SUM(d.pedestrian_killed_bytruck) as pedestrian_killed_bytruck,
    SUM(d.persons_injured_bytruck) as persons_injured_bytruck,
    SUM(d.persons_killed_bytruck) as persons_killed_bytruck,
    SUM(d.cyclist_injured_byother) as cyclist_injured_byother,
    SUM(d.cyclist_killed_byother) as cyclist_killed_byother,
    SUM(d.motorist_injured_byother) as motorist_injured_byother,
    SUM(d.motorist_killed_byother) as motorist_killed_byother,
    SUM(d.pedestrian_injured_byother) as pedestrian_injured_byother,
    SUM(d.pedestrian_killed_byother) as pedestrian_killed_byother,
    SUM(d.persons_injured_byother) as persons_injured_byother,
    SUM(d.persons_killed_byother) as persons_killed_byother
FROM
--THIRD Query multiplies out all the blame allocation for each assigned vehicle type
(SELECT 
    year_month,
    cyclist_injured,
    cyclist_killed,
    motorist_injured,
    motorist_killed,
    pedestrian_injured,
    pedestrian_killed,
    persons_injured,
    persons_killed,
    hasvehicle_bicycle,
    hasvehicle_scooter,
    hasvehicle_motorcycle,
    hasvehicle_busvan,
    hasvehicle_car,
    hasvehicle_suv,
    hasvehicle_truck,
    hasvehicle_other_unspecified,
    blame_factor,
    bike_blame,
    --now all the different fields, for every mode
    --hasvehicle_bicycle
    CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN cyclist_injured_allocated ELSE 0 END AS cyclist_injured_bybike,
    CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN cyclist_killed_allocated ELSE 0 END AS cyclist_killed_bybike,
    CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN motorist_injured_allocated ELSE 0 END AS motorist_injured_bybike,
    CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN motorist_killed_allocated ELSE 0 END AS motorist_killed_bybike,
    CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN pedestrian_injured_allocated ELSE 0 END AS pedestrian_injured_bybike,
    CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN pedestrian_killed_allocated ELSE 0 END AS pedestrian_killed_bybike,
    CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN persons_injured_allocated ELSE 0 END AS persons_injured_bybike,
    CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN persons_killed_allocated ELSE 0 END AS persons_killed_bybike,
    --hasvehicle_scooter
    CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN cyclist_injured_allocated ELSE 0 END AS cyclist_injured_byscooter,
    CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN cyclist_killed_allocated ELSE 0 END AS cyclist_killed_byscooter,
    CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN motorist_injured_allocated ELSE 0 END AS motorist_injured_byscooter,
    CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN motorist_killed_allocated ELSE 0 END AS motorist_killed_byscooter,
    CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN pedestrian_injured_allocated ELSE 0 END AS pedestrian_injured_byscooter,
    CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN pedestrian_killed_allocated ELSE 0 END AS pedestrian_killed_byscooter,
    CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN persons_injured_allocated ELSE 0 END AS persons_injured_byscooter,
    CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN persons_killed_allocated ELSE 0 END AS persons_killed_byscooter,
    --hasvehicle_motorcycle
    CASE WHEN (hasvehicle_motorcycle is TRUE) THEN cyclist_injured_allocated ELSE 0 END AS cyclist_injured_bymotorcycle,
    CASE WHEN (hasvehicle_motorcycle is TRUE) THEN cyclist_killed_allocated ELSE 0 END AS cyclist_killed_bymotorcycle,
    CASE WHEN (hasvehicle_motorcycle is TRUE) THEN motorist_injured_allocated ELSE 0 END AS motorist_injured_bymotorcycle,
    CASE WHEN (hasvehicle_motorcycle is TRUE) THEN motorist_killed_allocated ELSE 0 END AS motorist_killed_bymotorcycle,
    CASE WHEN (hasvehicle_motorcycle is TRUE) THEN pedestrian_injured_allocated ELSE 0 END AS pedestrian_injured_bymotorcycle,
    CASE WHEN (hasvehicle_motorcycle is TRUE) THEN pedestrian_killed_allocated ELSE 0 END AS pedestrian_killed_bymotorcycle,
    CASE WHEN (hasvehicle_motorcycle is TRUE) THEN persons_injured_allocated ELSE 0 END AS persons_injured_bymotorcycle,
    CASE WHEN (hasvehicle_motorcycle is TRUE) THEN persons_killed_allocated ELSE 0 END AS persons_killed_bymotorcycle,
    --hasvehicle_busvan
    CASE WHEN (hasvehicle_busvan is TRUE) THEN cyclist_injured_allocated ELSE 0 END AS cyclist_injured_bybusvan,
    CASE WHEN (hasvehicle_busvan is TRUE) THEN cyclist_killed_allocated ELSE 0 END AS cyclist_killed_bybusvan,
    CASE WHEN (hasvehicle_busvan is TRUE) THEN motorist_injured_allocated ELSE 0 END AS motorist_injured_bybusvan,
    CASE WHEN (hasvehicle_busvan is TRUE) THEN motorist_killed_allocated ELSE 0 END AS motorist_killed_bybusvan,
    CASE WHEN (hasvehicle_busvan is TRUE) THEN pedestrian_injured_allocated ELSE 0 END AS pedestrian_injured_bybusvan,
    CASE WHEN (hasvehicle_busvan is TRUE) THEN pedestrian_killed_allocated ELSE 0 END AS pedestrian_killed_bybusvan,
    CASE WHEN (hasvehicle_busvan is TRUE) THEN persons_injured_allocated ELSE 0 END AS persons_injured_bybusvan,
    CASE WHEN (hasvehicle_busvan is TRUE) THEN persons_killed_allocated ELSE 0 END AS persons_killed_bybusvan,
    --hasvehicle_car
    CASE WHEN (hasvehicle_car is TRUE) THEN cyclist_injured_allocated ELSE 0 END AS cyclist_injured_bycar,
    CASE WHEN (hasvehicle_car is TRUE) THEN cyclist_killed_allocated ELSE 0 END AS cyclist_killed_bycar,
    CASE WHEN (hasvehicle_car is TRUE) THEN motorist_injured_allocated ELSE 0 END AS motorist_injured_bycar,
    CASE WHEN (hasvehicle_car is TRUE) THEN motorist_killed_allocated ELSE 0 END AS motorist_killed_bycar,
    CASE WHEN (hasvehicle_car is TRUE) THEN pedestrian_injured_allocated ELSE 0 END AS pedestrian_injured_bycar,
    CASE WHEN (hasvehicle_car is TRUE) THEN pedestrian_killed_allocated ELSE 0 END AS pedestrian_killed_bycar,
    CASE WHEN (hasvehicle_car is TRUE) THEN persons_injured_allocated ELSE 0 END AS persons_injured_bycar,
    CASE WHEN (hasvehicle_car is TRUE) THEN persons_killed_allocated ELSE 0 END AS persons_killed_bycar,
    --hasvehicle_suv
    CASE WHEN (hasvehicle_suv is TRUE) THEN cyclist_injured_allocated ELSE 0 END AS cyclist_injured_bysuv,
    CASE WHEN (hasvehicle_suv is TRUE) THEN cyclist_killed_allocated ELSE 0 END AS cyclist_killed_bysuv,
    CASE WHEN (hasvehicle_suv is TRUE) THEN motorist_injured_allocated ELSE 0 END AS motorist_injured_bysuv,
    CASE WHEN (hasvehicle_suv is TRUE) THEN motorist_killed_allocated ELSE 0 END AS motorist_killed_bysuv,
    CASE WHEN (hasvehicle_suv is TRUE) THEN pedestrian_injured_allocated ELSE 0 END AS pedestrian_injured_bysuv,
    CASE WHEN (hasvehicle_suv is TRUE) THEN pedestrian_killed_allocated ELSE 0 END AS pedestrian_killed_bysuv,
    CASE WHEN (hasvehicle_suv is TRUE) THEN persons_injured_allocated ELSE 0 END AS persons_injured_bysuv,
    CASE WHEN (hasvehicle_suv is TRUE) THEN persons_killed_allocated ELSE 0 END AS persons_killed_bysuv,
    --hasvehicle_truck
    CASE WHEN (hasvehicle_truck is TRUE) THEN cyclist_injured_allocated ELSE 0 END AS cyclist_injured_bytruck,
    CASE WHEN (hasvehicle_truck is TRUE) THEN cyclist_killed_allocated ELSE 0 END AS cyclist_killed_bytruck,
    CASE WHEN (hasvehicle_truck is TRUE) THEN motorist_injured_allocated ELSE 0 END AS motorist_injured_bytruck,
    CASE WHEN (hasvehicle_truck is TRUE) THEN motorist_killed_allocated ELSE 0 END AS motorist_killed_bytruck,
    CASE WHEN (hasvehicle_truck is TRUE) THEN pedestrian_injured_allocated ELSE 0 END AS pedestrian_injured_bytruck,
    CASE WHEN (hasvehicle_truck is TRUE) THEN pedestrian_killed_allocated ELSE 0 END AS pedestrian_killed_bytruck,
    CASE WHEN (hasvehicle_truck is TRUE) THEN persons_injured_allocated ELSE 0 END AS persons_injured_bytruck,
    CASE WHEN (hasvehicle_truck is TRUE) THEN persons_killed_allocated ELSE 0 END AS persons_killed_bytruck,    
    --hasvehicle_other_unspecified
    CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN cyclist_injured_allocated ELSE 0 END AS cyclist_injured_byother,
    CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN cyclist_killed_allocated ELSE 0 END AS cyclist_killed_byother,
    CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN motorist_injured_allocated ELSE 0 END AS motorist_injured_byother,
    CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN motorist_killed_allocated ELSE 0 END AS motorist_killed_byother,
    CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN pedestrian_injured_allocated ELSE 0 END AS pedestrian_injured_byother,
    CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN pedestrian_killed_allocated ELSE 0 END AS pedestrian_killed_byother,
    CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN persons_injured_allocated ELSE 0 END AS persons_injured_byother,
    CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN persons_killed_allocated ELSE 0 END AS persons_killed_byother
FROM
--SECOND QUERY pulls in a bunch of pass through but also multiples the allocation amounts
( SELECT
    --these are totals, unallocated
    cyclist_injured,
    cyclist_killed,
    motorist_injured,
    motorist_killed,
    pedestrian_injured,
    pedestrian_killed,
    persons_injured,
    persons_killed,
    hasvehicle_bicycle,
    hasvehicle_scooter,
    hasvehicle_motorcycle,
    hasvehicle_busvan,
    hasvehicle_car,
    hasvehicle_suv,
    hasvehicle_truck,
    hasvehicle_other_unspecified,
    blame_factor,
    bike_blame,
    --fractional for allocation
    (blame_factor * cyclist_injured) as cyclist_injured_allocated,
    (blame_factor * cyclist_killed) as cyclist_killed_allocated,
    (blame_factor * motorist_injured) as motorist_injured_allocated,
    (blame_factor * motorist_killed) as motorist_killed_allocated,
    (blame_factor * pedestrian_injured) as pedestrian_injured_allocated,
    (blame_factor * pedestrian_killed) as pedestrian_killed_allocated,
    (blame_factor * persons_injured) as persons_injured_allocated,
    (blame_factor * persons_killed) as persons_killed_allocated,

    year_month
FROM
    --BASE QUERY gets the core data and also sets a "blame factor" based on involved vehicle types
    (SELECT
        cartodb_id,
        number_of_cyclist_injured as cyclist_injured,
        number_of_cyclist_killed as cyclist_killed,
        number_of_motorist_injured as motorist_injured,
        number_of_motorist_killed as motorist_killed,
        number_of_pedestrian_injured as pedestrian_injured,
        number_of_pedestrian_killed as pedestrian_killed,
        (number_of_pedestrian_injured + number_of_cyclist_injured + number_of_motorist_injured) as persons_injured,
        (number_of_pedestrian_killed + number_of_cyclist_killed + number_of_motorist_killed) as persons_killed,
        hasvehicle_bicycle,
        hasvehicle_scooter,
        hasvehicle_motorcycle,
        hasvehicle_busvan,
        hasvehicle_car,
        hasvehicle_suv,
        hasvehicle_truck,
        --set other to TRUE if nothing else is selected, which catches crashes with no vtype data
        CASE
            WHEN 
                (hasvehicle_bicycle::int + hasvehicle_motorcycle::int + hasvehicle_scooter::int + hasvehicle_busvan::int + hasvehicle_car::int + hasvehicle_suv::int + hasvehicle_truck::int + hasvehicle_other::int = 0)
            THEN 
                TRUE
            ELSE hasvehicle_other 
        END as hasvehicle_other_unspecified,
        -- Determine if this record will "blame" bikes or scooters for injuries or deaths, only in cases with no other motor vehicles
        CASE
            WHEN
                (hasvehicle_bicycle    OR hasvehicle_scooter) AND (hasvehicle_motorcycle::int + hasvehicle_busvan::int + hasvehicle_car::int + hasvehicle_suv::int + hasvehicle_truck::int + hasvehicle_other::int = 0)
            THEN
                TRUE
            ELSE
                FALSE
        END as bike_blame,
        --determine the number of blameable vehicles involved and then turn to percentage blame. Run as suqquery to it can be used for math
        CASE
            WHEN
                (hasvehicle_bicycle OR hasvehicle_scooter) AND (hasvehicle_motorcycle::int + hasvehicle_busvan::int + hasvehicle_car::int + hasvehicle_suv::int + hasvehicle_truck::int + hasvehicle_other::int = 0)
            THEN
                (1/CAST (NULLIF((hasvehicle_bicycle::int + hasvehicle_scooter::int),0) as FLOAT))
            ELSE
                (1/CAST (
                        NULLIF(
                            (hasvehicle_motorcycle::int + hasvehicle_busvan::int + hasvehicle_car::int + hasvehicle_suv::int + hasvehicle_truck::int + hasvehicle_other::int)
                            + 
                            (CASE 
                                WHEN (hasvehicle_bicycle::int + hasvehicle_motorcycle::int + hasvehicle_scooter::int + hasvehicle_busvan::int + hasvehicle_car::int + hasvehicle_suv::int + hasvehicle_truck::int + hasvehicle_other::int = 0) 
                                THEN 
                                    1 
                                ELSE 
                                0 END
                            ),0) as FLOAT))
            END
        as blame_factor,
        year || '-' || LPAD(month::text, 2, '0') as year_month

        from crashes_all_prod as c
        WHERE year = 2019
    ) as a
) as b) as d
GROUP BY year_month
danrademacher commented 4 years ago

One critique of this approach is that this is a TYPE apportionment, so we can imagine situations where, say, 4 cars and a truck cause 5 deaths. By the above, CAR gets hit with 2.5 and TRUCK the same. Is that right or should it be CAR 4 and TRUCK 1?

Decided to see how often this is actually a problem. The query below will select all 2019 Crashes where bike_blame is false (so a motor vehicle was involved) and the number of Motor Vehicle Types is not equal to the raw number of vehicles.

Result: 686 crashes, out of 211,038 in 2019. 0.3% of crashes.

Ran this for the whole dataset as well, 7,638 crashes out of 1,744,626. Similar: 0.4% of crashes. That's less error than we're going to have just from weird data.

SELECT
    count(cartodb_id)
FROM
    (SELECT 
        cartodb_id, 
        count(distinct(vtype_unnest)) as vehiclecount,
        typecount
    FROM
        (select * from 
            (SELECT 
                cartodb_id,
                unnest(vehicle_type) as vtype_unnest,
                (hasvehicle_motorcycle::int + hasvehicle_busvan::int + hasvehicle_car::int + hasvehicle_suv::int + hasvehicle_truck::int + hasvehicle_other::int) as typecount,
                CASE
                WHEN
                    (hasvehicle_bicycle OR hasvehicle_scooter) AND (hasvehicle_motorcycle::int + hasvehicle_busvan::int + hasvehicle_car::int + hasvehicle_suv::int + hasvehicle_truck::int + hasvehicle_other::int = 0)
                THEN
                    TRUE
                ELSE
                    FALSE
                END as bike_blame
            FROM crashes_all_prod
            where year = 2019) as a 
            -- get only crashes where bikes are not blamed
        where bike_blame is false) as b
    GROUP BY cartodb_id,typecount) as c
where 
    --Single vehicle crashes don't matter, and multivehicle crashes of the same type don't matter. So show me records with more than 1 vehicle, more than 1 type and a differential between raw vehicle number of type count
    vehiclecount>1 and 
    typecount > 1 and
    vehiclecount != typecount
danrademacher commented 4 years ago

New table to test some schema changes so we can then test how we might bake the above calcs into the database with repeatable queries: https://chekpeds.carto.com/tables/crashes_all_prod_vtype_tests

Then we can use the table for dynamic test of this allocation across longer time periods and different arrays of vehicle types

danrademacher commented 4 years ago

We need to create and populate these fields at the crash level using the logic already captured in the above nested query:

cyclist_injured_bybike
cyclist_killed_bybike
motorist_injured_bybike
motorist_killed_bybike
pedestrian_injured_bybike
pedestrian_killed_bybike
persons_injured_bybike
persons_killed_bybike
cyclist_injured_byscooter
cyclist_killed_byscooter
motorist_injured_byscooter
motorist_killed_byscooter
pedestrian_injured_byscooter
pedestrian_killed_byscooter
persons_injured_byscooter
persons_killed_byscooter
cyclist_injured_bymotorcycle
cyclist_killed_bymotorcycle
motorist_injured_bymotorcycle
motorist_killed_bymotorcycle
pedestrian_injured_bymotorcycle
pedestrian_killed_bymotorcycle
persons_injured_bymotorcycle
persons_killed_bymotorcycle
cyclist_injured_bybusvan
cyclist_killed_bybusvan
motorist_injured_bybusvan
motorist_killed_bybusvan
pedestrian_injured_bybusvan
pedestrian_killed_bybusvan
persons_injured_bybusvan
persons_killed_bybusvan
cyclist_injured_bycar
cyclist_killed_bycar
motorist_injured_bycar
motorist_killed_bycar
pedestrian_injured_bycar
pedestrian_killed_bycar
persons_injured_bycar
persons_killed_bycar
cyclist_injured_bysuv
cyclist_killed_bysuv
motorist_injured_bysuv
motorist_killed_bysuv
pedestrian_injured_bysuv
pedestrian_killed_bysuv
persons_injured_bysuv
persons_killed_bysuv
cyclist_injured_bytruck
cyclist_killed_bytruck
motorist_injured_bytruck
motorist_killed_bytruck
pedestrian_injured_bytruck
pedestrian_killed_bytruck
persons_injured_bytruck
persons_killed_bytruck
cyclist_injured_byother
cyclist_killed_byother
motorist_injured_byother
motorist_killed_byother
pedestrian_injured_byother
pedestrian_killed_byother
persons_injured_byother

Then we could run the monthly visualization summaries against that. Need to make sure other is clearly other_unspecified

danrademacher commented 4 years ago

Ran this in CARTO GUI to add the fields

ALTER TABLE crashes_all_prod_vtype_tests
    ADD COLUMN cyclist_injured_bybike decimal,
    ADD COLUMN cyclist_killed_bybike decimal,
    ADD COLUMN motorist_injured_bybike decimal,
    ADD COLUMN motorist_killed_bybike decimal,
    ADD COLUMN pedestrian_injured_bybike decimal,
    ADD COLUMN pedestrian_killed_bybike decimal,
    ADD COLUMN persons_injured_bybike decimal,
    ADD COLUMN persons_killed_bybike decimal,
    ADD COLUMN cyclist_injured_byscooter decimal,
    ADD COLUMN cyclist_killed_byscooter decimal,
    ADD COLUMN motorist_injured_byscooter decimal,
    ADD COLUMN motorist_killed_byscooter decimal,
    ADD COLUMN pedestrian_injured_byscooter decimal,
    ADD COLUMN pedestrian_killed_byscooter decimal,
    ADD COLUMN persons_injured_byscooter decimal,
    ADD COLUMN persons_killed_byscooter decimal,
    ADD COLUMN cyclist_injured_bymotorcycle decimal,
    ADD COLUMN cyclist_killed_bymotorcycle decimal,
    ADD COLUMN motorist_injured_bymotorcycle decimal,
    ADD COLUMN motorist_killed_bymotorcycle decimal,
    ADD COLUMN pedestrian_injured_bymotorcycle decimal,
    ADD COLUMN pedestrian_killed_bymotorcycle decimal,
    ADD COLUMN persons_injured_bymotorcycle decimal,
    ADD COLUMN persons_killed_bymotorcycle decimal,
    ADD COLUMN cyclist_injured_bybusvan decimal,
    ADD COLUMN cyclist_killed_bybusvan decimal,
    ADD COLUMN motorist_injured_bybusvan decimal,
    ADD COLUMN motorist_killed_bybusvan decimal,
    ADD COLUMN pedestrian_injured_bybusvan decimal,
    ADD COLUMN pedestrian_killed_bybusvan decimal,
    ADD COLUMN persons_injured_bybusvan decimal,
    ADD COLUMN persons_killed_bybusvan decimal,
    ADD COLUMN cyclist_injured_bycar decimal,
    ADD COLUMN cyclist_killed_bycar decimal,
    ADD COLUMN motorist_injured_bycar decimal,
    ADD COLUMN motorist_killed_bycar decimal,
    ADD COLUMN pedestrian_injured_bycar decimal,
    ADD COLUMN pedestrian_killed_bycar decimal,
    ADD COLUMN persons_injured_bycar decimal,
    ADD COLUMN persons_killed_bycar decimal,
    ADD COLUMN cyclist_injured_bysuv decimal,
    ADD COLUMN cyclist_killed_bysuv decimal,
    ADD COLUMN motorist_injured_bysuv decimal,
    ADD COLUMN motorist_killed_bysuv decimal,
    ADD COLUMN pedestrian_injured_bysuv decimal,
    ADD COLUMN pedestrian_killed_bysuv decimal,
    ADD COLUMN persons_injured_bysuv decimal,
    ADD COLUMN persons_killed_bysuv decimal,
    ADD COLUMN cyclist_injured_bytruck decimal,
    ADD COLUMN cyclist_killed_bytruck decimal,
    ADD COLUMN motorist_injured_bytruck decimal,
    ADD COLUMN motorist_killed_bytruck decimal,
    ADD COLUMN pedestrian_injured_bytruck decimal,
    ADD COLUMN pedestrian_killed_bytruck decimal,
    ADD COLUMN persons_injured_bytruck decimal,
    ADD COLUMN persons_killed_bytruck decimal,
    ADD COLUMN cyclist_injured_byother decimal,
    ADD COLUMN cyclist_killed_byother decimal,
    ADD COLUMN motorist_injured_byother decimal,
    ADD COLUMN motorist_killed_byother decimal,
    ADD COLUMN pedestrian_injured_byother decimal,
    ADD COLUMN pedestrian_killed_byother decimal,
    ADD COLUMN persons_injured_byother decimal,
    ADD COLUMN persons_killed_byother  decimal;
danrademacher commented 4 years ago

Next add blame_factor and bike_blame and also hasvehicle_other_unspecified so we can catch unspecified allocations without constantly querying them out with conidtionals.

ALTER TABLE crashes_all_prod_vtype_tests
    ADD COLUMN hasvehicle_other_unspecified  boolean,
    ADD COLUMN blame_factor decimal,
    ADD COLUMN bike_blame boolean;

Setting their values for the whole dataset at once fails. So I did it for Jan 2019 through today:

UPDATE crashes_all_prod_vtype_tests
        --set other to TRUE if nothing else is selected, which catches crashes with no vtype data
        SET hasvehicle_other_unspecified = 
        CASE
            WHEN 
                (hasvehicle_bicycle::int + hasvehicle_motorcycle::int + hasvehicle_scooter::int + hasvehicle_busvan::int + hasvehicle_car::int + hasvehicle_suv::int + hasvehicle_truck::int + hasvehicle_other::int = 0)
            THEN 
                TRUE
            ELSE hasvehicle_other 
        END,
        -- Determine if this record will "blame" bikes or scooters for injuries or deaths, only in cases with no other motor vehicles
        bike_blame = 
        CASE
            WHEN
                (hasvehicle_bicycle OR hasvehicle_scooter) AND (hasvehicle_motorcycle::int + hasvehicle_busvan::int + hasvehicle_car::int + hasvehicle_suv::int + hasvehicle_truck::int + hasvehicle_other::int = 0)
            THEN
                TRUE
            ELSE
                FALSE
        END,
        --determine the number of blameable vehicles involved and then turn to percentage blame. Run as suqquery to it can be used for math
        blame_factor = 
        CASE
            WHEN
                (hasvehicle_bicycle OR hasvehicle_scooter) AND (hasvehicle_motorcycle::int + hasvehicle_busvan::int + hasvehicle_car::int + hasvehicle_suv::int + hasvehicle_truck::int + hasvehicle_other::int = 0)
            THEN
                (1/CAST (NULLIF((hasvehicle_bicycle::int + hasvehicle_scooter::int),0) as FLOAT))
            ELSE
                (1/CAST (
                        NULLIF(
                            (hasvehicle_motorcycle::int + hasvehicle_busvan::int + hasvehicle_car::int + hasvehicle_suv::int + hasvehicle_truck::int + hasvehicle_other::int)
                            + 
                            (CASE 
                                WHEN (hasvehicle_bicycle::int + hasvehicle_motorcycle::int + hasvehicle_scooter::int + hasvehicle_busvan::int + hasvehicle_car::int + hasvehicle_suv::int + hasvehicle_truck::int + hasvehicle_other::int = 0) 
                                THEN 
                                    1 
                                ELSE 
                                0 END
                            ),0) as FLOAT))
        END
WHERE
date_val between '2019-01-01 00:00:00'::timestamp  and '2020-03-05 23:59:00'::timestamp;

We'd need to batch this to update the whole dataset

danrademacher commented 4 years ago

This should set all the allocations but getting a syntax error on SET:

--This Query multiplies out all the blame allocation for each assigned vehicle type
UPDATE crashes_all_prod_vtype_tests
    --now all the different fields, for every mode
    SET
    --hasvehicle_bicycle
    cyclist_injured_bybike = CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN cyclist_injured_allocated ELSE 0 END,
    cyclist_killed_bybike = CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN cyclist_killed_allocated ELSE 0 END,
    motorist_injured_bybike = CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN motorist_injured_allocated ELSE 0 END,
    motorist_killed_bybike = CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN motorist_killed_allocated ELSE 0 END,
    pedestrian_injured_bybike = CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN pedestrian_injured_allocated ELSE 0 END,
    pedestrian_killed_bybike = CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN pedestrian_killed_allocated ELSE 0 END,
    persons_injured_bybike = CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN persons_injured_allocated ELSE 0 END,
    persons_killed_bybike = CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN persons_killed_allocated ELSE 0 END,
    --hasvehicle_scooter
    cyclist_injured_byscooter = CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN cyclist_injured_allocated ELSE 0 END,
    cyclist_killed_byscooter = CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN cyclist_killed_allocated ELSE 0 END,
    motorist_injured_byscooter = CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN motorist_injured_allocated ELSE 0 END,
    motorist_killed_byscooter = CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN motorist_killed_allocated ELSE 0 END,
    pedestrian_injured_byscooter = CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN pedestrian_injured_allocated ELSE 0 END,
    pedestrian_killed_byscooter = CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN pedestrian_killed_allocated ELSE 0 END,
    persons_injured_byscooter = CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN persons_injured_allocated ELSE 0 END,
    persons_killed_byscooter = CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN persons_killed_allocated ELSE 0 END,
    --hasvehicle_motorcycle
    cyclist_injured_bymotorcycle = CASE WHEN (hasvehicle_motorcycle is TRUE) THEN cyclist_injured_allocated ELSE 0 END,
    cyclist_killed_bymotorcycle = CASE WHEN (hasvehicle_motorcycle is TRUE) THEN cyclist_killed_allocated ELSE 0 END,
    motorist_injured_bymotorcycle = CASE WHEN (hasvehicle_motorcycle is TRUE) THEN motorist_injured_allocated ELSE 0 END,
    motorist_killed_bymotorcycle = CASE WHEN (hasvehicle_motorcycle is TRUE) THEN motorist_killed_allocated ELSE 0 END,
    pedestrian_injured_bymotorcycle = CASE WHEN (hasvehicle_motorcycle is TRUE) THEN pedestrian_injured_allocated ELSE 0 END,
    pedestrian_killed_bymotorcycle = CASE WHEN (hasvehicle_motorcycle is TRUE) THEN pedestrian_killed_allocated ELSE 0 END,
    persons_injured_bymotorcycle = CASE WHEN (hasvehicle_motorcycle is TRUE) THEN persons_injured_allocated ELSE 0 END,
    persons_killed_bymotorcycle = CASE WHEN (hasvehicle_motorcycle is TRUE) THEN persons_killed_allocated ELSE 0 END,
    --hasvehicle_busvan
    cyclist_injured_bybusvan = CASE WHEN (hasvehicle_busvan is TRUE) THEN cyclist_injured_allocated ELSE 0 END,
    cyclist_killed_bybusvan = CASE WHEN (hasvehicle_busvan is TRUE) THEN cyclist_killed_allocated ELSE 0 END,
    motorist_injured_bybusvan = CASE WHEN (hasvehicle_busvan is TRUE) THEN motorist_injured_allocated ELSE 0 END,
    motorist_killed_bybusvan = CASE WHEN (hasvehicle_busvan is TRUE) THEN motorist_killed_allocated ELSE 0 END,
    pedestrian_injured_bybusvan = CASE WHEN (hasvehicle_busvan is TRUE) THEN pedestrian_injured_allocated ELSE 0 END,
    pedestrian_killed_bybusvan = CASE WHEN (hasvehicle_busvan is TRUE) THEN pedestrian_killed_allocated ELSE 0 END,
    persons_injured_bybusvan = CASE WHEN (hasvehicle_busvan is TRUE) THEN persons_injured_allocated ELSE 0 END,
    persons_killed_bybusvan = CASE WHEN (hasvehicle_busvan is TRUE) THEN persons_killed_allocated ELSE 0 END,
    --hasvehicle_car
    cyclist_injured_bycar = CASE WHEN (hasvehicle_car is TRUE) THEN cyclist_injured_allocated ELSE 0 END,
    cyclist_killed_bycar = CASE WHEN (hasvehicle_car is TRUE) THEN cyclist_killed_allocated ELSE 0 END,
    motorist_injured_bycar = CASE WHEN (hasvehicle_car is TRUE) THEN motorist_injured_allocated ELSE 0 END,
    motorist_killed_bycar = CASE WHEN (hasvehicle_car is TRUE) THEN motorist_killed_allocated ELSE 0 END,
    pedestrian_injured_bycar = CASE WHEN (hasvehicle_car is TRUE) THEN pedestrian_injured_allocated ELSE 0 END,
    pedestrian_killed_bycar = CASE WHEN (hasvehicle_car is TRUE) THEN pedestrian_killed_allocated ELSE 0 END,
    persons_injured_bycar = CASE WHEN (hasvehicle_car is TRUE) THEN persons_injured_allocated ELSE 0 END,
    persons_killed_bycar = CASE WHEN (hasvehicle_car is TRUE) THEN persons_killed_allocated ELSE 0 END,
    --hasvehicle_suv
    cyclist_injured_bysuv = CASE WHEN (hasvehicle_suv is TRUE) THEN cyclist_injured_allocated ELSE 0 END,
    cyclist_killed_bysuv = CASE WHEN (hasvehicle_suv is TRUE) THEN cyclist_killed_allocated ELSE 0 END,
    motorist_injured_bysuv = CASE WHEN (hasvehicle_suv is TRUE) THEN motorist_injured_allocated ELSE 0 END,
    motorist_killed_bysuv = CASE WHEN (hasvehicle_suv is TRUE) THEN motorist_killed_allocated ELSE 0 END,
    pedestrian_injured_bysuv = CASE WHEN (hasvehicle_suv is TRUE) THEN pedestrian_injured_allocated ELSE 0 END,
    pedestrian_killed_bysuv = CASE WHEN (hasvehicle_suv is TRUE) THEN pedestrian_killed_allocated ELSE 0 END,
    persons_injured_bysuv = CASE WHEN (hasvehicle_suv is TRUE) THEN persons_injured_allocated ELSE 0 END,
    persons_killed_bysuv = CASE WHEN (hasvehicle_suv is TRUE) THEN persons_killed_allocated ELSE 0 END,
    --hasvehicle_truck
    cyclist_injured_bytruck = CASE WHEN (hasvehicle_truck is TRUE) THEN cyclist_injured_allocated ELSE 0 END,
    cyclist_killed_bytruck = CASE WHEN (hasvehicle_truck is TRUE) THEN cyclist_killed_allocated ELSE 0 END,
    motorist_injured_bytruck = CASE WHEN (hasvehicle_truck is TRUE) THEN motorist_injured_allocated ELSE 0 END,
    motorist_killed_bytruck = CASE WHEN (hasvehicle_truck is TRUE) THEN motorist_killed_allocated ELSE 0 END,
    pedestrian_injured_bytruck = CASE WHEN (hasvehicle_truck is TRUE) THEN pedestrian_injured_allocated ELSE 0 END,
    pedestrian_killed_bytruck = CASE WHEN (hasvehicle_truck is TRUE) THEN pedestrian_killed_allocated ELSE 0 END,
    persons_injured_bytruck = CASE WHEN (hasvehicle_truck is TRUE) THEN persons_injured_allocated ELSE 0 END,
    persons_killed_bytruck = CASE WHEN (hasvehicle_truck is TRUE) THEN persons_killed_allocated ELSE 0 END,    
    --hasvehicle_other_unspecified
    cyclist_injured_byother = CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN cyclist_injured_allocated ELSE 0 END,
    cyclist_killed_byother = CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN cyclist_killed_allocated ELSE 0 END,
    motorist_injured_byother = CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN motorist_injured_allocated ELSE 0 END,
    motorist_killed_byother = CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN motorist_killed_allocated ELSE 0 END,
    pedestrian_injured_byother = CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN pedestrian_injured_allocated ELSE 0 END,
    pedestrian_killed_byother = CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN pedestrian_killed_allocated ELSE 0 END,
    persons_injured_byother = CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN persons_injured_allocated ELSE 0 END,
    persons_killed_byother = CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN persons_killed_allocated ELSE 0 END

From
    (SELECT
        --these are totals, unallocated
        hasvehicle_bicycle,
        hasvehicle_scooter,
        hasvehicle_motorcycle,
        hasvehicle_busvan,
        hasvehicle_car,
        hasvehicle_suv,
        hasvehicle_truck,
        hasvehicle_other_unspecified,
        blame_factor,
        bike_blame,
        date_val,
        --fractional for allocation
        (blame_factor * number_of_cyclist_injured) as cyclist_injured_allocated,
        (blame_factor * number_of_cyclist_killed) as cyclist_killed_allocated,
        (blame_factor * number_of_motorist_injured) as motorist_injured_allocated,
        (blame_factor * number_of_motorist_killed) as motorist_killed_allocated,
        (blame_factor * number_of_pedestrian_injured) as pedestrian_injured_allocated,
        (blame_factor * number_of_pedestrian_killed) as pedestrian_killed_allocated,
        (blame_factor * (number_of_pedestrian_injured + number_of_cyclist_injured + number_of_motorist_injured) ) as persons_injured_allocated,
        (blame_factor * (number_of_pedestrian_killed + number_of_cyclist_killed + number_of_motorist_killed)) as persons_killed_allocated,
        --hasvehicle_bike
        cyclist_injured_bybike,
        cyclist_killed_bybike,
        motorist_injured_bybike,
        motorist_killed_bybike,
        pedestrian_injured_bybike,
        pedestrian_killed_bybike,
        persons_injured_bybike,
        persons_killed_bybike,
        --hasvehicle_scooter
        cyclist_injured_byscooter,
        cyclist_killed_byscooter,
        motorist_injured_byscooter,
        motorist_killed_byscooter,
        pedestrian_injured_byscooter,
        pedestrian_killed_byscooter,
        persons_injured_byscooter,
        persons_killed_byscooter,
        --hasvehicle_motorcycle
        cyclist_injured_bymotorcycle,
        cyclist_killed_bymotorcycle,
        motorist_injured_bymotorcycle,
        motorist_killed_bymotorcycle,
        pedestrian_injured_bymotorcycle,
        pedestrian_killed_bymotorcycle,
        persons_injured_bymotorcycle,
        persons_killed_bymotorcycle,
        --hasvehicle_busvan
        cyclist_injured_bybusvan,
        cyclist_killed_bybusvan,
        motorist_injured_bybusvan,
        motorist_killed_bybusvan,
        pedestrian_injured_bybusvan,
        pedestrian_killed_bybusvan,
        persons_injured_bybusvan,
        persons_killed_bybusvan,
        --hasvehicle_car
        cyclist_injured_bycar,
        cyclist_killed_bycar,
        motorist_injured_bycar,
        motorist_killed_bycar,
        pedestrian_injured_bycar,
        pedestrian_killed_bycar,
        persons_injured_bycar,
        persons_killed_bycar,
        --hasvehicle_suv
        cyclist_injured_bysuv,
        cyclist_killed_bysuv,
        motorist_injured_bysuv,
        motorist_killed_bysuv,
        pedestrian_injured_bysuv,
        pedestrian_killed_bysuv,
        persons_injured_bysuv,
        persons_killed_bysuv,
        --hasvehicle_truck
        cyclist_injured_bytruck,
        cyclist_killed_bytruck,
        motorist_injured_bytruck,
        motorist_killed_bytruck,
        pedestrian_injured_bytruck,
        pedestrian_killed_bytruck,
        persons_injured_bytruck,
        persons_killed_bytruck,
        --hasvehicle_other_unspecified
        cyclist_injured_byother,
        cyclist_killed_byother,
        motorist_injured_byother,
        motorist_killed_byother,
        pedestrian_injured_byother,
        pedestrian_killed_byother,
        persons_injured_byother,
        persons_killed_byother 
    FROM 
        crashes_all_prod_vtype_tests
    WHERE
        date_val between '2019-01-01 00:00:00'::timestamp  and '2020-03-05 23:59:00'::timestamp
    ) as a
danrademacher commented 4 years ago

OK, not sure why that's not working. For now, going to bake the subquery into the data. Not sure we'd want to do that on production.

ALTER TABLE crashes_all_prod_vtype_tests
    ADD COLUMN cyclist_injured_allocated DECIMAL,
    ADD COLUMN cyclist_killed_allocated DECIMAL,
    ADD COLUMN motorist_injured_allocated DECIMAL,
    ADD COLUMN motorist_killed_allocated DECIMAL,
    ADD COLUMN pedestrian_injured_allocated DECIMAL,
    ADD COLUMN pedestrian_killed_allocated DECIMAL,
    ADD COLUMN persons_injured_allocated DECIMAL,
    ADD COLUMN persons_killed_allocated DECIMAL;

and then populate those with the previous subquery

All records from Jan 2019 to now fails as over API limits:

UPDATE crashes_all_prod_vtype_tests
      SET
            cyclist_injured_allocated = (blame_factor * number_of_cyclist_injured),
            cyclist_killed_allocated = (blame_factor * number_of_cyclist_killed),
            motorist_injured_allocated = (blame_factor * number_of_motorist_injured),
            motorist_killed_allocated = (blame_factor * number_of_motorist_killed),
            pedestrian_injured_allocated = (blame_factor * number_of_pedestrian_injured),
            pedestrian_killed_allocated = (blame_factor * number_of_pedestrian_killed),
            persons_injured_allocated = (blame_factor * (number_of_pedestrian_injured + number_of_cyclist_injured + number_of_motorist_injured) ),
            persons_killed_allocated = (blame_factor * (number_of_pedestrian_killed + number_of_cyclist_killed + number_of_motorist_killed))
      WHERE
            date_val between '2019-01-01 00:00:00'::timestamp  and '2020-03-05 23:59:00'::timestamp

So for now, ran Jan-Mar 2020, then three 4 month increments to get the full time period. That worked.

            date_val between '2020-01-01 00:00:00'::timestamp  and '2020-03-05 23:59:00'::timestamp
            date_val between '2019-09-01 00:00:00'::timestamp  and '2019-12-31 23:59:00'::timestamp
            date_val between '2019-05-01 00:00:00'::timestamp  and '2019-8-31 23:59:00'::timestamp
            date_val between '2019-01-01 00:00:00'::timestamp  and '2019-4-30 23:59:00'::timestamp

Oy, and it looks like I could have done the subquery approach but many you can't put a comment line between the UPDATE and SET? Weird.

At any rate, this works without a comment there but fails with is

    --This Query multiplies out all the blame allocation for each assigned vehicle type
    UPDATE crashes_all_prod_vtype_tests
        SET
        --hasvehicle_bicycle
        cyclist_injured_bybike = CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN cyclist_injured_allocated ELSE 0 END,
        cyclist_killed_bybike = CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN cyclist_killed_allocated ELSE 0 END,
        motorist_injured_bybike = CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN motorist_injured_allocated ELSE 0 END,
        motorist_killed_bybike = CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN motorist_killed_allocated ELSE 0 END,
        pedestrian_injured_bybike = CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN pedestrian_injured_allocated ELSE 0 END,
        pedestrian_killed_bybike = CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN pedestrian_killed_allocated ELSE 0 END,
        persons_injured_bybike = CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN persons_injured_allocated ELSE 0 END,
        persons_killed_bybike = CASE WHEN (bike_blame is TRUE AND hasvehicle_bicycle is TRUE) THEN persons_killed_allocated ELSE 0 END,
        --hasvehicle_scooter
        cyclist_injured_byscooter = CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN cyclist_injured_allocated ELSE 0 END,
        cyclist_killed_byscooter = CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN cyclist_killed_allocated ELSE 0 END,
        motorist_injured_byscooter = CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN motorist_injured_allocated ELSE 0 END,
        motorist_killed_byscooter = CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN motorist_killed_allocated ELSE 0 END,
        pedestrian_injured_byscooter = CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN pedestrian_injured_allocated ELSE 0 END,
        pedestrian_killed_byscooter = CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN pedestrian_killed_allocated ELSE 0 END,
        persons_injured_byscooter = CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN persons_injured_allocated ELSE 0 END,
        persons_killed_byscooter = CASE WHEN (bike_blame is TRUE AND hasvehicle_scooter is TRUE) THEN persons_killed_allocated ELSE 0 END,
        --hasvehicle_motorcycle
        cyclist_injured_bymotorcycle = CASE WHEN (hasvehicle_motorcycle is TRUE) THEN cyclist_injured_allocated ELSE 0 END,
        cyclist_killed_bymotorcycle = CASE WHEN (hasvehicle_motorcycle is TRUE) THEN cyclist_killed_allocated ELSE 0 END,
        motorist_injured_bymotorcycle = CASE WHEN (hasvehicle_motorcycle is TRUE) THEN motorist_injured_allocated ELSE 0 END,
        motorist_killed_bymotorcycle = CASE WHEN (hasvehicle_motorcycle is TRUE) THEN motorist_killed_allocated ELSE 0 END,
        pedestrian_injured_bymotorcycle = CASE WHEN (hasvehicle_motorcycle is TRUE) THEN pedestrian_injured_allocated ELSE 0 END,
        pedestrian_killed_bymotorcycle = CASE WHEN (hasvehicle_motorcycle is TRUE) THEN pedestrian_killed_allocated ELSE 0 END,
        persons_injured_bymotorcycle = CASE WHEN (hasvehicle_motorcycle is TRUE) THEN persons_injured_allocated ELSE 0 END,
        persons_killed_bymotorcycle = CASE WHEN (hasvehicle_motorcycle is TRUE) THEN persons_killed_allocated ELSE 0 END,
        --hasvehicle_busvan
        cyclist_injured_bybusvan = CASE WHEN (hasvehicle_busvan is TRUE) THEN cyclist_injured_allocated ELSE 0 END,
        cyclist_killed_bybusvan = CASE WHEN (hasvehicle_busvan is TRUE) THEN cyclist_killed_allocated ELSE 0 END,
        motorist_injured_bybusvan = CASE WHEN (hasvehicle_busvan is TRUE) THEN motorist_injured_allocated ELSE 0 END,
        motorist_killed_bybusvan = CASE WHEN (hasvehicle_busvan is TRUE) THEN motorist_killed_allocated ELSE 0 END,
        pedestrian_injured_bybusvan = CASE WHEN (hasvehicle_busvan is TRUE) THEN pedestrian_injured_allocated ELSE 0 END,
        pedestrian_killed_bybusvan = CASE WHEN (hasvehicle_busvan is TRUE) THEN pedestrian_killed_allocated ELSE 0 END,
        persons_injured_bybusvan = CASE WHEN (hasvehicle_busvan is TRUE) THEN persons_injured_allocated ELSE 0 END,
        persons_killed_bybusvan = CASE WHEN (hasvehicle_busvan is TRUE) THEN persons_killed_allocated ELSE 0 END,
        --hasvehicle_car
        cyclist_injured_bycar = CASE WHEN (hasvehicle_car is TRUE) THEN cyclist_injured_allocated ELSE 0 END,
        cyclist_killed_bycar = CASE WHEN (hasvehicle_car is TRUE) THEN cyclist_killed_allocated ELSE 0 END,
        motorist_injured_bycar = CASE WHEN (hasvehicle_car is TRUE) THEN motorist_injured_allocated ELSE 0 END,
        motorist_killed_bycar = CASE WHEN (hasvehicle_car is TRUE) THEN motorist_killed_allocated ELSE 0 END,
        pedestrian_injured_bycar = CASE WHEN (hasvehicle_car is TRUE) THEN pedestrian_injured_allocated ELSE 0 END,
        pedestrian_killed_bycar = CASE WHEN (hasvehicle_car is TRUE) THEN pedestrian_killed_allocated ELSE 0 END,
        persons_injured_bycar = CASE WHEN (hasvehicle_car is TRUE) THEN persons_injured_allocated ELSE 0 END,
        persons_killed_bycar = CASE WHEN (hasvehicle_car is TRUE) THEN persons_killed_allocated ELSE 0 END,
        --hasvehicle_suv
        cyclist_injured_bysuv = CASE WHEN (hasvehicle_suv is TRUE) THEN cyclist_injured_allocated ELSE 0 END,
        cyclist_killed_bysuv = CASE WHEN (hasvehicle_suv is TRUE) THEN cyclist_killed_allocated ELSE 0 END,
        motorist_injured_bysuv = CASE WHEN (hasvehicle_suv is TRUE) THEN motorist_injured_allocated ELSE 0 END,
        motorist_killed_bysuv = CASE WHEN (hasvehicle_suv is TRUE) THEN motorist_killed_allocated ELSE 0 END,
        pedestrian_injured_bysuv = CASE WHEN (hasvehicle_suv is TRUE) THEN pedestrian_injured_allocated ELSE 0 END,
        pedestrian_killed_bysuv = CASE WHEN (hasvehicle_suv is TRUE) THEN pedestrian_killed_allocated ELSE 0 END,
        persons_injured_bysuv = CASE WHEN (hasvehicle_suv is TRUE) THEN persons_injured_allocated ELSE 0 END,
        persons_killed_bysuv = CASE WHEN (hasvehicle_suv is TRUE) THEN persons_killed_allocated ELSE 0 END,
        --hasvehicle_truck
        cyclist_injured_bytruck = CASE WHEN (hasvehicle_truck is TRUE) THEN cyclist_injured_allocated ELSE 0 END,
        cyclist_killed_bytruck = CASE WHEN (hasvehicle_truck is TRUE) THEN cyclist_killed_allocated ELSE 0 END,
        motorist_injured_bytruck = CASE WHEN (hasvehicle_truck is TRUE) THEN motorist_injured_allocated ELSE 0 END,
        motorist_killed_bytruck = CASE WHEN (hasvehicle_truck is TRUE) THEN motorist_killed_allocated ELSE 0 END,
        pedestrian_injured_bytruck = CASE WHEN (hasvehicle_truck is TRUE) THEN pedestrian_injured_allocated ELSE 0 END,
        pedestrian_killed_bytruck = CASE WHEN (hasvehicle_truck is TRUE) THEN pedestrian_killed_allocated ELSE 0 END,
        persons_injured_bytruck = CASE WHEN (hasvehicle_truck is TRUE) THEN persons_injured_allocated ELSE 0 END,
        persons_killed_bytruck = CASE WHEN (hasvehicle_truck is TRUE) THEN persons_killed_allocated ELSE 0 END,    
        --hasvehicle_other_unspecified
        cyclist_injured_byother = CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN cyclist_injured_allocated ELSE 0 END,
        cyclist_killed_byother = CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN cyclist_killed_allocated ELSE 0 END,
        motorist_injured_byother = CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN motorist_injured_allocated ELSE 0 END,
        motorist_killed_byother = CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN motorist_killed_allocated ELSE 0 END,
        pedestrian_injured_byother = CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN pedestrian_injured_allocated ELSE 0 END,
        pedestrian_killed_byother = CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN pedestrian_killed_allocated ELSE 0 END,
        persons_injured_byother = CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN persons_injured_allocated ELSE 0 END,
        persons_killed_byother = CASE WHEN (hasvehicle_other_unspecified is TRUE) THEN persons_killed_allocated ELSE 0 END

      WHERE
            date_val between '2019-01-01 00:00:00'::timestamp  and '2019-4-30 23:59:00'::timestamp

Ran for same time buckets.

danrademacher commented 4 years ago

And then our working summary query we can finally use to dynamically test per mode calcs and see if this stuff adds up:

SELECT 
    SUM(number_of_cyclist_injured) as cyclist_injured,
    SUM(number_of_cyclist_killed) as cyclist_killed,
    SUM(number_of_motorist_injured) as motorist_injured,
    SUM(number_of_motorist_killed) as motorist_killed,
    SUM(number_of_pedestrian_injured) as pedestrian_injured,
    SUM(number_of_pedestrian_killed) as pedestrian_killed,
    SUM((number_of_pedestrian_injured + number_of_cyclist_injured + number_of_motorist_injured)) as persons_injured,
    SUM((number_of_pedestrian_killed + number_of_cyclist_killed + number_of_motorist_killed)) as persons_killed,
    SUM(cyclist_injured_bybike) as cyclist_injured_bybike,
    SUM(cyclist_killed_bybike) as cyclist_killed_bybike,
    SUM(motorist_injured_bybike) as motorist_injured_bybike,
    SUM(motorist_killed_bybike) as motorist_killed_bybike,
    SUM(pedestrian_injured_bybike) as pedestrian_injured_bybike,
    SUM(pedestrian_killed_bybike) as pedestrian_killed_bybike,
    SUM(persons_injured_bybike) as persons_injured_bybike,
    SUM(persons_killed_bybike) as persons_killed_bybike,
    SUM(cyclist_injured_byscooter) as cyclist_injured_byscooter,
    SUM(cyclist_killed_byscooter) as cyclist_killed_byscooter,
    SUM(motorist_injured_byscooter) as motorist_injured_byscooter,
    SUM(motorist_killed_byscooter) as motorist_killed_byscooter,
    SUM(pedestrian_injured_byscooter) as pedestrian_injured_byscooter,
    SUM(pedestrian_killed_byscooter) as pedestrian_killed_byscooter,
    SUM(persons_injured_byscooter) as persons_injured_byscooter,
    SUM(persons_killed_byscooter) as persons_killed_byscooter,
    SUM(cyclist_injured_bymotorcycle) as cyclist_injured_bymotorcycle,
    SUM(cyclist_killed_bymotorcycle) as cyclist_killed_bymotorcycle,
    SUM(motorist_injured_bymotorcycle) as motorist_injured_bymotorcycle,
    SUM(motorist_killed_bymotorcycle) as motorist_killed_bymotorcycle,
    SUM(pedestrian_injured_bymotorcycle) as pedestrian_injured_bymotorcycle,
    SUM(pedestrian_killed_bymotorcycle) as pedestrian_killed_bymotorcycle,
    SUM(persons_injured_bymotorcycle) as persons_injured_bymotorcycle,
    SUM(persons_killed_bymotorcycle) as persons_killed_bymotorcycle,
    SUM(cyclist_injured_bybusvan) as cyclist_injured_bybusvan,
    SUM(cyclist_killed_bybusvan) as cyclist_killed_bybusvan,
    SUM(motorist_injured_bybusvan) as motorist_injured_bybusvan,
    SUM(motorist_killed_bybusvan) as motorist_killed_bybusvan,
    SUM(pedestrian_injured_bybusvan) as pedestrian_injured_bybusvan,
    SUM(pedestrian_killed_bybusvan) as pedestrian_killed_bybusvan,
    SUM(persons_injured_bybusvan) as persons_injured_bybusvan,
    SUM(persons_killed_bybusvan) as persons_killed_bybusvan,
    SUM(cyclist_injured_bycar) as cyclist_injured_bycar,
    SUM(cyclist_killed_bycar) as cyclist_killed_bycar,
    SUM(motorist_injured_bycar) as motorist_injured_bycar,
    SUM(motorist_killed_bycar) as motorist_killed_bycar,
    SUM(pedestrian_injured_bycar) as pedestrian_injured_bycar,
    SUM(pedestrian_killed_bycar) as pedestrian_killed_bycar,
    SUM(persons_injured_bycar) as persons_injured_bycar,
    SUM(persons_killed_bycar) as persons_killed_bycar,
    SUM(cyclist_injured_bysuv) as cyclist_injured_bysuv,
    SUM(cyclist_killed_bysuv) as cyclist_killed_bysuv,
    SUM(motorist_injured_bysuv) as motorist_injured_bysuv,
    SUM(motorist_killed_bysuv) as motorist_killed_bysuv,
    SUM(pedestrian_injured_bysuv) as pedestrian_injured_bysuv,
    SUM(pedestrian_killed_bysuv) as pedestrian_killed_bysuv,
    SUM(persons_injured_bysuv) as persons_injured_bysuv,
    SUM(persons_killed_bysuv) as persons_killed_bysuv,
    SUM(cyclist_injured_bytruck) as cyclist_injured_bytruck,
    SUM(cyclist_killed_bytruck) as cyclist_killed_bytruck,
    SUM(motorist_injured_bytruck) as motorist_injured_bytruck,
    SUM(motorist_killed_bytruck) as motorist_killed_bytruck,
    SUM(pedestrian_injured_bytruck) as pedestrian_injured_bytruck,
    SUM(pedestrian_killed_bytruck) as pedestrian_killed_bytruck,
    SUM(persons_injured_bytruck) as persons_injured_bytruck,
    SUM(persons_killed_bytruck) as persons_killed_bytruck,
    SUM(cyclist_injured_byother) as cyclist_injured_byother,
    SUM(cyclist_killed_byother) as cyclist_killed_byother,
    SUM(motorist_injured_byother) as motorist_injured_byother,
    SUM(motorist_killed_byother) as motorist_killed_byother,
    SUM(pedestrian_injured_byother) as pedestrian_injured_byother,
    SUM(pedestrian_killed_byother) as pedestrian_killed_byother,
    SUM(persons_injured_byother) as persons_injured_byother,
    SUM(persons_killed_byother) as persons_killed_byother,
    year || '-' || LPAD(month::text, 2, '0') as year_month
FROM crashes_all_prod_vtype_tests
Where year >=2019
GROUP BY year_month
danrademacher commented 4 years ago

Now live for testing in this gSheet: https://docs.google.com/spreadsheets/d/16ucfdJI0MANCSGCHuVaHzPlUMlFTsvolkz8aaHrrEFw/edit#gid=350030631

Even for a mode like busvan that is uncommon compared to car the single vtype query for hasvehicle_busvancombined with these allocations produces results that make sense -- other vehicles are involved but busvan is the most involved: image

danrademacher commented 4 years ago

@gregallensworth after you tackle #25 for the overall batch queries, we're ready to start rolling in the Apportionment queries into the ETL as well. This issue should contain all the info you'll need, but very likely not in the format you will need it, since it is a sequence of everything I have done to get to the final queries. But I'm likely to botch the translation if I try to do it at the level of SQL.

This comment lays out the high level goal, then we can discuss live as needed:=. I think this'll take some brain time to absorb and then see if my approach needs to be modified.

  1. One-time task: Add fields to hold the eventual apportioned data. I did that in the test table with the query in this comment,. https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/issues/24#issuecomment-595573633

  2. One one-time task and one dynamic task: Set whether to "blame" bicycles or not and also calculate the blame factor for all blamable vehicles. I did that in this comment, https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/issues/24#issuecomment-595618232. Note that I was on the fence about whether to add blame_factor and bike_blame as permanent residents in the table vs a subquery. They are disposable once we have the fields made in step 1 populated, but I was concerned about query performance so it seemed like maybe baking them in was better. Might be different in the batch API.

  3. Calculate the allocations. Again, in this comment https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/issues/24#issuecomment-595625235, I added some permanent fields to hold the allocations and then used those to finally populate the fields made in step 1. I would defer to you on what's better when it comes to nested subqueries that are slower but don't clog the production database with fields we don't use the actual app, or extra columns in the DB to avoid subqueries.

gregallensworth commented 4 years ago

Commit bdb3831

I have integrated the 3 UPDATE queries into the ETL script.

The three queries use "WHERE somefield IS NULL" clauses, meaning they will only update where the number is not yet calculated. This means that although today's running time for 2 million records was about 5 minutes for each of the 3 queries, future runs should take a few seconds.

One example query and output:

make_carto_sql_api_request("""
SELECT
MAX(persons_injured_bytruck) AS it,
MAX(persons_injured_bysuv) AS is,
MAX(persons_injured_bycar) AS ic,
MAX(persons_injured_bybusvan) AS ib,
MAX(persons_injured_byscooter) AS isc,
MAX(persons_injured_bybike) AS ibk,
MAX(persons_injured_bymotorcycle) AS imc
from crashes_all_prod
""")

[
{"it":12,"is":17,"ic":27,"ib":24,"isc":2,"ibk":11,"imc":5}
]

And a screenshot from CARTO's web UI :

image

gregallensworth commented 4 years ago

Deployed.