cityofaustin / atd-vz-data

The technology that powers the City of Austin's Vision Zero program
https://visionzero.austin.gov/viewer/

"Unmix the paint" and populate the VZ side of the data model #1476

Closed johnclary closed 1 month ago

johnclary commented 2 months ago

This PR is a clone of https://github.com/cityofaustin/atd-vz-data/pull/1459, which John accidentally closed by deleting its base branch.

Associated issues

This PR hopes to close https://github.com/cityofaustin/atd-data-tech/issues/17207.

Items to touch on in data model sync

A note on speed and memory

This program, by the nature of the issue it addresses, is going to compare an enormous number of data points. Previously, all the subject data was loaded into memory, which offered a huge speedup by eliminating DB processing and IO latency at the expense of memory usage. This ended up not being workable on all of our dev machines, so I have refactored the script to use Python's shelve library, which is part of the standard library. It offers a dictionary-like, file-backed persistent object.
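A minimal sketch of the shelve pattern described above, showing how a dict-like, file-backed store lets large datasets live on disk instead of in memory. The key and record used here are hypothetical illustrations, not the PR's actual code:

```python
import os
import shelve
import tempfile

# shelve.open gives a persistent, dictionary-like object backed by files
# on disk, so entries are not kept resident in memory. Keys must be
# strings; values are pickled transparently.
path = os.path.join(tempfile.mkdtemp(), "crash_cache")
with shelve.open(path) as db:
    # store a record keyed by crash_id (hypothetical example data)
    db["19954205"] = {"crash_speed_limit": 42, "travel_direction": 8}
    record = db["19954205"]

print(record["crash_speed_limit"])  # 42
```

The trade-off is that each lookup incurs a disk read and unpickle rather than an in-memory hash lookup, which is the speed-for-memory exchange the note above describes.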

Testing

Local testing for this one, but we'll be using real data. Please let me know if I can help with any of this. I've put some effort into trying to make these instructions as complete as possible, but it's always possible that I've overlooked something. It's reasonable to expect this whole process to take an hour or so, but much of it is waiting for long-running programs to complete.

Unmixing the paint

#### Hard mode with all the PDFs (which we don't need now)
```bash
# from the root of the unpacked full CRIS export, copy every CSV into your checkout
find . -type f -name "*.csv" -exec cp {} /wherever_you_have_your_checkout/atd-vz-data/atd-toolbox/data_model/python/cris_csvs \;
```

#### Easy mode with just the CSVs
```bash
# make your way to atd-vz-data/atd-toolbox/data_model/python/cris_csvs
aws s3 cp s3://vision-zero-cris-exports/history/10-year-data-just-csvs.tar.gz .
tar xzvf 10-year-data-just-csvs.tar.gz
```

```sql
select crash_id, crash_speed_limit from atd_txdot_crashes where crash_id = 19954205;
update atd_txdot_crashes set crash_speed_limit = 42 where crash_id = 19954205;

select crash_id, unit_nbr, travel_direction from atd_txdot_units where crash_id = 19954205 and unit_nbr = 1;
update atd_txdot_units set travel_direction = 8 where crash_id = 19954205 and unit_nbr = 1;

select crash_id, unit_nbr, prsn_nbr, prsn_age from atd_txdot_primaryperson where crash_id = 19954205 and unit_nbr = 1;
update atd_txdot_primaryperson set prsn_age = 42 where crash_id = 19954205 and unit_nbr = 1 and prsn_nbr = 1;

select crash_id, unit_nbr, prsn_nbr, prsn_age from atd_txdot_person where crash_id = 19954205;
update atd_txdot_person set prsn_age = 42 where crash_id = 19954205 and unit_nbr = 1 and prsn_nbr = 2;
```

```sql
select
  crashes_cris.id as crash_id,
  --crashes_edits.*,
  units_cris.id as unit_id,
  --units_edits.*,
  people_cris.id as person_id,
  people_edits.*
from crashes_cris
join crashes_edits on (crashes_cris.id = crashes_edits.id)
join units_cris on (crashes_cris.id = units_cris.crash_id)
join units_edits on (units_cris.id = units_edits.id)
join people_cris on (units_cris.id = people_cris.unit_id)
join people_edits on (people_cris.id = people_edits.id)
order by crashes_cris.crash_date desc, units_cris.unit_nbr, people_cris.prsn_nbr;
```

Ship list

frankhereford commented 1 month ago

@johnclary has created a replacement solution in lieu of this PR's contents, which accomplishes this task by pushing the data comparison operations down into Postgres. You can find that work here: https://github.com/cityofaustin/atd-vz-data/pull/1486.