Closed frankhereford closed 2 months ago
Thanks for the extra day on this, y'all, after I realized how much better it could be done based on @johnclary's spreadsheet instead of in an overly general way. This new program unrolls a lot of the loops and allows for small conditions and handling of fiddly bits of data in the various tables. The four functions, one for each table being processed, are very similar, but differ in tiny ways to handle the idiosyncratic nature of each conversion.
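As a rough illustration of the "one function per table" shape described above (all table, column, and function names here are invented for the sketch, not taken from the actual script):

```python
# Illustrative sketch only: names and rules are hypothetical, not from this PR.
def migrate_persons(old_rows):
    """Per-table conversion with its own handling of fiddly bits."""
    out = []
    for row in old_rows:
        new = dict(row)
        # Example of a table-specific condition: treat an empty
        # death time as NULL rather than carrying the empty string over.
        if new.get("prsn_death_time") == "":
            new["prsn_death_time"] = None
        out.append(new)
    return out

print(migrate_persons([{"prsn_id": 1, "prsn_death_time": ""}]))
# -> [{'prsn_id': 1, 'prsn_death_time': None}]
```

Each of the four functions follows this pattern but carries its own small conditions, which is why unrolling the loops beats a single generic converter here.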
Thanks!
Ooops, I deleted this PR's base. Sorry! I have opened https://github.com/cityofaustin/atd-vz-data/pull/1476 as a replacement.
Associated issues
This PR hopes to close https://github.com/cityofaustin/atd-data-tech/issues/17207.
Items to touch on in data model sync
`prsn_death_date` and `prsn_death_time` fields ✅

A note on speed and memory
This program, just by the nature of the issue it addresses, is going to compare an enormous number of data points. Previously, all the subject data was loaded into memory, which offered a huge speedup by eliminating any DB processing and IO latency at the expense of memory usage. This ended up not being workable on all of our dev machines, so I have refactored the script to use the `shelve` library, which is part of the Python distribution. This library offers a dictionary-like, file-backed persistent object.

Testing
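A minimal sketch of how `shelve` behaves as a dictionary-like, file-backed store (the keys and values here are illustrative, not from the script):

```python
import os
import shelve
import tempfile

# Open (or create) a file-backed store; keys must be strings,
# and values are pickled to disk transparently.
path = os.path.join(tempfile.mkdtemp(), "records")
with shelve.open(path) as db:
    db["crash_123"] = {"prsn_death_date": "2024-01-01", "prsn_death_time": "03:15"}

# Reopening the shelf reads the data back from disk rather than
# holding the whole dataset in memory.
with shelve.open(path) as db:
    print(db["crash_123"]["prsn_death_time"])  # -> 03:15
```

The trade-off is exactly the one described above: lookups hit the filesystem instead of RAM, so the script runs slower than the all-in-memory version but fits on dev machines with less memory.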
Local testing for this one, but we'll be using real data. Please let me know if I can help with any of this. I've put some effort into trying to make these instructions as complete as possible, but it's always possible that I've overlooked something. It's reasonable to expect this whole process to take an hour or more, but much of it is waiting for long-running programs to complete.
Unmixing the paint
Hard-mode with all the PDFs, which we don't need (now):

```sh
find . -type f -name "*.csv" -exec cp {} /wherever_you_have_your_checkout/atd-vz-data/atd-toolbox/data_model/python/cris_csvs \;
```
Grab the latest production data and populate your local DB
Apply migrations & metadata
We're going to populate the CRIS side of the data model now with John's script
Next, copy the `env_template` to `env`. The "secret" in it is actually more of a configuration item, so you'll find the value is already populated and ready for local testing.

Let's get the docker image fired up
We've done all the prep work to run the program. Optionally, you can snap a backup of the DB right now, so you can come back to this point in your testing if you want.
Here's a good time to put in some artificial VZ data which we can look for later. You can make this or any other testing change you want with your DB client.
Now we're going to unmix the paint
And here we can look at the new data model to see if our changes show up where we expect them.
Additionally, here is a query you can use to inspect the various `_edits` tables to look for patterns in the overridden data. At first blush, it looks good, with the possible exception of how we're overriding a lot of names using data from the existing VZ data. I suspect that the program may need to be enhanced to build out some more nuanced rules around which names we're intending to pull in. You can modify the `select` query clauses (and/or the `join`s) to get a bead on what the "edited" data looks like in tabular format.

Ship list