cityofaustin / vision-zero

The technology that powers the City of Austin's Vision Zero program
https://visionzero.austin.gov/viewer/

Return of the OCR'd crash narrative #1568

Closed johnclary closed 1 month ago

johnclary commented 1 month ago

Associated issues

Testing

  1. You will need a copy of the database that includes the change_log_crashes_cris table data. To include just this one change log table in your replica, comment out the public.change_log_crashes_cris entry in the TABLES_TO_IGNORE list in the helper script, as shown below.
# ./vision-zero helper script
TABLES_TO_IGNORE = [
    "public.change_log_crashes",
     #  "public.change_log_crashes_cris",
    "public.change_log_units_cris",
    "public.change_log_people_cris",
    "public.change_log_people",
    "public.change_log_units",
    "public.change_log_people_edits",
    "public.change_log_units_edits",
    "public.change_log_crashes_edits",
]
  2. And now replicate like normal. This will download ~4 GB of data:
./vision-zero replicate-db
  3. Apply migrations and metadata:
hasura migrate apply
hasura metadata apply
  4. Now you will use the change log to backfill narratives that have been erased in recent weeks. Locate the commented-out UPDATE statement at the bottom of the 1727451510064_preserve_crash_narrative up migration (here) and manually execute it in your SQL client. A rough sketch of the idea follows.
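For orientation, the backfill pulls the last known narrative out of the change log and writes it back to the crash record. The statement below is only a rough sketch of that idea, not the migration's actual SQL; the target table (crashes_cris) and change log columns (record_id, record_json) are assumptions, so run the statement from the migration file itself.

-- Sketch only: restore erased narratives from the change log.
-- Assumes the change log stores the prior row state under an "old" key
-- in a record_json column, keyed to the crash by record_id.
update crashes_cris
set investigator_narrative = cl.record_json -> 'old' ->> 'investigator_narrative'
from change_log_crashes_cris as cl
where crashes_cris.id = cl.record_id
    and crashes_cris.investigator_narrative is null
    and cl.record_json -> 'old' ->> 'investigator_narrative' is not null;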

  5. At this point there should be ~800 remaining crashes that are missing a narrative and need to be OCR'd. You can check this by querying the new SQL view:

select count(*) from view_crash_narratives_ocr_todo;
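Conceptually, the view surfaces crashes that have a stored CR3 PDF but no narrative yet. The definition below is a sketch only, not the actual view from the migration, and the cr3_stored_fl column name is an assumption:

-- Sketch of what the "todo" view selects (assumed column names; see the
-- migration for the real definition).
create view view_crash_narratives_ocr_todo as
select
    id,
    cris_crash_id
from
    crashes
where
    investigator_narrative is null
    and cr3_stored_fl = true;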
  6. It's time to test the OCR script. Head to the ./etl/cris_import directory and rebuild the ETL Docker image:
docker compose build
  7. We need to set our environment to use the prod S3 bucket, because it contains all of the CR3 PDFs that we will process. The best way to do this is to save a copy of your existing .env file as prod.local.env and set the BUCKET_ENV value to prod in this new file (this PR renames the env var to BUCKET_ENV). Make sure your Hasura endpoint is set to your local host. A sketch of the env file follows.
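For example, prod.local.env might look roughly like this. BUCKET_ENV is the variable this PR renames; the Hasura variable name shown here is an illustrative assumption, so keep whatever names your existing .env already uses:

# prod.local.env (sketch: copy your real .env and adjust these values)
BUCKET_ENV=prod
# assumed variable name; point your Hasura endpoint at your local instance
HASURA_GRAPHQL_ENDPOINT=http://localhost:8080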

  8. Now it really is time to run the OCR script. Use this command to run the OCR process with your new env file:

docker run -it --rm --env-file prod.local.env --network host -v $PWD:/app cris_import_cris_import ./cr3_ocr_narrative.py --verbose
  9. Nice! You can inspect the new narratives by querying the crashes_edits table:
select
    id,
    updated_at,
    updated_by,
    investigator_narrative
from
    crashes_edits
where
    investigator_narrative is not null;
  10. Now we will test the CRIS import and make sure that our narrative-preserving trigger is working as expected (a sketch of this kind of trigger follows the query below). First, let's inspect a batch of crashes that should have investigator narratives, observing that the narrative is populated for all records:
select
    investigator_narrative
from
    crashes
where
    cris_crash_id in (18510430, 18548351, 18553517, 18553518, 18553519, 18553520, 18553522, 18553525, 18553527, 18553532, 18566225, 18569950, 18578152, 18600315, 18613803, 18615918, 18634346, 18640454, 18640455, 18945007, 19208347);
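For context, the narrative-preserving behavior boils down to a row-level trigger that refuses to blank out an existing narrative. The function below is only an illustrative sketch of that idea, not the actual trigger shipped in the migration, and the table and trigger names are assumptions:

-- Illustrative sketch of a narrative-preserving trigger (assumed names).
-- If an incoming CRIS update would null out an existing narrative,
-- keep the old value instead.
create or replace function preserve_investigator_narrative()
returns trigger as $$
begin
    if new.investigator_narrative is null
        and old.investigator_narrative is not null then
        new.investigator_narrative = old.investigator_narrative;
    end if;
    return new;
end;
$$ language plpgsql;

create trigger preserve_investigator_narrative_on_update
    before update on crashes_cris
    for each row
    execute function preserve_investigator_narrative();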
  11. I have placed a CRIS extract in the dev S3 inbox for testing (extract_2023_20240823135638635_99726_20240920_HAYSTRAVISWILLIAMSON). This extract is missing a large number of crash narratives, including for the crashes listed above. Let's make sure those narratives are not going to be overwritten, thanks to the new DB trigger.

Make sure your BUCKET_ENV is set to dev in your .env file, and run the CRIS import like so:

docker run -it --rm --env-file .env --network host -v $PWD:/app cris_import_cris_import ./cris_import.py --csv --s3-download
  12. Excellent! Now query the DB again and make sure these crash narratives are still intact:
select
    investigator_narrative
from
    crashes
where
    cris_crash_id in (18510430, 18548351, 18553517, 18553518, 18553519, 18553520, 18553522, 18553525, 18553527, 18553532, 18566225, 18569950, 18578152, 18600315, 18613803, 18615918, 18634346, 18640454, 18640455, 18945007, 19208347);

That's it—thank you for testing this PR 👍


Ship list