cityofaustin / atd-data-tech

Austin Transportation Data & Technology Services

Update OCR to Handle the new CR3 form for 2023 #9704

Closed patrickm02L closed 1 year ago

patrickm02L commented 2 years ago

A new CR3 form will be implemented on 1/1/23, as mandated by TxDOT. The improvement adds new fields as outlined in the Functional Requirements Specification document, which include:

Updated Code sheet v0.5

In order to manage this change, we will need to update the OCR to capture the data from the new fields. @frankhereford has outlined the following minor changes to restore normal operation:

  1. Define a new constellation of 10 pixels which we expect to be 100% black / #000000. This constellation is used to determine whether the PDF we’re working from is a scan or has been a digital asset since it was created.
  2. Define the new X,Y coordinates that define the extent of the diagram and the extent of the box around the narrative.
  3. Plug those ~20 coordinates into the correct arrays in the Python script which is run by the ETL.
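
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in the ETL script: every coordinate, the names `NEW_FORM_CONSTELLATION`, `DIAGRAM_BOX`, `NARRATIVE_BOX`, `is_digital` and `crop`, and the nested-list page representation are assumptions made for the example. In practice `get_pixel` would wrap whatever image library renders the PDF page, and the real coordinates have to be measured on the 2023 form.

```python
from typing import Callable, List, Sequence, Tuple

Pixel = Tuple[int, int, int]          # (R, G, B)
Point = Tuple[int, int]               # (x, y)
Box = Tuple[int, int, int, int]       # (left, upper, right, lower)

BLACK: Pixel = (0, 0, 0)

# Hypothetical values for illustration only; the real ones must be
# measured on a rendered page of the new form.
NEW_FORM_CONSTELLATION: List[Point] = [(x, 5) for x in range(10, 110, 10)]
DIAGRAM_BOX: Box = (60, 8, 110, 18)
NARRATIVE_BOX: Box = (10, 8, 55, 18)


def is_digital(
    get_pixel: Callable[[int, int], Pixel],
    points: Sequence[Point] = NEW_FORM_CONSTELLATION,
) -> bool:
    """True only if every constellation pixel is exactly #000000.

    A scan almost never yields pure black at all ten points, so a
    single off-black pixel is treated as evidence of a scanned page.
    """
    return all(get_pixel(x, y) == BLACK for x, y in points)


def crop(page: List[List[Pixel]], box: Box) -> List[List[Pixel]]:
    """Cut a rectangular region out of a page stored as rows of pixels."""
    left, upper, right, lower = box
    return [row[left:right] for row in page[upper:lower]]
```

With a check like this, the diagram and narrative regions would only be cropped at the fixed coordinates when `is_digital` returns `True`, since a scanned page cannot be trusted to line up with them.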

Additionally, to detect whether a CR3 is old-style or new-style, we'll need to extend the constellation test so that it tells us both whether the PDF is digital end-to-end and which CR3 form style we're looking at.
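
One possible shape for that extended test is to keep one constellation per form style and report the first style whose points are all pure black. This is only a sketch under assumptions: the style labels, the coordinates, and the names `FORM_CONSTELLATIONS` and `detect_form_style` are hypothetical, and it presumes the constellations are chosen so that each style's points are not all black on the other style.

```python
from typing import Callable, Dict, List, Optional, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B)
Point = Tuple[int, int]        # (x, y)

BLACK: Pixel = (0, 0, 0)

# Hypothetical constellations, one per form style. Real coordinates
# would be measured on each form and picked so the two sets do not
# both match the same page.
FORM_CONSTELLATIONS: Dict[str, List[Point]] = {
    "legacy": [(x, 5) for x in range(10, 110, 10)],
    "2023": [(x, 9) for x in range(15, 115, 10)],
}


def detect_form_style(
    get_pixel: Callable[[int, int], Pixel],
) -> Optional[str]:
    """Return the matching form style, or None for a scan or unknown layout."""
    for style, points in FORM_CONSTELLATIONS.items():
        if all(get_pixel(x, y) == BLACK for x, y in points):
            return style
    return None
```

A `None` result covers both scanned pages and layouts that match neither form, so the caller can skip coordinate-based extraction in either case.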

patrickm02L commented 2 years ago

2023 CR-3 form v.0.10

2023 CR-3 form v0.10-1.jpg 2023 CR-3 form v0.10-2.jpg 2023 CR-3 form v0.10-3.jpg 2023 CR-3 form v0.10-4.jpg

patrickm02L commented 2 years ago

In Product Sync 7/20/22.

frankhereford commented 1 year ago

Here’s the OCR/image extraction script: https://github.com/cityofaustin/atd-airflow/blob/master/dags/python_scripts/cr3_extract_diagram_ocr_narrative.py. It’s in the airflow repo, which is submoduled into the prefect repo, and it is called from here: https://github.com/cityofaustin/atd-prefect/blob/main/flows/vision-zero/cr3_ocr_narrative_extract_diagram/cr3_ocr_narrative_extract_diagram.py