cityofaustin / atd-data-tech

Austin Transportation Data & Technology Services
17 stars 2 forks source link

Vision Zero Editor narrative and map not populating #7542

Closed atdservicebot closed 2 years ago

atdservicebot commented 2 years ago

What application are you using?

Vision Zero (Editor)

Describe the problem.

I am noticing some crashes do not have a narrative or map populating. Here are some examples:

Website Address

https://visionzero.austin.gov/editor/#/crashes/

Internet Browser: Chrome

Requested By Xavier A.

Request ID: DTS21-102895

mateoclarke commented 2 years ago

TL;DR: The problem is resolved and was traced back to the authentication errors we had with TXDOT earlier this summer. When our primary account expired, the PDFs we retrieved during that time period weren't CR3s Forms and instead a PDF version of the CRIS login page. That problem was addressed and we got the true CR3 PDFs but we hadn't yet re-run the OCR program on the effected records. Below, I describe the process we went through to re-run the OCR program on around 1,266 crash records...

Longer explanation:

First we checked, over the past 5 years, how many records have a blank/null/empty investigator narrative that have had the OCR program run on its CR3 PDF?

SELECT
    *
FROM
    "public"."atd_txdot_crashes"
WHERE ("investigator_narrative_ocr" IS NULL
    OR "investigator_narrative_ocr" = '')
AND "cr3_ocr_extraction_date" IS NOT NULL
AND "crash_date" > now() - '5 YEAR'::INTERVAL

The query RETURNS 1,266 crashes

The proposed solution was to null out that cr3_ocr_extraction_date field so that OCR program knows to re-run on all these 1,266 records.

As a first pass test, we focus on the 16 crash record that Xavier reported...

UPDATE
    "public"."atd_txdot_crashes"
SET
    "cr3_ocr_extraction_date" = NULL
WHERE ("investigator_narrative_ocr" IS NULL
    OR "investigator_narrative_ocr" = '')
AND "cr3_ocr_extraction_date" IS NOT NULL
AND "crash_date" > now() - '5 YEAR'::INTERVAL
AND "crash_id" IN(18321640, 18540151, 18544308, 17828062, 18321657, 18322191, 18544238, 17829566, 18321503, 17828037, 18323590, 17823318, 17824749, 17828059, 18321253, 18321405)

Once those values were nulled out, all 16 records were fixed automatically. The OCR program is scheduled to run every 10 min. This dropped the count of records that have a blank/null/empty investigator narrative and have had the OCR program run on its CR3 PDF to 1,253 records.

So we run the query that sets cr3_ocr_extraction_date on the full batch after the successful test:

UPDATE
    "public"."atd_txdot_crashes"
SET
    "cr3_ocr_extraction_date" = NULL
WHERE ("investigator_narrative_ocr" IS NULL
    OR "investigator_narrative_ocr" = '')
AND "cr3_ocr_extraction_date" IS NOT NULL
AND "crash_date" > now() - '5 YEAR'::INTERVAL

Query 1 OK: UPDATE 1253, 1253 rows affected

After that was completed and the automated OCR program had been running for several couple hours, the number of records that had the OCR program run and still were not able to extract an investigator narrative is down to 448 crashes (in the past 5 years). Only 10 of those crashes are within Austin Full Purpose. So the remaining crashes are likely from either 1) agencies that manually scan their CR3 (which the OCR extraction program is designed to skip), 2) the crash diagram is on an attached page instead of its normal space in the PDF form template, or 3) the text narrative is truly blank.

When I examined the 10 records in Austin Full Purpose a few do have extracted crash diagrams even if we didn't get the narrative text.

With that, this issue can be retired.