Closed atdservicebot closed 2 years ago
TL;DR: The problem is resolved and was traced back to the authentication errors we had with TXDOT earlier this summer. When our primary account expired, the PDFs we retrieved during that time period weren't CR3s Forms and instead a PDF version of the CRIS login page. That problem was addressed and we got the true CR3 PDFs but we hadn't yet re-run the OCR program on the effected records. Below, I describe the process we went through to re-run the OCR program on around 1,266 crash records...
Longer explanation:
First we checked, over the past 5 years, how many records have a blank/null/empty investigator narrative that have had the OCR program run on its CR3 PDF?
SELECT
*
FROM
"public"."atd_txdot_crashes"
WHERE ("investigator_narrative_ocr" IS NULL
OR "investigator_narrative_ocr" = '')
AND "cr3_ocr_extraction_date" IS NOT NULL
AND "crash_date" > now() - '5 YEAR'::INTERVAL
The query RETURNS 1,266 crashes
The proposed solution was to null out that cr3_ocr_extraction_date
field so that OCR program knows to re-run on all these 1,266 records.
As a first pass test, we focus on the 16 crash record that Xavier reported...
UPDATE
"public"."atd_txdot_crashes"
SET
"cr3_ocr_extraction_date" = NULL
WHERE ("investigator_narrative_ocr" IS NULL
OR "investigator_narrative_ocr" = '')
AND "cr3_ocr_extraction_date" IS NOT NULL
AND "crash_date" > now() - '5 YEAR'::INTERVAL
AND "crash_id" IN(18321640, 18540151, 18544308, 17828062, 18321657, 18322191, 18544238, 17829566, 18321503, 17828037, 18323590, 17823318, 17824749, 17828059, 18321253, 18321405)
Once those values were nulled out, all 16 records were fixed automatically. The OCR program is scheduled to run every 10 min. This dropped the count of records that have a blank/null/empty investigator narrative and have had the OCR program run on its CR3 PDF to 1,253 records.
So we run the query that sets cr3_ocr_extraction_date
on the full batch after the successful test:
UPDATE
"public"."atd_txdot_crashes"
SET
"cr3_ocr_extraction_date" = NULL
WHERE ("investigator_narrative_ocr" IS NULL
OR "investigator_narrative_ocr" = '')
AND "cr3_ocr_extraction_date" IS NOT NULL
AND "crash_date" > now() - '5 YEAR'::INTERVAL
Query 1 OK: UPDATE 1253, 1253 rows affected
After that was completed and the automated OCR program had been running for several couple hours, the number of records that had the OCR program run and still were not able to extract an investigator narrative is down to 448 crashes (in the past 5 years). Only 10 of those crashes are within Austin Full Purpose. So the remaining crashes are likely from either 1) agencies that manually scan their CR3 (which the OCR extraction program is designed to skip), 2) the crash diagram is on an attached page instead of its normal space in the PDF form template, or 3) the text narrative is truly blank.
When I examined the 10 records in Austin Full Purpose a few do have extracted crash diagrams even if we didn't get the narrative text.
With that, this issue can be retired.
Vision Zero (Editor)
I am noticing some crashes do not have a narrative or map populating. Here are some examples:
https://visionzero.austin.gov/editor/#/crashes/
Internet Browser: Chrome
Request ID: DTS21-102895