cityofaustin / atd-data-tech

Austin Transportation Data & Technology Services
17 stars 2 forks source link

Query to allow another OCR pass on CR3 files which were impacted by CRIS account issues #6675

Closed frankhereford closed 3 years ago

frankhereford commented 3 years ago

@patrickm02L @mateoclarke @sergiogcx @xavierapostol

Xavier asked me to look into a crash which was missing its OCR'd narrative & the extracted diagram. I did, and I dug into what was going on with it since it has a normal, run of the mill CR3 PDF. I found that there are 851 crashes which meet these criteria:

The OCR attempt dates line up neatly with the time interval that CRIS was misbehaving in relation to Mateo's account. I peeked at the "CR3" files being downloaded during that time period, and that "text/html" mime-type was referring to the "you're not logged in" page that CRIS was returning instead of the PDF binary file.

This is the query that I have put together to investigate and isolate these crash records:

-- pick a different SELECT clause based on your needs

select c.crash_id, c.cr3_file_metadata, c.cr3_ocr_extraction_date, c.investigator_narrative_ocr, ((l.record_json->>'cr3_file_metadata')::json->'mime_type')::text, (c.cr3_file_metadata::json->>'file_size')::integer as file_size, l.record_json
--select count(distinct c.crash_id) as crash_count
--select distinct c.crash_id
from atd_txdot_crashes c
left join atd_txdot_change_log l on (c.crash_id = l.record_crash_id)
where cr3_ocr_extraction_date is not null -- crashes which have had an OCR attempt
and investigator_narrative_ocr is null -- crashes which are missing the OCR narrative
and extract(month from cr3_ocr_extraction_date) = 7 -- restrict the set of crashes to those related to the account issues around Mateo's login
and extract(year from cr3_ocr_extraction_date) = 2021
and (c.cr3_file_metadata::json->>'file_size')::integer > 50000 -- make sure we're looking at crashes for which we have a real CR3 PDF now. the CR3 PDFs are ~100K, the HTML files are ~3K
and ((l.record_json->>'cr3_file_metadata')::json->'mime_type')::text = '"text/html"' -- crashes that once had a HTML "CR3 PDF" file, which was the "you're not logged in page" while Mateo's account was disabled

If you all agree with my assessment, I propose the following query to reset the cr3_ocr_extraction_date field to null which will allow the OCR routine to give these crashes another pass now that there is a PDF in place:

update atd_txdot_crashes set cr3_ocr_extraction_date = null where crash_id in (
  select distinct c.crash_id
  from atd_txdot_crashes c
  left join atd_txdot_change_log l on (c.crash_id = l.record_crash_id)
  where cr3_ocr_extraction_date is not null
  and investigator_narrative_ocr is null
  and extract(month from cr3_ocr_extraction_date) = 7
  and extract(year from cr3_ocr_extraction_date) = 2021
  and (c.cr3_file_metadata::json->>'file_size')::integer > 50000
  and ((l.record_json->>'cr3_file_metadata')::json->'mime_type')::text = '"text/html"'
  )

Thanks!

mateoclarke commented 3 years ago

Sounds like a solid plan to me @frankhereford!

frankhereford commented 3 years ago

Thanks @mateoclarke, I'll apply the second query to staging now. Pending success and some time for anyone else to weigh in, I'll apply it to production first thing in the morning. I appreciate you giving it a once over!

frankhereford commented 3 years ago

Query executed on staging successfully with an execution time of ~10 seconds. It didn't update any rows, but this is expected based on the lag of crash data in that database.

frankhereford commented 3 years ago

Query executed in production with an execution time of 3.1 seconds. Speed up is due to running the subselect stand-alone first to let the DB get the records in question cached into memory. Thank you all for your help getting these crashes reprocessed.