cityofaustin / atd-data-tech

Austin Transportation Data & Technology Services
17 stars 2 forks source link

VZD Bug - Crash narratives missing even though Diagram and CR3 PDFs exist #4871

Closed mateoclarke closed 3 years ago

mateoclarke commented 3 years ago
mateoclarke commented 3 years ago

Frank is currently working on this and the current lead is that there may be a issue with quotes effecting the graphql mutation.

frankhereford commented 3 years ago

I am currently spinning up a little dev environment for this so I can replicate and debug this issue. I hope to nail this down tomorrow, Friday the 8th.

Here is a query which identifies all the crashes which are in the boat of having a diagram but no narrative:

SELECT crash_id, crash_date, investigator_narrative, cr3_file_metadata->>'diagram_s3_file', cr3_ocr_extraction_date 
FROM atd_txdot_crashes 
WHERE 1 = 1 
AND cr3_ocr_extraction_date IS NOT null 
AND investigator_narrative = '' 
AND cr3_file_metadata->>'diagram_s3_file' IS NOT null
ORDER BY cr3_ocr_extraction_date DESC
frankhereford commented 3 years ago

Dev environment up and running. I've not yet been able to reproduce this bug; all the crashes I've tried the program on are successfully having the narrative OCR and updated into the database via the Hasura endpoint. I have artificially injected spurious single and double quotes into the narrative string prior to passing them into the update function, and the Hasura/JSON/POST mechanism has not failed. Will update with progress when I figure out something new.

frankhereford commented 3 years ago

I think I found the cause of this issue. The column where the narrative is stored is called 'investigator_narrative,' and we opted to use it to store the narrative because it was extant and empty in the database. However, I believe this column is one that exists in the CRIS data, even though they do not populate it. However, now, when a crash is updated via the CRIS ETL, that incoming data is enforcing the column to be blank, which results in this missing-narrative situation. This behavior can be observed in the record history like this:

image

Looking forward to the dev sync to discuss the best route forward, be it a new column or moving the narrative into the cr3_file_metadata JSONB object, or other options, as well as what's the best route forward to repair the records returned by the above SQL query to get them repopulated.

frankhereford commented 3 years ago

Memoranda from the dev sync:

frankhereford commented 3 years ago

I have a one-off script written that is currently populating a local copy of the VZ database with narratives. @sergiogcx - Thank you for the backup from yesterday; it worked great. When it's done, I'll produce the SQL file with UPDATEs mentioned above. It looks to be doing one every few seconds, so the process should be done in 3-4 hours.

frankhereford commented 3 years ago

I sent @mateoclarke & @sergiogcx a gzip'd SQL script via email with ~3800 UPDATE statements to populate the missing narratives. If y'all have anything that I can do to augment this data or help in any other way, please let me know. Thanks!

mateoclarke commented 3 years ago

@frankhereford, the data included in your SQL script has now been added to production. I'm going to go ahead and close this issue. 👏

frankhereford commented 3 years ago

Tonight, in the course of doing QA on a number of crashes prior to making a crash diagram, I started to notice that there were a number of crashes in the VZE that I was encountering that had diagrams but no narrative. I checked the CR3s, and they were digital end-to-end and had narratives, which leads me to expect them to be OCR'able. I wanted to check on this, so I pulled down the latest backup and populated a database with it here on my local server with my local hasura endpoint pointed at it.

With this configuration, I'm able to test the OCR program against "production" data but without the production hasura endpoint.

I have found that the program is successfully doing the OCR, but something was (not currently though, see below) preventing it from getting into the investigator_narrative_ocr field. An example is:

#16575056 which has a nominal CR3 (which I won't post here for PII reasons), and successful OCR, such as:

image

The following query returns 12,887 records:

SELECT crash_id, crash_date, investigator_narrative_ocr, cr3_file_metadata->>'diagram_s3_file', cr3_ocr_extraction_date 
FROM atd_txdot_crashes 
WHERE 1 = 1 
AND cr3_ocr_extraction_date IS NOT null 
AND (investigator_narrative_ocr = '' OR investigator_narrative_ocr is null)
AND cr3_file_metadata->>'diagram_s3_file' IS NOT null
--AND crash_id = 16575056
ORDER BY cr3_ocr_extraction_date DESC

and indicates that 99.99% of the crashes missing narratives in this situation were processed between the 9th and 11th of January.

Based on this, I believe that there is not a currently outstanding bug in the process, but there are 12K records for which a narrative needs to be extracted, recorded and inserted into the database. I believe this group of missing narratives are due to the field, investigator_narrative_ocr, being NULL when my previous query in the above comment from Jan-07 was only was looking for narratives with the empty string.

@sergiogcx @mateoclarke:

I hope this comment illustrates that there is missing data and how to find it. With your agreement and approval, I propose that I use my local PG+Hasura instance, in combination with my little perl script that I used last time provide missing narratives, to extract these narratives and then create a series of ~12,887 UPDATE statements to populate these missing narratives. I'll provide this to y'all, and if you approve, you can apply it to the production database. How does this sound?

mateoclarke commented 3 years ago

Sounds like a plan. Thanks Frank!

frankhereford commented 3 years ago

Thanks for running that script - everything looks great.