CDCgov / prime-reportstream

ReportStream is a public intermediary tool for delivery of data between different parts of the healthcare ecosystem.
https://reportstream.cdc.gov
Creative Commons Zero v1.0 Universal
71 stars 39 forks source link

High staging database CPU usage #13624

Open JosiahSiegel opened 6 months ago

JosiahSiegel commented 6 months ago

Problem statement

Numerous staging exceptions:

org.postgresql.util.PSQLException: ERROR: insert or update on table "report_lineage" violates foreign key constraint "report_lineage_parent_report_id_fkey" Detail: Key (parent_report_id)=(ed574384-adb3-4f1c-b028-78de30e7a02a) is not present in table "report_file". 

Note: parent_report_id is not static to the one shown.

What you need to know

image image

Example for research:

//report_lineage:   2024-03-08 13:20:09.570013+00
//Error reported:   2024-03-08 13:20:09.3586498Z
//report_file:      2024-03-08 13:20:09.163705+00

traces 
| where message contains "ed574384-adb3-4f1c-b028-78de30e7a02a" or operation_Id in ("724796342b891d227f6541548cbd683e","72ca77da8df295c44ad096817123d002")
| extend LogLevel_ = tostring(customDimensions.LogLevel), HostInstanceId_ = tostring(customDimensions.HostInstanceId), ThreadName_ = tostring(customDimensions.ThreadName)
| project timestamp, LogLevel_, message, HostInstanceId_, ThreadName_, operation_Id, operation_Name, operation_ParentId, customDimensions

<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

timestamp [Local Time] | LogLevel_ | message | HostInstanceId_ | ThreadName_ | operation_Id | operation_Name -- | -- | -- | -- | -- | -- | -- 3/8/2024, 8:20:09.215 AM | Information | Executing 'Functions.convert-fhir' (Reason='New queue message detected on 'elr-fhir-convert'.', Id=e77f3daa-4fa9-418d-8370-3c02ea22e2da) | c88ac4af-2287-49ac-9115-2f8837092b67 |   | 724796342b891d227f6541548cbd683e | convert-fhir 3/8/2024, 8:20:09.216 AM |   | InvocationRequest received by the Java worker | c88ac4af-2287-49ac-9115-2f8837092b67 | pool-8-thread-1594 | 724796342b891d227f6541548cbd683e |   3/8/2024, 8:20:09.327 AM |   | Starting uploadBlob of route/connectathon.CON_FULL_ELR_SENDER/None-63c77b07-fd4a-4b06-ae26-8f3495d59594-20240308132009.fhir | c88ac4af-2287-49ac-9115-2f8837092b67 | pool-8-thread-1594 | 724796342b891d227f6541548cbd683e |   3/8/2024, 8:20:09.337 AM |   | Done uploadBlob of route/connectathon.CON_FULL_ELR_SENDER/None-63c77b07-fd4a-4b06-ae26-8f3495d59594-20240308132009.fhir | c88ac4af-2287-49ac-9115-2f8837092b67 | pool-8-thread-1594 | 724796342b891d227f6541548cbd683e |   3/8/2024, 8:20:09.344 AM |   | Inserted row into ACTION: action_name=convert, params= {"type":"convert","reportId":"ed574384-adb3-4f1c-b028-78de30e7a02a", action_id=4096198 | c88ac4af-2287-49ac-9115-2f8837092b67 | pool-8-thread-1594 | 724796342b891d227f6541548cbd683e |   3/8/2024, 8:20:09.344 AM | Information |   | c88ac4af-2287-49ac-9115-2f8837092b67 |   |   |   3/8/2024, 8:20:09.353 AM | Error | Executed 'Functions.convert-fhir' (Failed, Id=e77f3daa-4fa9-418d-8370-3c02ea22e2da, Duration=137ms) | c88ac4af-2287-49ac-9115-2f8837092b67 |   | 724796342b891d227f6541548cbd683e | convert-fhir 3/8/2024, 8:20:09.459 AM |   | InvocationRequest received by the Java worker | 517062a1-674d-4656-a957-3ad79be630ea | pool-8-thread-1564 | 72ca77da8df295c44ad096817123d002 |   3/8/2024, 8:20:09.572 AM |   | Starting uploadBlob of route/connectathon.CON_FULL_ELR_SENDER/None-54c552cc-9fb3-4748-9a72-e8366a43c1c8-20240308132009.fhir | 517062a1-674d-4656-a957-3ad79be630ea | pool-8-thread-1564 | 72ca77da8df295c44ad096817123d002 |   3/8/2024, 8:20:09.587 AM |   | Done uploadBlob of route/connectathon.CON_FULL_ELR_SENDER/None-54c552cc-9fb3-4748-9a72-e8366a43c1c8-20240308132009.fhir | 517062a1-674d-4656-a957-3ad79be630ea | pool-8-thread-1564 | 72ca77da8df295c44ad096817123d002 |   3/8/2024, 8:20:09.594 AM |   | Inserted row into ACTION: action_name=convert, params= {"type":"convert","reportId":"ed574384-adb3-4f1c-b028-78de30e7a02a", action_id=4096199 | 517062a1-674d-4656-a957-3ad79be630ea | pool-8-thread-1564 | 72ca77da8df295c44ad096817123d002 |   3/8/2024, 8:20:09.604 AM |   | Action history for action 'convert' has been recorded | 517062a1-674d-4656-a957-3ad79be630ea | pool-8-thread-1564 | 72ca77da8df295c44ad096817123d002 |  

JosiahSiegel commented 6 months ago

I noticed we are calling store() on separate occasions for report_file and report_lineage. I could be wrong, but jooq docs mention store() triggering an insert/update

arnejduranovic commented 3 months ago

This issue can be easily reproduced by running the simple load test in Azure. This test spams the pipeline with messages, which seems to be the way to trigger this issue.