CDCgov / prime-reportstream

ReportStream is a public intermediary tool for delivery of data between different parts of the healthcare ecosystem.
https://reportstream.cdc.gov
Creative Commons Zero v1.0 Universal
71 stars 39 forks source link

Multiple production exceptions #6271

Open jeremy-page opened 2 years ago

jeremy-page commented 2 years ago

Describe the bug Exception occurring in prodction

Impact Unknown

To Reproduce Unknown

Expected behavior Not having exceptions :)

Logs

Exception: PSQLException: ERROR: duplicate key value violates unique constraint "action_pkey"
  Detail: Key (action_id)=(17846480) already exists.
Stack: java.lang.reflect.InvocationTargetException
    at jdk.internal.reflect.GeneratedMethodAccessor279.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at com.microsoft.azure.functions.worker.broker.JavaMethodInvokeInfo.invoke(JavaMethodInvokeInfo.java:22)
    at com.microsoft.azure.functions.worker.broker.EnhancedJavaMethodExecutorImpl.execute(EnhancedJavaMethodExecutorImpl.java:55)
    at com.microsoft.azure.functions.worker.broker.JavaFunctionBroker.invokeMethod(JavaFunctionBroker.java:62)
    at com.microsoft.azure.functions.worker.handler.InvocationRequestHandler.execute(InvocationRequestHandler.java:33)
    at com.microsoft.azure.functions.worker.handler.InvocationRequestHandler.execute(InvocationRequestHandler.java:10)
    at com.microsoft.azure.functions.worker.handler.MessageHandler.handle(MessageHandler.java:45)
    at com.microsoft.azure.functions.worker.JavaWorkerClient$StreamingMessagePeer.lambda$onNext$0(JavaWorkerClient.java:92)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.jooq.exception.DataAccessException: SQL [insert into "public"."action" ("action_id", "action_name", "action_params", "action_result", "action_response", "http_status", "content_length", "sender_ip", "sending_org", "sending_org_client", "external_name", "username") values (?, ?::"public"."task_action", ?, ?, cast(? as jsonb), ?, ?, ?, ?, ?, ?, ?) returning "public"."action"."action_id"]; ERROR: duplicate key value violates unique constraint "action_pkey"
  Detail: Key (action_id)=(17846480) already exists.
    at org.jooq_3.15.4.POSTGRES.debug(Unknown Source)
    at org.jooq.impl.Tools.translate(Tools.java:2988)
    at org.jooq.impl.DefaultExecuteContext.sqlException(DefaultExecuteContext.java:639)
    at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:349)
    at org.jooq.impl.TableRecordImpl.storeInsert0(TableRecordImpl.java:175)
    at org.jooq.impl.TableRecordImpl.lambda$storeInsert$0(TableRecordImpl.java:141)
    at org.jooq.impl.RecordDelegate.operate(RecordDelegate.java:143)
    at org.jooq.impl.TableRecordImpl.storeInsert(TableRecordImpl.java:140)
    at org.jooq.impl.UpdatableRecordImpl.store0(UpdatableRecordImpl.java:210)
    at org.jooq.impl.UpdatableRecordImpl.lambda$store$0(UpdatableRecordImpl.java:137)
    at org.jooq.impl.RecordDelegate.operate(RecordDelegate.java:143)
    at org.jooq.impl.UpdatableRecordImpl.store(UpdatableRecordImpl.java:136)
    at org.jooq.impl.UpdatableRecordImpl.store(UpdatableRecordImpl.java:128)
    at gov.cdc.prime.router.azure.ActionHistory.insertAction$prime_router(ActionHistory.kt:608)
    at gov.cdc.prime.router.azure.ActionHistory.insertAll$prime_router(ActionHistory.kt:565)
    at gov.cdc.prime.router.azure.ActionHistory.saveToDb(ActionHistory.kt:554)
    at gov.cdc.prime.router.azure.WorkflowEngine$handleProcessFailure$1.invoke(WorkflowEngine.kt:515)
    at gov.cdc.prime.router.azure.WorkflowEngine$handleProcessFailure$1.invoke(WorkflowEngine.kt:514)
    at gov.cdc.prime.router.azure.DatabaseAccess.transact$lambda-0(DatabaseAccess.kt:81)
    at org.jooq.impl.DefaultDSLContext.lambda$transaction$5(DefaultDSLContext.java:611)
    at org.jooq.impl.DefaultDSLContext.lambda$transactionResult0$3(DefaultDSLContext.java:549)
    at org.jooq.impl.Tools$4$1.block(Tools.java:5252)
    at java.base/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128)
    at org.jooq.impl.Tools$4.get(Tools.java:5249)
    at org.jooq.impl.DefaultDSLContext.transactionResult0(DefaultDSLContext.java:597)
    at org.jooq.impl.DefaultDSLContext.transactionResult(DefaultDSLContext.java:521)
    at org.jooq.impl.DefaultDSLContext.transaction(DefaultDSLContext.java:610)
    at gov.cdc.prime.router.azure.DatabaseAccess.transact(DatabaseAccess.kt:81)
    at gov.cdc.prime.router.azure.WorkflowEngine.handleProcessFailure(WorkflowEngine.kt:514)
    at gov.cdc.prime.router.azure.ProcessFunction.run(ProcessFunction.kt:48)
    ... 15 more
Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "action_pkey"
  Detail: Key (action_id)=(17846480) already exists.
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2675)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2365)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:355)
    at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:490)
    at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:408)
    at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:166)
    at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:118)
    at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeQuery(ProxyPreparedStatement.java:52)
    at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeQuery(HikariProxyPreparedStatement.java)
    at org.jooq.tools.jdbc.DefaultPreparedStatement.executeQuery(DefaultPreparedStatement.java:104)
    at org.jooq.impl.AbstractDMLQuery.executeReturningQuery(AbstractDMLQuery.java:1263)
    at org.jooq.impl.AbstractDMLQuery.execute(AbstractDMLQuery.java:1088)
    at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:335)
    ... 41 more
jimduff-usds commented 2 years ago

The exceptions (about 10 or so) occurred on 8/1/2022. The ACTION table constraint violation appears to have happened for five action_ids 17846480 through 17846484, all at the same time, 07:11:49

I checked at the time, and there were no TASKs left in the Process table, so those failures evidently got picked up on retry and reprocessed.

Here are the five actions already in the ACTION table:

select action_id, action_name, action_params, created_at from action where action_id in (17846480, 17846481, 17846482, 17846483, 17846484);
 action_id | action_name |                               action_params                                |          created_at           
-----------+-------------+----------------------------------------------------------------------------+-------------------------------
  17846480 | process     | process&PROCESS&80ea3554-17e8-411a-8898-b3e93f35ccd0&None&&hhsprotect.elr& | 2022-08-01 14:11:49.503402+00
  17846481 | process     | process&PROCESS&80ea3554-17e8-411a-8898-b3e93f35ccd0&None&&hhsprotect.elr& | 2022-08-01 14:11:49.815911+00
  17846482 | process     | process&PROCESS&80ea3554-17e8-411a-8898-b3e93f35ccd0&None&&hhsprotect.elr& | 2022-08-01 14:11:50.144032+00
  17846483 | process     | process&PROCESS&80ea3554-17e8-411a-8898-b3e93f35ccd0&None&&hhsprotect.elr& | 2022-08-01 14:11:50.440919+00
  17846484 | process     | process&PROCESS&80ea3554-17e8-411a-8898-b3e93f35ccd0&None&&hhsprotect.elr& | 2022-08-01 14:11:50.768997+00
(5 rows)

I believe this is a bug that a report is listed as being processed five times in a row!

Here's what's in the TASK and REPORT_FILE table for the above report_id:

> select * from task where report_id = '80ea3554-17e8-411a-8898-b3e93f35ccd0';
              report_id               | next_action | next_action_at |      schema_name       | receiver_name | item_count | body_format |                                                                                body_url                                                                                 |          created_at           | translated_at | batched_at | sent_at | wiped_at | errored_at | retry_token |         processed_at          
--------------------------------------+-------------+----------------+------------------------+---------------+------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+---------------+------------+---------+----------+------------+-------------+-------------------------------
 80ea3554-17e8-411a-8898-b3e93f35ccd0 | none        |                | direct/abbott-covid-19 |               |          1 | INTERNAL    | https://pdhprodstorageaccount.blob.core.windows.net/reports/process%2Fabbott.default%2Fabbott-covid-19-80ea3554-17e8-411a-8898-b3e93f35ccd0-20220801141049.internal.csv | 2022-08-01 14:10:49.836161+00 |               |            |         |          |            |             | 2022-08-01 14:10:52.694834+00
(1 row)

prime_data_hub=> select * from report_file where report_id = '80ea3554-17e8-411a-8898-b3e93f35ccd0';
              report_id               | action_id | next_action | next_action_at | sending_org | sending_org_client | receiving_org | receiving_org_svc | transport_params | transport_result |      schema_name       | schema_topic |                                                                                body_url                                                                                 | external_name | body_format |                            blob_digest                             | item_count | wiped_at |          created_at           | downloaded_by | item_count_before_qual_filter 
--------------------------------------+-----------+-------------+----------------+-------------+--------------------+---------------+-------------------+------------------+------------------+------------------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+-------------+--------------------------------------------------------------------+------------+----------+-------------------------------+---------------+-------------------------------
 80ea3554-17e8-411a-8898-b3e93f35ccd0 |  17845822 | process     |                |             |                    |               |                   |                  |                  | direct/abbott-covid-19 | covid-19     | https://pdhprodstorageaccount.blob.core.windows.net/reports/process%2Fabbott.default%2Fabbott-covid-19-80ea3554-17e8-411a-8898-b3e93f35ccd0-20220801141049.internal.csv |               | INTERNAL    | \x2a31fe07099f84ec0b309bede568a8fe453b12cdcd1a6d5c3e0e35d5b0f36e03 |          1 |          | 2022-08-01 14:10:49.877883+00 |               |                              
(1 row)

And it does appear that this one abbott record was sent to hhsprotect five times:

> select report_id, created_at, next_action, receiving_org, receiving_org_svc from report_file where report_id in (select child_report_id from report_lineage where parent_report_id = '80ea3554-17e8-411a-8898-b3e93f35ccd0');
              report_id               |          created_at           | next_action | receiving_org | receiving_org_svc 
--------------------------------------+-------------------------------+-------------+---------------+-------------------
 0dc66d3c-6043-4f82-bb42-33143424df90 | 2022-08-01 14:11:50.440919+00 | batch       | hhsprotect    | elr
 74e4135b-51fd-46d2-874f-78904720319c | 2022-08-01 14:11:49.503402+00 | batch       | hhsprotect    | elr
 5b683598-69eb-4752-91e2-1651cbf67732 | 2022-08-01 14:11:50.144032+00 | batch       | hhsprotect    | elr
 780a92e5-61fa-40b6-84c1-2fdcb6b07ed1 | 2022-08-01 14:10:52.659177+00 | batch       | hhsprotect    | elr
 0b7e48da-d734-4dbc-9698-b8aac39bf359 | 2022-08-01 14:11:49.815911+00 | batch       | hhsprotect    | elr
 568bf9e0-0b2e-4b66-951d-41e2402e3cf4 | 2022-08-01 14:11:50.768997+00 | batch       | hhsprotect    | elr
(6 rows)

This is the first time I've seen this kind of error, so perhaps its a rare glitch in the system, not a pattern. Still it breaks our run-once semantics.