CDCgov / prime-reportstream

ReportStream is a public intermediary tool for delivery of data between different parts of the healthcare ecosystem.
https://reportstream.cdc.gov
Creative Commons Zero v1.0 Universal

Bug: Batch queue contains Requests to process reports that failed to make it into the system. #2180

Open jimduff-usds opened 3 years ago

jimduff-usds commented 3 years ago

I believe bug #2178 is causing this after-effect bug:

In today's incident, two reports somehow made it into the batch queue but never made it into the REPORT_FILE table, so when Batch wakes up to run, it can't find them.

I think I can see the problem - we're swallowing an exception on ReportFunction.kt line 186, then writing to the queue after it, in ReportFunction.kt line 194.
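For illustration, here's a minimal sketch of that suspected pattern. The names (ActionRecorder, QueueClient, processReport) are hypothetical, not the actual ReportFunction code; the point is only that when the DB write that records the report throws and the exception is swallowed, the queue write that follows still happens, leaving a queued report with no REPORT_FILE row behind it.

```kotlin
// Hypothetical sketch of the suspected failure pattern; these names are illustrative
// stand-ins, not the real ReportStream classes.
fun interface ActionRecorder { fun recordAction(reportId: String) }
fun interface QueueClient { fun send(reportId: String) }

fun processReport(reportId: String, recorder: ActionRecorder, queue: QueueClient) {
    try {
        recorder.recordAction(reportId)       // the DB write that can fail
    } catch (e: Exception) {
        println("swallowed: ${e.message}")    // exception swallowed, nothing aborts the request
    }
    queue.send(reportId)                      // queue write still runs, so Batch later
                                              // looks for a report that was never recorded
}

fun main() {
    // Simulate the DB write failing; the message is still queued anyway.
    processReport(
        "dda5857a-5f16-4e59-991f-02ff0af32d85",
        { throw IllegalStateException("insert failed") },
        { println("queued batch work for $it") }
    )
}
```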

jimduff-usds commented 3 years ago

There's a bit of trickiness to recreate this bug.

First, this query is useful: select * from task T where T.created_at > now() - interval '10 minutes';

Then give yourself time: create fake data for AL, then change al-phd to batch 360 times a day instead of 1440. Then comment out the line workflowEngine.recordAction in ReportFunction.kt line 194, so that the recordAction never takes place. This simulates an exception thrown at this point in the code. If you want, you can throw an exception instead, but you'll get the same effect either way.
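A hypothetical sketch of that simulation step (illustrative names only, not the actual ReportFunction code): whether you comment out the recordAction call or throw in its place, the end state is the same, a queued task with no REPORT_FILE row.

```kotlin
// Hypothetical stand-in for the spot being edited in ReportFunction.kt.
fun simulateBadSubmission(reportId: String, enqueue: (String) -> Unit) {
    // workflowEngine.recordAction(actionHistory)        // commented out, as in the repro
    // throw IllegalStateException("simulated failure")  // equivalent alternative
    enqueue(reportId)                                    // the batch queue write still happens
}
```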

Then do a curl submit of a file. When BatchFunction runs, it'll throw the exception below.

Then go back and uncomment line 194 above, recompile/restart, and submit the same curl a few more times, so those submissions are good. All of this simulates a batch with one bad report and several good ones. Do it all within your allotted time before the batch function triggers (that's where the query above is useful).

The true badness of this error is that one file in the batch with a missing Report_ID poisons the entire batch. When the Batch function wakes up, it attempts to run both the failed and the successful reports, but the failed one isn't in the database, causing the entire function to fail. And it never runs again. The end.
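To make the poisoning concrete, here's a minimal sketch (hypothetical names, not the real BatchFunction/WorkflowEngine code) of a batch handler that, like the stack trace below, throws as soon as one task's report id is missing from REPORT_FILE, so the good reports in the same batch never go out either, and nothing retries the event.

```kotlin
// Hypothetical sketch of the poisoned-batch behavior; not the real BatchFunction.
class MissingReportFile(id: String) :
    IllegalStateException("Could not find $id in REPORT_FILE")

fun handleBatchEvent(taskReportIds: List<String>, reportFileIds: Set<String>) {
    // Everything for the receiver is handled as one unit: the first missing report id
    // aborts the whole thing, and the good reports are never batched.
    val reports = taskReportIds.map { id ->
        if (id in reportFileIds) id else throw MissingReportFile(id)
    }
    println("batched ${reports.size} reports")
}

fun main() {
    val goodIds = setOf("report-1", "report-2", "report-3")
    val orphanId = "dda5857a-5f16-4e59-991f-02ff0af32d85"   // TASK row exists, REPORT_FILE row doesn't

    try {
        handleBatchEvent(goodIds.toList() + orphanId, goodIds)
    } catch (e: MissingReportFile) {
        // One bad report id fails the entire batch, and nothing requeues the event.
        println("entire batch failed: ${e.message}")
    }
}
```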

This is the exception that occurs when the Batch function wakes up (prior to the fix):

prime_dev_1     | [2021-09-14T13:00:05.752Z] Batch function exception for event: receiver&BATCH&al-phd.elr&2021-09-14T09:00:00-04:00
prime_dev_1     | [2021-09-14T13:00:05.752Z] Result: Batch function exception for event: receiver&BATCH&al-phd.elr&2021-09-14T09:00:00-04:00
prime_dev_1     | [2021-09-14T13:00:05.752Z] Exception: Could not find dda5857a-5f16-4e59-991f-02ff0af32d85 in REPORT_FILE
prime_dev_1     | [2021-09-14T13:00:05.752Z] Stack: java.lang.IllegalStateException: Could not find dda5857a-5f16-4e59-991f-02ff0af32d85 in REPORT_FILE
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at gov.cdc.prime.router.azure.DatabaseAccess.fetchReportFile(DatabaseAccess.kt:193)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at gov.cdc.prime.router.azure.WorkflowEngine$handleReceiverEvent$1.invoke(WorkflowEngine.kt:311)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at gov.cdc.prime.router.azure.WorkflowEngine$handleReceiverEvent$1.invoke(WorkflowEngine.kt:301)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at gov.cdc.prime.router.azure.DatabaseAccess.transact$lambda-0(DatabaseAccess.kt:72)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at org.jooq.impl.DefaultDSLContext.lambda$transaction$5(DefaultDSLContext.java:611)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at org.jooq.impl.DefaultDSLContext.lambda$transactionResult0$3(DefaultDSLContext.java:549)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at org.jooq.impl.Tools$4$1.block(Tools.java:5209)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at java.base/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at org.jooq.impl.Tools$4.get(Tools.java:5206)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at org.jooq.impl.DefaultDSLContext.transactionResult0(DefaultDSLContext.java:597)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at org.jooq.impl.DefaultDSLContext.transactionResult(DefaultDSLContext.java:521)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at org.jooq.impl.DefaultDSLContext.transaction(DefaultDSLContext.java:610)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at gov.cdc.prime.router.azure.DatabaseAccess.transact(DatabaseAccess.kt:72)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at gov.cdc.prime.router.azure.WorkflowEngine.handleReceiverEvent(WorkflowEngine.kt:301)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at gov.cdc.prime.router.azure.BatchFunction.run(BatchFunction.kt:40)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
prime_dev_1     | [2021-09-14T13:00:05.752Z]    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
prime_dev_1     | [2021-09-14T13:00:05.798Z]    at com.microsoft.azure.functions.worker.broker.JavaMethodInvokeInfo.invoke(JavaMethodInvokeInfo.java:22)
prime_dev_1     | [2021-09-14T13:00:05.798Z]    at com.microsoft.azure.functions.worker.broker.EnhancedJavaMethodExecutorImpl.execute(EnhancedJavaMethodExecutorImpl.java:55)
prime_dev_1     | [2021-09-14T13:00:05.798Z]    at com.microsoft.azure.functions.worker.broker.JavaFunctionBroker.invokeMethod(JavaFunctionBroker.java:57)
prime_dev_1     | [2021-09-14T13:00:05.798Z]    at com.microsoft.azure.functions.worker.handler.InvocationRequestHandler.execute(InvocationRequestHandler.java:33)
prime_dev_1     | [2021-09-14T13:00:05.798Z]    at com.microsoft.azure.functions.worker.handler.InvocationRequestHandler.execute(InvocationRequestHandler.java:10)
prime_dev_1     | [2021-09-14T13:00:05.799Z]    at com.microsoft.azure.functions.worker.handler.MessageHandler.handle(MessageHandler.java:45)
prime_dev_1     | [2021-09-14T13:00:05.799Z]    at com.microsoft.azure.functions.worker.JavaWorkerClient$StreamingMessagePeer.lambda$onNext$0(JavaWorkerClient.java:92)
prime_dev_1     | [2021-09-14T13:00:05.799Z]    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
prime_dev_1     | [2021-09-14T13:00:05.799Z]    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
prime_dev_1     | [2021-09-14T13:00:05.799Z]    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
prime_dev_1     | [2021-09-14T13:00:05.799Z]    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
prime_dev_1     | [2021-09-14T13:00:05.799Z]    at java.base/java.lang.Thread.run(Thread.java:829)
prime_dev_1     | [2021-09-14T13:00:05.799Z] .
jimduff-usds commented 3 years ago

Note the root problem is that the TASK table doesn't have a foreign key to the REPORT_FILE table. If it did, no TASK could ever exist without a proper entry in the REPORT_FILE table, and this bug would never occur. However, this is hard to fix because ReportFunction.kt currently inserts into the TASK table first, and that insert is part of a nice transaction that aborts and removes the blob upon failure. The solution is probably to move that entire transaction to the end of ReportFunction (where the actionHistory inserts happen now), but that'll require a LOT of careful checking and testing.
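As a rough illustration of that reordering (hypothetical names and a toy in-memory "database"; the real change would touch ReportFunction's existing transaction and needs the careful testing noted above): the REPORT_FILE / action-history write and the TASK write are committed together, and the queue message is only sent after that commit succeeds, so no queued task can ever point at a missing REPORT_FILE row.

```kotlin
// Toy illustration only: an in-memory stand-in for the ordering described above,
// not the real DatabaseAccess / WorkflowEngine code.
class FakeDb {
    val reportFile = mutableSetOf<String>()
    val task = mutableSetOf<String>()

    // Stand-in for a real transaction: either both inserts stick or neither does.
    fun transact(block: (FakeDb) -> Unit) {
        val rfBefore = reportFile.toSet()
        val taskBefore = task.toSet()
        try {
            block(this)
        } catch (e: Exception) {
            reportFile.apply { clear(); addAll(rfBefore) }
            task.apply { clear(); addAll(taskBefore) }
            throw e
        }
    }
}

fun acceptReport(reportId: String, db: FakeDb, enqueue: (String) -> Unit) {
    db.transact {
        it.reportFile.add(reportId)   // record the report (action history) first...
        it.task.add(reportId)         // ...and the TASK row in the same transaction
    }
    enqueue(reportId)                 // queue write only after the commit succeeds
}

fun main() {
    val db = FakeDb()
    acceptReport("report-1", db) { println("queued batch work for $it") }
    println("REPORT_FILE=${db.reportFile}, TASK=${db.task}")
}
```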