jimduff-usds opened this issue 2 years ago
Exception:
Exception while executing function: Functions.process Result: Failure
Exception: BlobStorageException: Status code 409, "<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobAlreadyExists</Code><Message>The specified blob already exists."
This occurred again on Jan 2, 2023 at 7:39:12pm ET
com.azure.storage.blob.models.BlobStorageException: Status code 409, "<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobAlreadyExists</Code><Message>The specified blob already exists.
RequestId:8ddcefcc-501e-0043-610b-1f4453000000
at com.azure.storage.blob.BlobClient.upload(BlobClient.java:192)
at gov.cdc.prime.router.azure.BlobAccess$Companion.uploadBlob$prime_router(BlobAccess.kt:127)
Immediately prior to that exception is this message in the logs:
gov.cdc.prime.router.azure.DatabaseAccess - Inserted row into ACTION: action_name=process_warning, params= process&PROCESS&2b0f7c0a-15ab-4776-967d-d43e4d6a563b&SkipInvalidItems&&&, action_id=22274601
Report 2b0f7c0a-15ab-4776-967d-d43e4d6a563b is from Sender simple_report.default and has 6 rows.
Also, this error appeared one second earlier, at 7:39:11:
00:39:11.922 [pool-2-thread-93][297] ERROR gov.cdc.prime.router.azure.ProcessFunction - Process function exception for event: process&PROCESS&2b0f7c0a-15ab-4776-967d-d43e4d6a563b&SkipInvalidItems&&&
00:39:11.905 [pool-2-thread-93][297] ERROR gov.cdc.prime.router.azure.WorkflowEngine - Got exception while dispatching to schema covid-19, and rcvr ca-dph.elr-saphire
And a few seconds earlier at 7:39:07, this error:
java.util.concurrent.TimeoutException: Channel response timed out after 60000 milliseconds.
Seven seconds later, the process step ran again on this same input report (2b0f7...), and that run succeeded. (The process step retries several times in a row before giving up.)
You can see the children created by the two distinct runs, several seconds apart, if you run this query:
select action_id, report_id, created_at, receiving_org, receiving_org_svc, item_count, next_action from report_file where report_id in (select child_report_id from report_lineage where parent_report_id = '2b0f7c0a-15ab-4776-967d-d43e4d6a563b') order by created_at;
However, if you query on those two action_ids, you'll see that only the second one created further 'grandchildren' reports. This means that the process step failed after putting rows in the database, but before putting actions onto the queue.
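For reference, a query along these lines (a hypothetical sketch, using only the report_file and report_lineage columns already shown above) would show which of the two actions produced grandchildren:
-- Count grandchildren per process action; only the successful run should have any.
select rf.action_id, count(gl.child_report_id) as grandchild_count
from report_file rf
left join report_lineage gl on gl.parent_report_id = rf.report_id
where rf.report_id in (select child_report_id from report_lineage where parent_report_id = '2b0f7c0a-15ab-4776-967d-d43e4d6a563b')
group by rf.action_id
order by rf.action_id;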
And this query shows that this was the only process failure around that time:
select action_name, created_at from action where action_id > 22204590 and action_name = 'process_warning';
Ignoring for a second the issue of why the blob naming collision is occurring, I did determine, at least in this case, that the system recovered and reprocessed, with proper process-once semantics.
Here are the entries for the process that failed and the process that succeeded:
select action_id, report_id, created_at, receiving_org, receiving_org_svc, item_count, next_action from report_file where report_id in (select child_report_id from report_lineage where parent_report_id = '2b0f7c0a-15ab-4776-967d-d43e4d6a563b') order by created_at;
action_id | report_id | created_at | receiving_org | receiving_org_svc | item_count | next_action
-----------+--------------------------------------+------------------------------+---------------+-------------------+------------+-------------
22274601 | 39e1172b-5bf4-43c1-b97d-3dc30a00538e | 2023-01-03 00:39:11.93297+00 | nj-doh | elr | 1 | batch
22274601 | c624fb3c-ffd4-4df2-95cf-126dc12f2a76 | 2023-01-03 00:39:11.93297+00 | co-phd | elr-hl7 | 1 | batch
22274601 | c5dcd230-329b-4e25-8b3c-def32d87d09e | 2023-01-03 00:39:11.93297+00 | tx-doh | elr | 2 | batch
22274601 | 930093af-4bde-434d-8be8-658ca207143e | 2023-01-03 00:39:11.93297+00 | co-phd | elr-test | 0 |
22274601 | f4f54a03-a442-4d86-8026-d45527d33f5e | 2023-01-03 00:39:11.93297+00 | ms-doh | elr | 1 | batch
22274601 | b726fd9e-628b-4ffd-8a43-660df3f7d9a3 | 2023-01-03 00:39:11.93297+00 | ca-dph | elr-test | 0 |
22274601 | 60a32ffc-2c11-4e85-8c8c-09a51a0f7f12 | 2023-01-03 00:39:11.93297+00 | ca-dph | elr | 0 |
22274601 | f64706b9-3de3-46ce-a93b-5214475390ec | 2023-01-03 00:39:11.93297+00 | nj-doh | elr-test | 1 | batch
22274603 | 170dd0e1-cbc1-48c1-bdd6-b45db3016669 | 2023-01-03 00:39:18.05808+00 | ms-doh | elr | 1 | batch
22274603 | 1eceab23-5623-4276-96c4-5a34fc73ce45 | 2023-01-03 00:39:18.05808+00 | nj-doh | elr | 1 | batch
22274603 | 5fb32fbb-ee12-4da7-a4dc-16402b8f08df | 2023-01-03 00:39:18.05808+00 | co-phd | elr-hl7 | 1 | batch
22274603 | 03ec8c2c-107c-407e-93a3-a47ee88395f4 | 2023-01-03 00:39:18.05808+00 | ca-dph | elr | 0 |
22274603 | ca8559e3-6317-4e7d-a1a9-2f1b88f18af0 | 2023-01-03 00:39:18.05808+00 | ca-dph | elr-test | 0 |
22274603 | 58dfd5af-de97-4e60-a05c-c5e76e6ed6e0 | 2023-01-03 00:39:18.05808+00 | co-phd | elr-test | 0 |
22274603 | 88ce4f06-95ff-440c-8eb7-974fcbe7c703 | 2023-01-03 00:39:18.05808+00 | nj-doh | elr-test | 1 | batch
22274603 | a26dc02c-1413-47a9-9fd1-aada986298bf | 2023-01-03 00:39:18.05808+00 | tx-doh | elr | 2 | batch
22274603 | f4c2e7d4-7038-4659-9cb8-760a4740aac4 | 2023-01-03 00:39:18.05808+00 | ca-dph | elr-saphire | 1 | batch
Note that the second successful run generated data for ca-dph.elr-saphire, while the first failed run did not!!
The body_url is of the form:
reports/batch%2Fca-dph.elr-saphire%2Fcovid-19-f4c2e7d4-7038-4659-9cb8-760a4740aac4-20230103003918.internal.csv
Note that the timestamp is only down to seconds (2023-01-03 00:39:18); however, the name also includes the report_id, which is unique.
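As a sanity check, a hypothetical query like this (assuming body_url is a column on report_file, as the note above implies) should return exactly one row for that blob name:
-- Look for any other report_file row pointing at the same blob name.
select report_id, action_id, created_at
from report_file
where body_url like '%covid-19-f4c2e7d4-7038-4659-9cb8-760a4740aac4-20230103003918.internal.csv';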
So I'm having a very hard time seeing how a duplicate blob name could occur. I'm suspicious of that timeout: maybe it manifests itself as a BlobAlreadyExists error? (For example, if the storage client retried the upload after a first attempt that had actually been committed on the server, the retry would find the blob already there and get a 409.)
BlobAccess.kt::uploadBlob has log messages before ("Starting...") and after ("Done...") every upload. Looking in the logs from around this time, there is this Starting with no corresponding Done:
00:38:07.694 [pool-2-thread-93][297] INFO gov.cdc.prime.router.azure.BlobAccess - Starting uploadBlob of batch/ca-dph.elr-saphire/covid-19-290e82d4-459a-469f-921d-9b5bfbc286e4-20230103003807.internal.csv
Note that this occurred at 7:38:07 ET, exactly 60 seconds before the timeout error at 7:39:07 (see above). Also, there is no evidence of any other blob with that name being uploaded; that report_id never occurs in the report_file table.
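A quick hypothetical check for that orphaned report_id (the UUID from the Starting line above) should come back empty, consistent with the observation that it never made it into report_file:
-- If this returns no rows, the timed-out upload's report never got a report_file entry.
select report_id, action_id, created_at
from report_file
where report_id = '290e82d4-459a-469f-921d-9b5bfbc286e4';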
Based on this evidence, it seems quite possible that the BlobAlreadyExists error is a red herring, and the real error is some kind of timeout or other glitch that weirdly manifests itself as a duplicate blob error.
This bug occurred again on Feb 12, 2023: a process failure with the same pattern of a timeout followed by a BlobAlreadyExists error. Again, I suspect this is not really a BlobAlreadyExists error, but I do not have definitive proof.
The main thing is that the process step re-ran successfully a few seconds later. Here's the timeline, from the logs and database:
22:05:27: The Receive step ran fine. This report had been received on February 12, 2023 at 5:05:27.236 PM, from cue.
22:05:30: The Process step wakes up: 22:05:30.418 [pool-2-thread-6055][276877] INFO gov.cdc.prime.router.azure.ProcessFunction - Process message: process&PROCESS&9c0bf791-3b2e-44b4-820f-7b2ab9843f87&None&&& - {}
Then nothing happens for 20 seconds
22:05:50: Filtering occurs, and it starts to create the Blob for hhsprotect: 22:05:50.046 [pool-2-thread-6055][276877] INFO gov.cdc.prime.router.azure.BlobAccess - Starting uploadBlob of batch/hhsprotect.elr/hhsprotect-covid-19-ed538298-0b45-468e-ac9e-397648e96b63-20230212220550.internal.csv - {}
The TIMEOUT: Then nothing happens for 1 minute.
22:06:50: 22:06:50.116 [reactor-http-epoll-2][127] WARN reactor.netty.http.client.HttpClientConnect - [6ad8e151-5, L:/169.254.254.12:43140 - R:pdhprodstorageaccount.blob.core.windows.net/172.17.7.78:443] The connection observed an error: java.util.concurrent.TimeoutException: Channel response timed out after 60000 milliseconds.
22:06:54: 22:06:54.167 [pool-2-thread-6055][276877] ERROR gov.cdc.prime.router.azure.WorkflowEngine - Got exception while dispatching to schema hhsprotect/hhsprotect-covid-19, and rcvr hhsprotect.elr
22:06:54.172 [pool-2-thread-6055][276877] ERROR gov.cdc.prime.router.azure.ProcessFunction - Process function exception for event: process&PROCESS&9c0bf791-3b2e-44b4-820f-7b2ab9843f87&None&&&
com.azure.storage.blob.models.BlobStorageException: Status code 409, "<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobAlreadyExists</Code><Message>The specified blob already exists.
22:06:54: PROCESS_WARNING: The process action that finished at 22:06:54 was a process_warning action, with this message in the database: "Failed to process message process&PROCESS&9c0bf791-3b2e-44b4-820f-7b2ab9843f87&None&&& 1 times, setting status to process_warning."
22:06:54: Exception while executing function: Functions.process Result: Failure
Exception: BlobStorageException: Status code 409, "<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobAlreadyExists</Code><Message>The specified blob already exists."
Then a new process action ran and finished without error at 22:06:59. The first had no entry for hhsprotect; the second one did.
select * from report_file where report_id in (select child_report_id from report_lineage where parent_report_id = '9c0bf791-3b2e-44b4-820f-7b2ab9843f87') order by action_id, created_at;
select * from action where action_id in (23079777, 23079780);
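A narrower, hypothetical variant of those queries (assuming the receiving_org value is 'hhsprotect', as the blob path suggests) would confirm which of the two actions produced the hhsprotect.elr child:
-- Only the second, successful action should return a row here.
select a.action_id, a.action_name, rf.receiving_org, rf.receiving_org_svc, rf.item_count, rf.created_at
from action a
join report_file rf on rf.action_id = a.action_id
where a.action_id in (23079777, 23079780)
and rf.receiving_org = 'hhsprotect';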
There is no evidence of two uploadBlob calls occurring with the same filename. That is, this filename only occurs once in the logs:
22:05:50.046 [pool-2-thread-6055][276877] INFO gov.cdc.prime.router.azure.BlobAccess - Starting uploadBlob of batch/hhsprotect.elr/hhsprotect-covid-19-ed538298-0b45-468e-ac9e-397648e96b63-20230212220550.internal.csv - {}
To @arnejduranovic's needs-info tag: I'd say this is a real issue, but it has only occurred thrice in the last 30 days and solved itself with a re-run of process, so let's keep it in the Icebox. If it recurs in large numbers, let's re-look at it. Maybe the next step would be a ticket with Microsoft?
Describe the bug ProcessFunction failed with a BlobAlreadyExists exception
Impact Likely, one or more reports failed to complete processing.
To Reproduce Steps to reproduce the behavior:
Expected behavior This error should never occur. I'm pretty sure I haven't seen this before. I haven't looked at the code recently, but doesn't the Process function only produce internal report files, which should be given filenames with UUIDs embedded, and therefore always be unique??
Perplexing. I wonder if this is somehow a race condition between multiple attempts to process?
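One way to sanity-check the uniqueness assumption would be a hypothetical query over report_file.body_url (assuming, as above, that every uploaded blob gets a report_file row):
-- Any body_url that appears more than once would indicate a genuine naming collision.
select body_url, count(*) as copies
from report_file
group by body_url
having count(*) > 1;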
Logs The exception log is in the comment below.