jimduff-usds opened this issue 2 years ago
Exception:
Exception while executing function: Functions.process Result: Failure
Exception: BlobStorageException: Status code 409, "<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobAlreadyExists</Code><Message>The specified blob already exists."
This occurred again on Jan 2, 2023 at 7:39:12pm ET
com.azure.storage.blob.models.BlobStorageException: Status code 409, "<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobAlreadyExists</Code><Message>The specified blob already exists.
RequestId:8ddcefcc-501e-0043-610b-1f4453000000
at com.azure.storage.blob.BlobClient.upload(BlobClient.java:192)
at gov.cdc.prime.router.azure.BlobAccess$Companion.uploadBlob$prime_router(BlobAccess.kt:127)
Immediately prior to that exception is this message in the logs:
gov.cdc.prime.router.azure.DatabaseAccess - Inserted row into ACTION: action_name=process_warning, params= process&PROCESS&2b0f7c0a-15ab-4776-967d-d43e4d6a563b&SkipInvalidItems&&&, action_id=22274601
Report 2b0f7c0a-15ab-4776-967d-d43e4d6a563b is from Sender simple_report.default and has 6 rows.
Also, this error appeared one second earlier, at 7:39:11:
00:39:11.922 [pool-2-thread-93][297] ERROR gov.cdc.prime.router.azure.ProcessFunction - Process function exception for event: process&PROCESS&2b0f7c0a-15ab-4776-967d-d43e4d6a563b&SkipInvalidItems&&&
00:39:11.905 [pool-2-thread-93][297] ERROR gov.cdc.prime.router.azure.WorkflowEngine - Got exception while dispatching to schema covid-19, and rcvr ca-dph.elr-saphire
And a few seconds earlier at 7:39:07, this error:
java.util.concurrent.TimeoutException: Channel response timed out after 60000 milliseconds.
Seven seconds later, the process step ran again on this same input report (2b0f7...), and that run succeeded. (The process step retries several times in a row before giving up.)
You can see the children created by the two distinct runs, several seconds apart, if you run this query:
select action_id, report_id, created_at, receiving_org, receiving_org_svc, item_count, next_action from report_file where report_id in (select child_report_id from report_lineage where parent_report_id = '2b0f7c0a-15ab-4776-967d-d43e4d6a563b') order by created_at;
However, if you query on those two action_ids, you'll see that only the second one created further 'grandchildren' reports. This means that the process step failed after putting rows in the database, but before putting actions onto the queue.
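For reference, a query along these lines (a hypothetical sketch, using only the report_file and report_lineage columns already shown above) would show which of the two actions produced grandchildren:
-- Count grandchildren per process action; only the successful run should have any.
select rf.action_id, count(gl.child_report_id) as grandchild_count
from report_file rf
left join report_lineage gl on gl.parent_report_id = rf.report_id
where rf.report_id in (select child_report_id from report_lineage where parent_report_id = '2b0f7c0a-15ab-4776-967d-d43e4d6a563b')
group by rf.action_id
order by rf.action_id;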
And this query shows that this was the only process failure around that time:
select action_name, created_at from action where action_id > 22204590 and action_name = 'process_warning';
Ignoring for a second the issue of why the blob naming collision is occurring, I did determine, at least in this case, that the system recovered and reprocessed, with proper process-once semantics.
Here are the entries for the process that failed and the process that succeeded:
select action_id, report_id, created_at, receiving_org, receiving_org_svc, item_count, next_action from report_file where report_id in (select child_report_id from report_lineage where parent_report_id = '2b0f7c0a-15ab-4776-967d-d43e4d6a563b') order by created_at;
action_id | report_id | created_at | receiving_org | receiving_org_svc | item_count | next_action
-----------+--------------------------------------+------------------------------+---------------+-------------------+------------+-------------
22274601 | 39e1172b-5bf4-43c1-b97d-3dc30a00538e | 2023-01-03 00:39:11.93297+00 | nj-doh | elr | 1 | batch
22274601 | c624fb3c-ffd4-4df2-95cf-126dc12f2a76 | 2023-01-03 00:39:11.93297+00 | co-phd | elr-hl7 | 1 | batch
22274601 | c5dcd230-329b-4e25-8b3c-def32d87d09e | 2023-01-03 00:39:11.93297+00 | tx-doh | elr | 2 | batch
22274601 | 930093af-4bde-434d-8be8-658ca207143e | 2023-01-03 00:39:11.93297+00 | co-phd | elr-test | 0 |
22274601 | f4f54a03-a442-4d86-8026-d45527d33f5e | 2023-01-03 00:39:11.93297+00 | ms-doh | elr | 1 | batch
22274601 | b726fd9e-628b-4ffd-8a43-660df3f7d9a3 | 2023-01-03 00:39:11.93297+00 | ca-dph | elr-test | 0 |
22274601 | 60a32ffc-2c11-4e85-8c8c-09a51a0f7f12 | 2023-01-03 00:39:11.93297+00 | ca-dph | elr | 0 |
22274601 | f64706b9-3de3-46ce-a93b-5214475390ec | 2023-01-03 00:39:11.93297+00 | nj-doh | elr-test | 1 | batch
22274603 | 170dd0e1-cbc1-48c1-bdd6-b45db3016669 | 2023-01-03 00:39:18.05808+00 | ms-doh | elr | 1 | batch
22274603 | 1eceab23-5623-4276-96c4-5a34fc73ce45 | 2023-01-03 00:39:18.05808+00 | nj-doh | elr | 1 | batch
22274603 | 5fb32fbb-ee12-4da7-a4dc-16402b8f08df | 2023-01-03 00:39:18.05808+00 | co-phd | elr-hl7 | 1 | batch
22274603 | 03ec8c2c-107c-407e-93a3-a47ee88395f4 | 2023-01-03 00:39:18.05808+00 | ca-dph | elr | 0 |
22274603 | ca8559e3-6317-4e7d-a1a9-2f1b88f18af0 | 2023-01-03 00:39:18.05808+00 | ca-dph | elr-test | 0 |
22274603 | 58dfd5af-de97-4e60-a05c-c5e76e6ed6e0 | 2023-01-03 00:39:18.05808+00 | co-phd | elr-test | 0 |
22274603 | 88ce4f06-95ff-440c-8eb7-974fcbe7c703 | 2023-01-03 00:39:18.05808+00 | nj-doh | elr-test | 1 | batch
22274603 | a26dc02c-1413-47a9-9fd1-aada986298bf | 2023-01-03 00:39:18.05808+00 | tx-doh | elr | 2 | batch
22274603 | f4c2e7d4-7038-4659-9cb8-760a4740aac4 | 2023-01-03 00:39:18.05808+00 | ca-dph | elr-saphire | 1 | batch
Note that the second successful run generated data for ca-dph.elr-saphire, while the first failed run did not!!
The body_url is of the form:
reports/batch%2Fca-dph.elr-saphire%2Fcovid-19-f4c2e7d4-7038-4659-9cb8-760a4740aac4-20230103003918.internal.csv
Note that the timestamp is only down to seconds (2023-01-03 00:39:18); however, the name also includes the report_id, which is unique.
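As a sanity check, a hypothetical query like this (assuming body_url is a column on report_file, as the note above implies) should return exactly one row for that blob name:
-- Look for any other report_file row pointing at the same blob name.
select report_id, action_id, created_at
from report_file
where body_url like '%covid-19-f4c2e7d4-7038-4659-9cb8-760a4740aac4-20230103003918.internal.csv';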
So I'm having a very hard time seeing how a duplicate blob name could occur. I'm suspicious of that timeout: maybe it manifests itself as a BlobAlreadyExists error? (For example, if the storage client retried the upload after a first attempt that had actually been committed on the server, the retry would find the blob already there and get a 409.)
BlobAccess.kt::uploadBlob has log messages before ("Starting...") and after ("Done...") every upload. Looking in the logs from around this time, there is this Starting with no corresponding Done:
00:38:07.694 [pool-2-thread-93][297] INFO gov.cdc.prime.router.azure.BlobAccess - Starting uploadBlob of batch/ca-dph.elr-saphire/covid-19-290e82d4-459a-469f-921d-9b5bfbc286e4-20230103003807.internal.csv
Note that this occurred at 7:38:07 ET, exactly 60 seconds before the timeout error at 7:39:07 (see above). Also, there is no evidence of any other blob with that name being uploaded; that report_id never occurs in the report_file table.
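A quick hypothetical check for that orphaned report_id (the UUID from the Starting line above) should come back empty, consistent with the observation that it never made it into report_file:
-- If this returns no rows, the timed-out upload's report never got a report_file entry.
select report_id, action_id, created_at
from report_file
where report_id = '290e82d4-459a-469f-921d-9b5bfbc286e4';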
Based on this evidence, it seems quite possible that the BlobAlreadyExists error is a red herring, and the real error is some kind of timeout or other glitch that weirdly manifests itself as a duplicate blob error.
This bug occurred again on Feb 12, 2023: a process failure with the same pattern of a timeout followed by a BlobAlreadyExists error. Again, I suspect this is not really a BlobAlreadyExists error, but I do not have definitive proof.
The main thing is that the process step re-ran successfully a few seconds later. Here's the timeline, from the logs and database:
22:05:27: The Receive step ran fine. This report had been received on February 12, 2023 at 5:05:27.236 PM, from cue.
22:05:30: The Process step wakes up: 22:05:30.418 [pool-2-thread-6055][276877] INFO gov.cdc.prime.router.azure.ProcessFunction - Process message: process&PROCESS&9c0bf791-3b2e-44b4-820f-7b2ab9843f87&None&&& - {}
Then nothing happens for 20 seconds
22:05:50: Filtering occurs, and it starts to create the Blob for hhsprotect: 22:05:50.046 [pool-2-thread-6055][276877] INFO gov.cdc.prime.router.azure.BlobAccess - Starting uploadBlob of batch/hhsprotect.elr/hhsprotect-covid-19-ed538298-0b45-468e-ac9e-397648e96b63-20230212220550.internal.csv - {}
The TIMEOUT: Then nothing happens for 1 minute.
22:06:50: 22:06:50.116 [reactor-http-epoll-2][127] WARN reactor.netty.http.client.HttpClientConnect - [6ad8e151-5, L:/169.254.254.12:43140 - R:pdhprodstorageaccount.blob.core.windows.net/172.17.7.78:443] The connection observed an error: java.util.concurrent.TimeoutException: Channel response timed out after 60000 milliseconds.
22:06:54: 22:06:54.167 [pool-2-thread-6055][276877] ERROR gov.cdc.prime.router.azure.WorkflowEngine - Got exception while dispatching to schema hhsprotect/hhsprotect-covid-19, and rcvr hhsprotect.elr
22:06:54.172 [pool-2-thread-6055][276877] ERROR gov.cdc.prime.router.azure.ProcessFunction - Process function exception for event: process&PROCESS&9c0bf791-3b2e-44b4-820f-7b2ab9843f87&None&&&
com.azure.storage.blob.models.BlobStorageException: Status code 409, "<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobAlreadyExists</Code><Message>The specified blob already exists.
22:06:54: PROCESS_WARNING: The process action that finished at 22:06:54 was a process_warning action, with this message in the database: "Failed to process message process&PROCESS&9c0bf791-3b2e-44b4-820f-7b2ab9843f87&None&&& 1 times, setting status to process_warning."
22:06:54: Exception while executing function: Functions.process Result: Failure
Exception: BlobStorageException: Status code 409, "<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobAlreadyExists</Code><Message>The specified blob already exists."
Then a new process action ran and finished without error at 22:06:59. The first had no entry for hhsprotect; the second one did.
select * from report_file where report_id in (select child_report_id from report_lineage where parent_report_id = '9c0bf791-3b2e-44b4-820f-7b2ab9843f87') order by action_id, created_at;
select * from action where action_id in (23079777, 23079780);
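A narrower, hypothetical variant of those queries (assuming the receiving_org value is 'hhsprotect', as the blob path suggests) would confirm which of the two actions produced the hhsprotect.elr child:
-- Only the second, successful action should return a row here.
select a.action_id, a.action_name, rf.receiving_org, rf.receiving_org_svc, rf.item_count, rf.created_at
from action a
join report_file rf on rf.action_id = a.action_id
where a.action_id in (23079777, 23079780)
and rf.receiving_org = 'hhsprotect';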
There is no evidence of two uploadBlob calls occurring with the same filename. That is, this filename only occurs once in the logs:
22:05:50.046 [pool-2-thread-6055][276877] INFO gov.cdc.prime.router.azure.BlobAccess - Starting uploadBlob of batch/hhsprotect.elr/hhsprotect-covid-19-ed538298-0b45-468e-ac9e-397648e96b63-20230212220550.internal.csv - {}
To @arnejduranovic's needs-info tag: I'd say this is a real issue, but it has only occurred thrice in the last 30 days and solved itself with a re-run of process, so let's keep it in the Icebox. If it recurs in large numbers, let's re-look at it. Maybe the next step would be a ticket with Microsoft?
Describe the bug ProcessFunction failed with a BlobAlreadyExists exception
Impact Likely, one or more reports failed to complete processing.
To Reproduce Steps to reproduce the behavior:
Expected behavior This error should never occur. I'm pretty sure I haven't seen this before. I haven't looked at the code recently, but doesn't the Process function only produce internal report files, which should be given filenames with UUIDs embedded, and therefore always be unique??
Perplexing. I wonder if this is somehow a race condition between multiple attempts to process?
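One way to sanity-check the uniqueness assumption would be a hypothetical query over report_file.body_url (assuming, as above, that every uploaded blob gets a report_file row):
-- Any body_url that appears more than once would indicate a genuine naming collision.
select body_url, count(*) as copies
from report_file
group by body_url
having count(*) > 1;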
Logs The exception log is in the comment below.