CDCgov / prime-reportstream

ReportStream is a public intermediary tool for delivery of data between different parts of the healthcare ecosystem.
https://reportstream.cdc.gov
Creative Commons Zero v1.0 Universal

Investigate causes of 503 responses #5300

Closed: cwinters-usds closed this issue 2 years ago

JosiahSiegel commented 2 years ago

Issue:

Screenshot from 2022-04-18 12-36-49

Log:

AzureDiagnostics
| where (httpStatusCode_s == '503')
    and backendHostname_s startswith 'pdhprod-functionapp'
| where TimeGenerated >= todatetime("4/16/2022") and TimeGenerated <= todatetime("4/19/2022")
| summarize count() by Category, ErrorInfo_s

image

Microsoft documented cause/solution:

Note: this solution appears to apply to the newer Front Door endpoint options and/or not to CDN.

If requests going through Azure Front Door result in a 503 error response code, configure ❗Origin response timeout❗ (in seconds) for the endpoint. You can extend the default timeout to up to 4 minutes, which is 240 seconds. To configure the setting, go to Endpoint manager and select Edit endpoint.

OR

If the timeout doesn't resolve the issue, use a tool like Fiddler or your browser's developer tool to check if the client is sending byte range requests with Accept-Encoding headers. Using this option leads to the origin responding with different content lengths.

If the client is sending byte range requests with Accept-Encoding headers, you have two options. You can disable compression on the origin/Azure Front Door. Or you can create a rules set rule to remove Accept-Encoding from the request for byte range requests.

Community:

Multiple users have reported this issue happening at the same time:

image

Their issue is extremely similar to ours:

Conclusion

Based on findings so far, the Front Door 503 error occurs when the request's "timeTaken" exceeds the send/receive timeout value. In other words, Front Door terminated the request at its timeout before it had received a response. It is likely that something on the Azure end prevented the response from being received in less than 90 seconds.

AzureDiagnostics
| where backendHostname_s startswith 'pdhprod-functionapp'
| where TimeGenerated >= todatetime("4/16/2022") and TimeGenerated <= todatetime("4/19/2022")
| summarize count(), round(avg(todecimal(timeTaken_s)),3) by Category, backendHostname_s, OperationName, ErrorInfo_s

image

image
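One way to test this conclusion might be to count how many of the 503 responses actually ran up to the timeout. The query below is only a sketch that reuses the columns and filters from the queries above. The 90-second threshold is an assumption taken from the conclusion and should be replaced with the timeout actually configured on the front door; the `timedOut`, `total`, and `atOrOverTimeout` names are illustrative.

```kusto
AzureDiagnostics
| where httpStatusCode_s == '503'
    and backendHostname_s startswith 'pdhprod-functionapp'
| where TimeGenerated >= todatetime("4/16/2022") and TimeGenerated <= todatetime("4/19/2022")
// Assumption: 90 seconds is the send/receive timeout; adjust to the configured value.
| extend timedOut = todouble(timeTaken_s) >= 90
// Compare all 503s against those that lasted at least as long as the timeout.
| summarize total = count(), atOrOverTimeout = countif(timedOut) by ErrorInfo_s
```

If the conclusion holds, `atOrOverTimeout` should be close to `total` for the timeout-related `ErrorInfo_s` values.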

JosiahSiegel commented 2 years ago

Follow the failure through the logs.

It appears that the function app returns a successful response quickly, but the front door does not pick up that response.

AzureDiagnostics
| where TimeGenerated >= todatetime('2022-04-16T18:43:35.935Z') and TimeGenerated <= todatetime('2022-04-16T18:43:41')
    and httpStatusCode_s == "503"
| project ResourceType, requestUri_s, timeTaken_s, backendHostname_s, OperationName, ErrorInfo_s

image

FunctionAppLogs
| where TimeGenerated >= todatetime('2022-04-16T18:43:35.935Z') and TimeGenerated <= todatetime('2022-04-16T18:43:41')
    and FunctionInvocationId == "b3aad6c9-8e9d-403b-a174-ee138eec401a"

image

requests
| where timestamp >= todatetime('2022-04-16T18:43:35.935Z') and timestamp <= todatetime('2022-04-16T18:43:41')
| extend seconds=duration / 1000
| where customDimensions.InvocationId == "b3aad6c9-8e9d-403b-a174-ee138eec401a"

image image