CloudRequestEngine graceful non-blocking degradation

51Degrees / pipeline-dotnet

51Degrees Pipeline for .NET


Background

When cloud.51degrees.com is unavailable (for example, it is blocked by a firewall, or requests fail for another reason such as an expired resource key), CloudRequestEngine may exhaust resources on the IIS server and cause it to return 503 Service Unavailable: all threads end up stuck waiting on the engine while requests fail or time out.

Threads can get stuck for several reasons:

There are two double-checked locks in CloudRequestEngine. While one thread waits for a response inside a lock, the other threads block on that lock.

The device detection API request has a timeout (2 seconds by default? To be verified). If all threads are waiting on this timeout and there are more incoming requests than threads, every thread ends up waiting and some requests may never be served. IIS would potentially respond with 503, but this needs to be checked.

Objectives

[ ] 1. Build a reproduction scenario using the Cloud / Framework-Web example: deploy it to IIS, limit the number of worker threads and the queue length of the Application Pool, and subject it to load testing.

[ ] 2. Get rid of the double-checked locks and make the accessibleproperties and evidencefilter initialization asynchronous, triggered on object construction (there is a draft patch from James for this). If the properties have not been fetched, we cannot initialize the Pipeline, so check whether there is any recovery mechanism for this (reattempts to initialize the properties). Also check whether access to the Task<> properties is synchronized.

[ ] 3. Add backoff logic for any API request: if a request fails, we record the timestamp and do not attempt any new requests for the RecoveryPeriod from that timestamp (configurable; the default can be 2 seconds).
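
The backoff idea in objective 3 could look roughly like the sketch below. It is only an illustration: `BackoffGate`, `RecordFailure` and `CanSendRequest` are hypothetical names, not part of the pipeline-dotnet API, and the injectable clock is an assumption added to make the logic testable.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of the pipeline-dotnet API.
public sealed class BackoffGate
{
    private readonly TimeSpan _recoveryPeriod;
    private readonly Func<DateTime> _utcNow;

    // Ticks of the last recorded failure; 0 means no failure has been recorded.
    private long _lastFailureTicks;

    public BackoffGate(TimeSpan recoveryPeriod, Func<DateTime> utcNow = null)
    {
        _recoveryPeriod = recoveryPeriod;
        _utcNow = utcNow ?? (() => DateTime.UtcNow);
    }

    // Call when an API request fails: remember the timestamp of the failure.
    public void RecordFailure()
    {
        Interlocked.Exchange(ref _lastFailureTicks, _utcNow().Ticks);
    }

    // Returns false while the recovery period since the last failure
    // has not yet elapsed, so callers can skip the request without blocking.
    public bool CanSendRequest()
    {
        long last = Interlocked.Read(ref _lastFailureTicks);
        if (last == 0)
        {
            return true;
        }
        return _utcNow().Ticks - last >= _recoveryPeriod.Ticks;
    }
}
```

A caller would check `CanSendRequest()` before each cloud request and call `RecordFailure()` in its error path. The timestamp is read and written with `Interlocked`, so no thread waits on a lock while another thread is waiting for the network.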

cc: @BohdanVV

Change 2 breaks the existing API:

it requires passing the cancellation tokens of the whole server and of individual requests.

and the enclosing mechanisms:

when the pipeline has to initialize evidence keys for all flow elements and the first call for evidence keys has failed, we need a way to reattempt loading them.

It was therefore left out of the patch for the current v4.4 and postponed to version/4.5.

Version 4.4 therefore includes only the shut-off and recovery mechanisms, which trigger when a certain number of request failures happen within a certain time window. Three configuration parameters were added: the recovery period (during which we do not send any requests to the cloud, allowing it to recover), the number of requests that need to fail, and the time window within which these failures must happen to enter recovery. A patch to the specification is therefore required to:

describe this new logic
describe the configuration parameters with a note that currently only .NET supports this feature

https://github.com/51Degrees/specifications/blob/main/pipeline-specification/pipeline-elements/cloud-request-engine.md - please create a PR with the changes to the above file, containing the description of the recovery feature and the configuration parameters.

Summary of the feature implemented in version 4.4

If a response from the cloud server is delayed (e.g., due to network issues), it can slow down the client system, potentially causing timeouts. This may lead to consumer requests getting stuck (e.g., waiting for initialization requests or device detection), resulting in poor user experience and possible exhaustion of server resources (e.g., RAM or socket connections).

To prevent this, if a significant number of requests fail within a short time, the CloudRequestEngine can enter a "recovery period". During this time, it skips sending any requests to the cloud server and immediately signals the temporary unavailability of the CloudRequestEngine by throwing a specific exception. For ASP.NET Framework integration, this exception is caught and suppressed (similar to the effect of SuppressProcessExceptions), allowing FlowData to be processed without usable device data, but with error information.

This behavior of the CloudRequestEngine is controlled by the following configuration parameters:

FailuresToEnterRecovery (default: 10): The number of failed cloud requests within the time window defined by FailuresWindowSeconds that will trigger the recovery period.
FailuresWindowSeconds (default: 100): The time window (in seconds) in which the number of failed requests must reach FailuresToEnterRecovery to trigger the recovery period.
RecoverySeconds (default: 60.0): The duration (in seconds) of the recovery period. Setting this to zero or a negative value disables the recovery mechanism.
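
As a rough illustration of how these three parameters interact, the sketch below tracks failure timestamps in a sliding window and enters recovery once the threshold is reached. It is not the actual CloudRequestEngine code: `RecoveryTracker`, `RecordFailure` and `IsInRecovery` are hypothetical names, and details such as clearing the failure history on entering recovery are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; names are illustrative and do not mirror
// the actual CloudRequestEngine implementation.
public sealed class RecoveryTracker
{
    private readonly int _failuresToEnterRecovery;
    private readonly TimeSpan _failuresWindow;
    private readonly TimeSpan _recovery;
    private readonly Func<DateTime> _utcNow;
    private readonly Queue<DateTime> _failures = new Queue<DateTime>();
    private readonly object _sync = new object();
    private DateTime _recoveryEndsAt = DateTime.MinValue;

    public RecoveryTracker(
        int failuresToEnterRecovery = 10,
        double failuresWindowSeconds = 100,
        double recoverySeconds = 60.0,
        Func<DateTime> utcNow = null)
    {
        _failuresToEnterRecovery = failuresToEnterRecovery;
        _failuresWindow = TimeSpan.FromSeconds(failuresWindowSeconds);
        _recovery = TimeSpan.FromSeconds(recoverySeconds);
        _utcNow = utcNow ?? (() => DateTime.UtcNow);
    }

    // True while requests to the cloud should be skipped.
    public bool IsInRecovery
    {
        get { lock (_sync) { return _utcNow() < _recoveryEndsAt; } }
    }

    public void RecordFailure()
    {
        // Zero or negative RecoverySeconds disables the mechanism.
        if (_recovery <= TimeSpan.Zero)
        {
            return;
        }
        lock (_sync)
        {
            DateTime now = _utcNow();
            _failures.Enqueue(now);
            // Drop failures that fell out of the time window.
            while (_failures.Count > 0 && now - _failures.Peek() > _failuresWindow)
            {
                _failures.Dequeue();
            }
            if (_failures.Count >= _failuresToEnterRecovery)
            {
                _recoveryEndsAt = now + _recovery;
                _failures.Clear(); // assumption: start counting afresh after recovery
            }
        }
    }
}
```

The lock here only guards in-memory bookkeeping and is never held across a network call, so it should not reintroduce the blocking described in the background section.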

CloudRequestEngine graceful non-blocking degradation #132