51Degrees / pipeline-dotnet

51Degrees Pipeline for .NET

CloudRequestEngine graceful non-blocking degradation #132

Open justadreamer opened 3 weeks ago

justadreamer commented 3 weeks ago

Motivation

Robustness and non-disruption of the integrating service's operation.

Background

When cloud.51degrees.com is unavailable (e.g. blocked by a firewall, or failing for other reasons such as an expired resource key), CloudRequestEngine may exhaust resources on the IIS server and cause it to return 503 Service Unavailable, because all threads end up stuck awaiting it while requests fail or time out.

The causes of threads getting stuck are multiple:

Objectives

cc: @BohdanVV

justadreamer commented 2 weeks ago

Change 2 is breaking the existing API:

and enclosing mechanisms:

thus it was left out of the patch for the current v4.4 and postponed to version/4.5.

Version 4.4 therefore includes only the shut-off and recovery mechanism, which is triggered when a certain number of request failures occur within a certain time window. Three configuration parameters were added for it: the recovery period (during which we don't send any requests to the cloud, allowing it to recover), the number of requests that must fail, and the time window within which those failures must occur for the engine to enter recovery. A patch to the specification is therefore required:

https://github.com/51Degrees/specifications/blob/main/pipeline-specification/pipeline-elements/cloud-request-engine.md - please create a PR with the changes to the above file, containing the description of the recovery feature and the configuration parameters.
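For the spec write-up, the shut-off and recovery logic can be illustrated with a small language-neutral sketch. This is not the code from the patch. The class name, the exception name and the three parameter names (`failures_to_enter_recovery`, `failure_window_seconds`, `recovery_seconds`) are all made up here for illustration, and the sketch is single-threaded, whereas the real engine has to guard this state against concurrent requests.

```python
import time
from collections import deque


class CloudUnavailableError(Exception):
    """Hypothetical exception signalling that the engine is in its recovery period."""


class FailureWindowBreaker:
    """Illustrative sketch only: names and parameters are assumptions, not the real API."""

    def __init__(self, failures_to_enter_recovery, failure_window_seconds,
                 recovery_seconds, clock=time.monotonic):
        self.failures_to_enter_recovery = failures_to_enter_recovery
        self.failure_window_seconds = failure_window_seconds
        self.recovery_seconds = recovery_seconds
        self._clock = clock            # injectable so the logic can be tested without sleeping
        self._failures = deque()       # timestamps of recent failures
        self._recovery_until = None    # end of the current recovery period, if any

    def in_recovery(self):
        if self._recovery_until is None:
            return False
        if self._clock() >= self._recovery_until:
            # Recovery period is over: start again with a clean failure history.
            self._recovery_until = None
            self._failures.clear()
            return False
        return True

    def record_failure(self):
        now = self._clock()
        self._failures.append(now)
        # Forget failures that fall outside the time window.
        while self._failures and now - self._failures[0] > self.failure_window_seconds:
            self._failures.popleft()
        if len(self._failures) >= self.failures_to_enter_recovery:
            self._recovery_until = now + self.recovery_seconds

    def call(self, send_request):
        # During recovery, fail immediately instead of touching the network.
        if self.in_recovery():
            raise CloudUnavailableError("cloud requests are suspended during recovery")
        try:
            return send_request()
        except Exception:
            self.record_failure()
            raise
```

The point of the design is that a request arriving during the recovery period costs nothing more than a timestamp comparison, so threads are not held waiting on a server that is known to be failing.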

justadreamer commented 2 weeks ago

Summary of the feature implemented in version 4.4

If a response from the cloud server is delayed (e.g., due to network issues), it can slow down the client system, potentially causing timeouts. This may lead to consumer requests getting stuck (e.g., waiting for initialization requests or device detection), resulting in poor user experience and possible exhaustion of server resources (e.g., RAM or socket connections).

To prevent this, if a significant number of requests fail within a short time, the CloudRequestEngine can enter a "recovery period". During this time, it skips sending any requests to the cloud server and immediately signals the temporary unavailability of the CloudRequestEngine by throwing a specific exception. For ASP.NET Framework integration, this exception is caught and suppressed—similar to the effect of SuppressProcessExceptions—allowing FlowData to be processed without usable device data, but with error information.
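From the integrating side, the suppression described above might look roughly like the following language-neutral sketch. It is an assumption-laden illustration, not the ASP.NET integration code: `CloudUnavailableError`, `process_request` and the dictionary standing in for FlowData are all hypothetical.

```python
class CloudUnavailableError(Exception):
    """Hypothetical stand-in for the specific exception thrown during the recovery period."""


def process_request(evidence, detect):
    """Run detection, suppressing only the temporary-unavailability exception.

    `detect` is any callable that takes evidence and returns device data.
    The returned dict is a hypothetical stand-in for FlowData.
    """
    flow_data = {"device": None, "errors": []}
    try:
        flow_data["device"] = detect(evidence)
    except CloudUnavailableError as exc:
        # No usable device data, but the error is recorded rather than
        # propagated, so the web request itself still completes.
        flow_data["errors"].append(str(exc))
    return flow_data
```

Any other exception type still propagates in this sketch, so only the temporary unavailability is absorbed.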

This behavior of the CloudRequestEngine is controlled by the following configuration parameters: