Open justadreamer opened 3 weeks ago
Change 2 is breaking the existing API:
and enclosing mechanisms:
thus it was left out of the patch for the current v4.4 and postponed to version/4.5.
Version 4.4 thus includes only the shut-off and recovery mechanisms, which are triggered when a certain number of requests fail within a certain time window. Three configuration parameters were added: the recovery period (during which we don't send any requests to the cloud, allowing it to recover), the number of requests that need to fail, and the time window within which those failures must happen to enter recovery. A patch to the specification is therefore required:
https://github.com/51Degrees/specifications/blob/main/pipeline-specification/pipeline-elements/cloud-request-engine.md - please create a PR with the changes to the above file, containing the description of the recovery feature and the configuration parameters.
Summary of the Feature implemented within 4.4 version
If a response from the cloud server is delayed (e.g., due to network issues), it can slow down the client system, potentially causing timeouts. This may lead to consumer requests getting stuck (e.g., waiting for initialization requests or device detection), resulting in poor user experience and possible exhaustion of server resources (e.g., RAM or socket connections).
To prevent this, if a significant number of requests fail within a short time, the CloudRequestEngine can enter a "recovery period". During this time, it skips sending any requests to the cloud server and immediately signals the temporary unavailability of the CloudRequestEngine by throwing a specific exception. For ASP.NET Framework integration, this exception is caught and suppressed (similar to the effect of SuppressProcessExceptions), allowing FlowData to be processed without usable device data, but with error information.
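The catch-and-suppress behaviour described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the ASP.NET Framework integration code: the exception name `CloudTemporarilyUnavailableError`, the `FlowData` stand-in and the helper `process_suppressing_unavailability` are all hypothetical names introduced for this example.

```python
class CloudTemporarilyUnavailableError(Exception):
    """Hypothetical stand-in for the specific exception thrown during recovery."""


class FlowData:
    """Minimal stand-in for FlowData: holds device data and recorded errors."""

    def __init__(self):
        self.device_data = None  # stays None when the cloud was not queried
        self.errors = []         # error information kept for the caller


def process_suppressing_unavailability(flow_data, engine_process):
    """Run engine_process(flow_data); suppress only the 'unavailable' exception.

    The exception is recorded on flow_data so that the caller can still see
    what went wrong. Any other exception type propagates unchanged.
    """
    try:
        engine_process(flow_data)
    except CloudTemporarilyUnavailableError as exc:
        flow_data.errors.append(exc)
    return flow_data
```

With this shape, a web request handled during the recovery period completes normally: `device_data` is empty and `errors` contains the reason, so the integrating service can decide how to degrade.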
This behavior of the CloudRequestEngine is controlled by the following configuration parameters:
FailuresToEnterRecovery (default: 10): The number of failed cloud requests within the time window defined by FailuresWindowSeconds that will trigger the recovery period.
FailuresWindowSeconds (default: 100): The time window (in seconds) in which the number of failed requests must reach FailuresToEnterRecovery to trigger the recovery period.
RecoverySeconds (default: 60.0): The duration (in seconds) of the recovery period. Setting this to zero or a negative value disables the recovery mechanism.
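To make the interaction of the three parameters concrete, here is a minimal sketch of a failure-window gate, written in Python as a language-neutral illustration. It is an assumption about how such a mechanism could work, not the actual CloudRequestEngine implementation: the class `RecoveryGate`, its methods, the injectable `clock` and the exception name are hypothetical. Only the parameter names and defaults come from the description above. The sketch is not thread-safe; a real engine would need to synchronize access to the shared state.

```python
import time
from collections import deque


class CloudTemporarilyUnavailableError(Exception):
    """Hypothetical exception signalling that the engine is in recovery."""


class RecoveryGate:
    """Illustrative failure-window tracker (not the real engine code)."""

    def __init__(self, failures_to_enter_recovery=10,
                 failures_window_seconds=100,
                 recovery_seconds=60.0,
                 clock=time.monotonic):
        self.failures_to_enter_recovery = failures_to_enter_recovery
        self.failures_window_seconds = failures_window_seconds
        self.recovery_seconds = recovery_seconds
        self._clock = clock           # injectable so the logic can be tested
        self._failures = deque()      # timestamps of recent failures
        self._recovery_until = None   # end of the current recovery period

    def check(self):
        """Call before each cloud request; raises while in recovery."""
        if self.recovery_seconds <= 0:
            return  # zero or negative disables the mechanism
        if self._recovery_until is not None:
            if self._clock() < self._recovery_until:
                raise CloudTemporarilyUnavailableError("in recovery period")
            self._recovery_until = None  # recovery period has elapsed

    def record_failure(self):
        """Call after each failed cloud request."""
        if self.recovery_seconds <= 0:
            return
        now = self._clock()
        self._failures.append(now)
        # Forget failures that have fallen out of the time window.
        while self._failures and now - self._failures[0] > self.failures_window_seconds:
            self._failures.popleft()
        if len(self._failures) >= self.failures_to_enter_recovery:
            self._recovery_until = now + self.recovery_seconds
            self._failures.clear()
```

A caller would invoke `check()` before sending a request and `record_failure()` whenever a request fails or times out. With the defaults, ten failures inside any 100-second span stop all cloud requests for 60 seconds.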
Motivation
Robustness, non-disruption to the integrating service operation.
Background
When cloud.51degrees.com is unavailable (e.g. behind a firewall, or when there are other errors such as an expired resource key), CloudRequestEngine may cause resource exhaustion on the IIS server and make it return 503 Service Unavailable, because all threads get stuck awaiting it while some requests fail or time out.
The causes of threads getting stuck are multiple:
CloudRequestEngine - that may cause one thread to await a response while others block on these locks
Objectives
Cloud / Framework-Web example - try to deploy it to IIS, specify a limited number of worker threads and a limited queue length in the Application Pool, and subject it to load testing
accessibleproperties and evidencefilter initialization asynchronous, triggered on object construction (a draft patch from James for this) - if properties have not been fetched, we cannot initialize the Pipeline - check if there is any recovery mechanism for this (reattempts to initialize properties); also please check whether access to Task<> properties is synchronized
cc: @BohdanVV