Summary
The AppNet agent currently monitors its connection to the EMS and marks itself unhealthy only after it has been disconnected from the control plane EMS for 3 hours. If the relay container exits and fails to restart, possibly due to resource constraints, the AppNet agent loses its connection to the EMS, but the task is still not marked unhealthy until the 3-hour grace period has elapsed.
This change switches the SC health check computation to a flip-based check, with an initial threshold of 10 flips within 30 minutes, so that the impact of a relay agent failure is detected earlier.

SIM: https://sim.amazon.com/issues/LATTICE-BE-10167
Implementation details
FlipTimeStamps tracks the connection between the Envoy proxy and the EMS. It records the time of each connection status flip, and timestamps older than 30 minutes are removed. If the length of FlipTimeStamps reaches or exceeds the threshold (10 flips within 30 minutes), the health status is marked Unhealthy. Otherwise it remains healthy.
Testing
New tests cover the changes: yes
Manual build succeeded locally.
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.