aws / amazon-ecs-service-connect-agent

Amazon ECS Service Connect Agent
Apache License 2.0

Change SC agent health compute logic to health flip #49

Closed Penghaow closed 9 months ago

Penghaow commented 11 months ago

Summary

The AppNet agent currently monitors its connection to the control plane EMS and marks itself unhealthy only after it has been disconnected for 3 hours. If the relay container exits and fails to restart, possibly due to resource constraints, the AppNet agent loses its connection to the EMS, yet the task is not marked unhealthy until that 3-hour grace period has elapsed.

This change switches the SC health check computation to a health-flip approach, with an initial threshold of 10 flips within 30 minutes, so that the impact of a relay agent failure is detected earlier.

SIM: https://sim.amazon.com/issues/LATTICE-BE-10167

Implementation details

FlipTimeStamps tracks the connection between the Envoy proxy and the EMS by recording the time of each connection status flip. Timestamps older than 30 minutes are removed. If the length of FlipTimeStamps reaches or exceeds the threshold (10 flips within 30 minutes), the health status is marked Unhealthy; otherwise it remains Healthy.

Testing

New tests cover the changes: yes. Manual build succeeded locally.

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.