GLEIF-IT / reg-pilot

A project to manage reg-pilot related issues

How do we occasionally test deployed instances and report errors? #52

Open 2byrds opened 4 weeks ago

2byrds commented 4 weeks ago

For instance, our witnesses, API, and verifier are deployed to dev and test. But do we test them daily/automatically to determine whether they are healthy?

ronakseth96 commented 3 weeks ago

We have implemented most of these things and are in the final step of setting up email alerts.

Service Health Checks:
Most of these services are currently configured with health checks that monitor their operational status. The checks examine each service at 5-second intervals to verify that it is functioning as expected. If a service becomes unhealthy, the copilot triggers an automatic restart to minimize downtime and restore service.
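To illustrate the poll-and-restart behavior described above, here is a minimal sketch in Python. It is not the deployed configuration: the `http_probe` and `watch` helpers, the 3-failure threshold, and the 2-second timeout are assumptions made for the example; only the 5-second interval comes from this comment.

```python
import time
import urllib.request
from typing import Callable, Optional


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def watch(
    probe: Callable[[], bool],
    restart: Callable[[], None],
    interval: float = 5.0,            # 5-second interval, as in the comment above
    unhealthy_after: int = 3,         # hypothetical: consecutive failures before restart
    max_checks: Optional[int] = None, # None means run forever
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll `probe`; call `restart` after repeated failures. Return restart count."""
    failures = 0
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if probe():
            failures = 0
        else:
            failures += 1
            if failures >= unhealthy_after:
                restart()
                restarts += 1
                failures = 0
        sleep(interval)
    return restarts
```

A caller would pass something like `lambda: http_probe("https://example.invalid/health")` as the probe (the URL is a placeholder) and a restart callback of its own. Requiring several consecutive failures avoids restarting a service on a single dropped request.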

Autoscaling setup:
The test witness service is now configured with autoscaling, allowing it to scale dynamically between a minimum of 1 and a maximum of 2 tasks. The triggers are currently based on CPU and memory usage thresholds, so the service scales up automatically under increased load and scales down when the load decreases.
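The scaling rule described above can be sketched as a small decision function. This is a simplified illustration, not how the AWS autoscaler is implemented: the `desired_tasks` name and the 70% / 30% thresholds are hypothetical, since the actual thresholds are not stated here. Only the 1 to 2 task range comes from this comment.

```python
def desired_tasks(
    current: int,
    cpu_pct: float,
    mem_pct: float,
    min_tasks: int = 1,          # lower bound from the comment above
    max_tasks: int = 2,          # upper bound from the comment above
    scale_out_at: float = 70.0,  # hypothetical threshold
    scale_in_at: float = 30.0,   # hypothetical threshold
) -> int:
    """Return the task count to run, clamped to [min_tasks, max_tasks]."""
    if cpu_pct >= scale_out_at or mem_pct >= scale_out_at:
        # Either metric running hot is enough to add a task.
        target = current + 1
    elif cpu_pct <= scale_in_at and mem_pct <= scale_in_at:
        # Only remove a task when both metrics are quiet.
        target = current - 1
    else:
        target = current
    return max(min_tasks, min(max_tasks, target))
```

Scaling out on either metric but scaling in only when both are low is a common conservative choice, because it avoids removing a task while one resource is still under pressure.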

CloudWatch Monitoring/Alarms:
Besides health checks and autoscaling, we are using AWS CloudWatch to monitor key performance metrics such as CPU and memory usage. A CloudWatch dashboard has been set up for the test witness service, and alarms are configured to trigger when certain thresholds are crossed, which will help us manage performance.
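As a rough sketch of what such an alarm could look like, the snippet below builds the arguments for boto3's CloudWatch `put_metric_alarm` call. It assumes the services run on ECS (hence the `AWS/ECS` namespace and the `ClusterName` / `ServiceName` dimensions); the helper names, alarm name, 80% threshold, and evaluation window are illustrative assumptions, not the values actually configured.

```python
from typing import Any, Dict, Sequence


def cpu_alarm_params(
    cluster: str,
    service: str,
    threshold_pct: float = 80.0,       # hypothetical threshold
    alarm_actions: Sequence[str] = (),  # e.g. SNS topic ARNs to notify
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm (CPU, ECS service)."""
    return {
        "AlarmName": f"{service}-cpu-high",  # hypothetical naming scheme
        "Namespace": "AWS/ECS",              # assumes an ECS-hosted service
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": 60,            # seconds per datapoint
        "EvaluationPeriods": 3,  # alarm after 3 consecutive breaching datapoints
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": list(alarm_actions),
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Create or update the alarm; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

A memory alarm would follow the same shape with `"MetricName": "MemoryUtilization"`.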


Automated Alerts:
The final step is setting up automated alerts that notify us via email when an alarm is activated. This would allow us to identify and address potential service disruptions or performance issues.
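One common way to get email out of a CloudWatch alarm is an SNS topic with email subscriptions. The sketch below assumes that approach; it is not confirmed as the setup being built here. The `ensure_email_alerts` helper, the topic name, and the address in the usage note are hypothetical. The SNS client is passed in so the function can be exercised without AWS access.

```python
from typing import Any, Iterable


def ensure_email_alerts(sns: Any, topic_name: str, addresses: Iterable[str]) -> str:
    """Create (or reuse) an SNS topic, subscribe each address, return the topic ARN.

    `sns` is expected to behave like boto3.client("sns").
    """
    # create_topic returns the existing topic's ARN if the name already exists.
    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]
    for address in addresses:
        # Each recipient must confirm the subscription from the email SNS sends.
        sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=address)
    return topic_arn
```

With real credentials this might be called as `ensure_email_alerts(boto3.client("sns"), "reg-pilot-alerts", ["ops@example.com"])`. The returned ARN can then be listed in an alarm's `AlarmActions` so that a state change publishes to the topic.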

2byrds commented 2 weeks ago

@ronakseth96 thank you for the synopsis! Can you create the necessary follow-on issues and make sure they are in the reg-pilot project?