How do we occasionally test deployed instances and report errors?

2byrds commented 2 months ago

For instance, our witnesses, API, and verifier are deployed to dev and test. But do we test them daily/automatically to determine whether they are healthy?

ronakseth96 commented 2 months ago

We have implemented most of these things and are in the final step of setting up email alerts.

Service Health Checks: Most of these services are currently configured with health checks that monitor their operational status. The checks probe each service at 5-second intervals to verify it is functioning as expected. If a service becomes unhealthy, the copilot triggers an automatic restart to minimize downtime and restore service.
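A minimal sketch of the health-check logic described above, assuming a service is declared unhealthy after a few consecutive failed probes. The names (`HealthState`, `updateHealth`) and the threshold of 3 are hypothetical illustrations, not the deployed configuration; only the 5-second interval comes from this thread.

```typescript
// Hypothetical sketch: names and the failure threshold are assumptions.
// Only the 5-second probe interval is taken from the thread.
const PROBE_INTERVAL_MS = 5000;

interface HealthState {
  healthy: boolean;
  consecutiveFailures: number;
}

// Fold one probe result into the running state. The service is marked
// unhealthy once `unhealthyThreshold` probes fail in a row, and is marked
// healthy again on the first successful probe.
function updateHealth(
  state: HealthState,
  probeOk: boolean,
  unhealthyThreshold: number = 3,
): HealthState {
  if (probeOk) {
    return { healthy: true, consecutiveFailures: 0 };
  }
  const consecutiveFailures = state.consecutiveFailures + 1;
  return {
    healthy: consecutiveFailures < unhealthyThreshold ? state.healthy : false,
    consecutiveFailures,
  };
}
```

A caller would invoke `updateHealth` once every `PROBE_INTERVAL_MS` and request a restart when `healthy` flips to `false`. Requiring several failures in a row avoids restarting a service over a single dropped probe.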

Autoscaling setup: The test witness service is now configured with autoscaling, allowing it to scale dynamically within a set range of tasks, currently 1 to 2. The triggers are based on CPU and memory usage thresholds, so the service scales up automatically under increased load and scales down when the load decreases.
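The scaling behavior can be sketched roughly as follows. This is a simplified step-style decision, not the actual AWS policy in use; the threshold values (75% and 25%) and all names are assumptions, and only the 1 to 2 task range comes from this thread.

```typescript
// Hypothetical sketch: thresholds and names are illustrative assumptions.
interface Metrics {
  cpuPercent: number;
  memoryPercent: number;
}

interface ScalingPolicy {
  minTasks: number;
  maxTasks: number;
  scaleOutAbove: number; // add a task when CPU or memory exceeds this
  scaleInBelow: number; // remove a task when both fall below this
}

// Decide the next task count from the current count and latest metrics.
// Using the larger of the two metrics means either one can trigger a
// scale-out, while a scale-in requires both to be low.
function desiredTaskCount(
  current: number,
  metrics: Metrics,
  policy: ScalingPolicy,
): number {
  const peak = Math.max(metrics.cpuPercent, metrics.memoryPercent);
  let next = current;
  if (peak > policy.scaleOutAbove) {
    next = current + 1;
  } else if (peak < policy.scaleInBelow) {
    next = current - 1;
  }
  // Clamp to the configured range (1 to 2 tasks in this thread).
  return Math.min(policy.maxTasks, Math.max(policy.minTasks, next));
}
```

With `minTasks: 1` and `maxTasks: 2`, the result never leaves the configured range regardless of load.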

CloudWatch Monitoring/Alarms: Besides health checks and autoscaling, we are using AWS CloudWatch to monitor key performance metrics such as CPU and memory usage. A CloudWatch dashboard has been set up for the test witness service, and alarms are configured to trigger when certain thresholds are crossed, which will help us manage performance.
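As a rough illustration of how a threshold alarm of this kind evaluates, here is a simplified model of "M out of N datapoints" evaluation. It is a sketch under stated assumptions: the function name and parameters are hypothetical, missing-data handling is omitted, and the actual thresholds are not given in this thread.

```typescript
// Hypothetical, simplified model of threshold-alarm evaluation.
type AlarmState = "OK" | "ALARM" | "INSUFFICIENT_DATA";

// Look at the most recent `evaluationPeriods` datapoints and raise the
// alarm when at least `datapointsToAlarm` of them exceed the threshold.
function evaluateAlarm(
  datapoints: number[],
  threshold: number,
  evaluationPeriods: number,
  datapointsToAlarm: number,
): AlarmState {
  if (datapoints.length < evaluationPeriods) {
    return "INSUFFICIENT_DATA";
  }
  const recent = datapoints.slice(-evaluationPeriods);
  const breaching = recent.filter((value) => value > threshold).length;
  return breaching >= datapointsToAlarm ? "ALARM" : "OK";
}
```

Requiring more than one breaching datapoint keeps a brief CPU or memory spike from raising an alarm.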

Automated Alerts: The final step is setting up automated alerts that notify us via email when an alarm is activated. This would allow us to identify and address potential service disruptions or performance issues.
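One way to keep email alerts useful is to send a message only when an alarm changes state, rather than on every evaluation. The sketch below assumes that design; the type and function names, and the message wording, are hypothetical and not part of the actual setup.

```typescript
// Hypothetical sketch: builds an email alert only on a state change.
type AlertState = "OK" | "ALARM";

interface EmailAlert {
  subject: string;
  body: string;
}

// Return an alert when the state changes, or null when nothing changed,
// so a service that stays in ALARM does not produce repeated emails.
function alertForTransition(
  service: string,
  previous: AlertState,
  current: AlertState,
): EmailAlert | null {
  if (previous === current) {
    return null;
  }
  const body =
    current === "ALARM"
      ? `Service ${service} crossed an alarm threshold.`
      : `Service ${service} has recovered.`;
  return { subject: `[${current}] ${service}`, body };
}
```

The returned object would then be handed to whatever email delivery mechanism is configured.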

2byrds commented 2 months ago

@ronakseth96 thank you for the synopsis! Can you create the necessary follow-on issues and make sure they are in the reg-pilot project?

ronakseth96 commented 1 month ago

Updates on service autoscaling, monitoring, and alerts:

Autoscaling setup: Based on recent evaluations, the autoscaling configuration has also been implemented for the verification and API services in the dev domain. This setup enables dynamic scaling between 1 and 2 tasks and is triggered by predefined CPU and memory usage thresholds. After a review found no issues, the same setup was extended to the test domain.
CloudWatch monitoring/alarms: A dedicated CloudWatch dashboard named reg-pilot has been established for both services. This dashboard provides in-depth metrics on memory usage, CPU utilization, and filesystem storage. Here, continuous monitoring will enhance our ability to fine-tune resource capacity planning and optimize performance. 
Automated alerts setup: Manual alerts have been temporarily configured for the witness service while automated email alerts are in progress. These alerts will notify us via email of any performance issues.

2byrds commented 1 month ago

For now, additional alerting/monitoring should be paused in favor of running the reg-pilot test scenarios against each rootsid profile (rootsid-dev and rootsid-test) in resolve-env.ts
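Running the same scenarios against each profile could look roughly like the sketch below. It does not reflect the actual contents of resolve-env.ts or the reg-pilot test suite; `Scenario`, `ScenarioResult`, and `runAcrossProfiles` are hypothetical names, and only the two profile names come from this thread.

```typescript
// Hypothetical sketch: not the actual reg-pilot test harness.
type Scenario = (profile: string) => Promise<void>;

interface ScenarioResult {
  profile: string;
  scenario: string;
  passed: boolean;
  error?: string;
}

// Run every scenario against every profile, recording failures instead
// of stopping at the first one, so each deployment gets a full report.
async function runAcrossProfiles(
  profiles: string[],
  scenarios: Record<string, Scenario>,
): Promise<ScenarioResult[]> {
  const results: ScenarioResult[] = [];
  for (const profile of profiles) {
    for (const [name, run] of Object.entries(scenarios)) {
      try {
        await run(profile);
        results.push({ profile, scenario: name, passed: true });
      } catch (err) {
        const error = err instanceof Error ? err.message : String(err);
        results.push({ profile, scenario: name, passed: false, error });
      }
    }
  }
  return results;
}
```

Called with `["rootsid-dev", "rootsid-test"]`, this yields one result per profile and scenario pair, which makes it easy to report which deployment failed which test.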

2byrds commented 1 month ago

@ronakseth96 is coming up to speed well on the reg-pilot tests. Note that we can now simulate single/multi user and single/multi sig. With the latest PR from @aydarng, we need to confirm that Ronak can successfully run all the tests in preparation for testing the individual deployments.

GLEIF-IT / reg-pilot

How do we occasionally test deployed instances and report errors? #52