Utilization metrics - GitHub Issues

bprashanth commented 2 months ago

Are we running out of CPU/RAM/disk in the VM?
Put in a system of notifications that tells us if we are, so we can avoid OOM kills.
How do we do this through standard AWS infra, i.e. alerts and metrics that trigger on system resources?

anmolsingh0219 commented 2 months ago

Monitor CPU, RAM, and Disk Usage:

We can use Amazon CloudWatch to monitor the CPUUtilization metric by default. This gives us insight into CPU performance on our EC2 instances.
To extend monitoring to RAM and disk space, we can install the CloudWatch Agent on our EC2 instances. This allows us to capture metrics like memory usage (mem_used_percent) and disk usage (disk_used_percent), providing a comprehensive view of resource consumption.
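
A minimal sketch of the agent configuration this implies, built in Python so it can be generated and reviewed alongside our code. The `mem` and `disk` sections and their measurement names follow the CloudWatch Agent configuration schema as I understand it (the agent publishes them as `mem_used_percent` and `disk_used_percent`). The function name, the 60-second interval, and the `/` mount point are assumptions for illustration, not something we have decided.

```python
import json


def build_agent_config(mount_points=("/",), interval_seconds=60):
    """Return a CloudWatch Agent config dict that collects RAM and disk usage.

    mount_points and interval_seconds are illustrative defaults (assumptions).
    """
    return {
        "metrics": {
            # Tag every datapoint with the instance that produced it.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "mem": {
                    "measurement": ["mem_used_percent"],
                    "metrics_collection_interval": interval_seconds,
                },
                "disk": {
                    # Published by the agent as disk_used_percent.
                    "measurement": ["used_percent"],
                    "resources": list(mount_points),
                    "metrics_collection_interval": interval_seconds,
                },
            },
        }
    }


if __name__ == "__main__":
    # Write this JSON to the agent's config file on the instance.
    print(json.dumps(build_agent_config(), indent=2))
```

The output would be saved as the agent's JSON config on each instance; where that file lives depends on how the agent is installed.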

Set Thresholds and Alerts:

We can create CloudWatch Alarms for key metrics:

CPUUtilization to monitor CPU performance.
mem_used_percent to keep track of memory usage.
disk_used_percent to alert us when disk space is running low.
Setting thresholds (e.g., 80% utilization) ensures we are notified well before resource limits are reached.
We can configure Amazon SNS to send real-time notifications (via email or SMS) whenever these thresholds are crossed, allowing us to take action proactively before the system runs out of resources.
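
As a rough sketch of how the alarms above could be created from Python: the helper below only builds the keyword arguments for boto3's `put_metric_alarm`, and a second function sends them. The alarm names, the 5-minute period, the "3 of 3 datapoints" setting, and the function names are assumptions; `CWAgent` is the agent's default namespace. Note that `disk_used_percent` is normally published with extra dimensions (such as `path`, `device`, `fstype`), and an alarm only matches if its dimensions are exactly those of the metric, so the disk dimensions are passed in separately.

```python
def build_alarm(name, namespace, metric, dimensions, sns_topic_arn,
                threshold=80.0, period=300, evaluation_periods=3,
                datapoints_to_alarm=3):
    """Build kwargs for CloudWatch put_metric_alarm (no AWS call is made here)."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": "Average",
        "Period": period,                          # seconds per datapoint
        "EvaluationPeriods": evaluation_periods,   # look at the last N datapoints
        "DatapointsToAlarm": datapoints_to_alarm,  # alarm if M of them breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],           # SNS topic that emails/SMSes us
    }


def create_alarms(instance_id, disk_dimensions, sns_topic_arn):
    """Create CPU, memory, and disk alarms for one instance (calls AWS)."""
    import boto3  # imported lazily so build_alarm works without boto3 installed

    cloudwatch = boto3.client("cloudwatch")
    instance = {"InstanceId": instance_id}
    specs = [
        ("cpu-high", "AWS/EC2", "CPUUtilization", instance),
        ("mem-high", "CWAgent", "mem_used_percent", instance),
        # disk_dimensions must match what the agent actually publishes.
        ("disk-high", "CWAgent", "disk_used_percent", disk_dimensions),
    ]
    for suffix, namespace, metric, dims in specs:
        cloudwatch.put_metric_alarm(
            **build_alarm(f"{instance_id}-{suffix}", namespace, metric,
                          dims, sns_topic_arn)
        )
```

Raising `evaluation_periods` while keeping `datapoints_to_alarm` lower gives an "M out of N" alarm, which is one way to avoid being paged for short spikes.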

Auto-Scaling:

To prevent resource shortages during high traffic, we can implement Auto Scaling. This allows our infrastructure to automatically scale out (add instances) when CPU or RAM usage increases, and scale in (remove instances) when demand decreases. Scaling on RAM depends on the memory metric published by the CloudWatch Agent, since EC2 does not report memory by default.
With Elastic Load Balancing (ELB), we can balance incoming traffic across multiple EC2 instances, so that the load is distributed evenly and no single instance is overloaded.
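
One way the CPU-based scaling described above could look is a target tracking policy on an Auto Scaling group. The sketch below builds the arguments for boto3's `put_scaling_policy`; the group name, policy name, and the 60% target are assumptions, and it presumes an Auto Scaling group already exists, which we do not have today.

```python
def build_cpu_scaling_policy(group_name, target_cpu_percent=60.0):
    """Build kwargs for autoscaling put_scaling_policy (no AWS call is made here)."""
    return {
        "AutoScalingGroupName": group_name,
        "PolicyName": f"{group_name}-cpu-target-tracking",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            # Predefined metric: average CPU across the group's instances.
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            # Instances are added or removed to hold average CPU near this value.
            "TargetValue": float(target_cpu_percent),
        },
    }


def apply_cpu_scaling_policy(group_name):
    """Attach the policy to an existing Auto Scaling group (calls AWS)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    boto3.client("autoscaling").put_scaling_policy(
        **build_cpu_scaling_policy(group_name)
    )
```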

DNS Failover with Route 53:

We can set up Route 53 with health checks so that DNS automatically fails over if one of our EC2 instances becomes unhealthy or unavailable.
By doing this, Route 53 will reroute traffic to healthy instances, ensuring minimal downtime and maintaining service availability even in the event of failure.

anmolsingh0219 commented 2 months ago

1. CloudWatch Metrics

Free Tier:

10 custom metrics per month, with 1-minute granularity (standard resolution).

Additional Costs:

Custom Metrics: $0.30 per metric per month.
Detailed Monitoring for EC2: $0.015 per instance per hour (for 1-minute metrics instead of default 5-minute metrics).

2. CloudWatch Logs

Free Tier:

5GB of logs ingested and 5GB of logs archived per month.

Additional Costs:

Ingested Log Data: $0.50 per GB ingested.
Archived Log Data: $0.03 per GB stored per month.

3. CloudWatch Alarms

Free Tier:

10 alarms per month.

Additional Costs:

Standard Resolution Alarms: $0.10 per alarm per month.
High-Resolution Alarms (10-second or 30-second periods): $0.30 per alarm per month.

4. CloudWatch Dashboards

Free Tier:

3 dashboards with up to 50 metrics per month.

Additional Costs:

Custom Dashboards: $3.00 per dashboard per month (after the free tier limit).

5. CloudWatch Events

Free Tier:

1 million events delivered per month.

Additional Costs:

$1.00 per million events after the free tier.

Example Cost Calculation:

Assume the following usage scenario:

5 custom metrics (e.g., CPU, memory, disk).
10 alarms (e.g., high CPU or memory).
1 dashboard to visualize these metrics.
5GB of logs ingested per month.

Cost Breakdown:

Metrics: $0.30 per metric per month after the free tier (10 custom metrics).
Alarms: 10 alarms are free, additional alarms cost $0.10 each per month.
Logs: Ingesting over 5GB of logs costs $0.50 per GB.
Dashboard: 1 dashboard with fewer than 50 metrics is free.

Monthly Costs Estimate:

Based on the free tier usage:

5 custom metrics: Free.
10 alarms: Free.
5GB logs ingested: Free.
1 dashboard: Free.
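
The same estimate as a small calculation, so we can plug in other usage numbers later. The prices and free-tier limits are the ones quoted in this comment and should be rechecked against the current AWS pricing page; the dictionary keys and function name are made up for this sketch.

```python
# Prices (USD per unit per month) and free-tier allowances as quoted above.
PRICES = {
    "custom_metric": 0.30,
    "standard_alarm": 0.10,
    "log_ingest_gb": 0.50,
    "dashboard": 3.00,
}
FREE_TIER = {
    "custom_metric": 10,
    "standard_alarm": 10,
    "log_ingest_gb": 5,
    "dashboard": 3,
}


def monthly_cost(usage):
    """Return the estimated monthly cost in USD for usage beyond the free tier."""
    total = 0.0
    for item, quantity in usage.items():
        billable = max(0, quantity - FREE_TIER[item])
        total += billable * PRICES[item]
    return round(total, 2)


SCENARIO = {
    "custom_metric": 5,
    "standard_alarm": 10,
    "log_ingest_gb": 5,
    "dashboard": 1,
}

if __name__ == "__main__":
    print(monthly_cost(SCENARIO))  # 0.0, everything fits in the free tier
```

With these numbers, the first cost appears at the 11th custom metric or 11th alarm, or once log ingestion passes 5GB.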

bprashanth commented 2 months ago

From on-call discussions:

Resource/utilization metrics:
- implement metrics/alerts for disk and ram
- look up best practices for cpu metrics/alerting: we don't want very frequent cpu alerts, we want to know if cpu starvation is happening on a regular basis so we can bump up the cpu
Custom metrics:
- check if we can record these via aws free tier
- if yes, what client libs can we use in our python code to send the timer stats to aws
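
On the client library question: boto3 is the AWS SDK for Python, and its CloudWatch client has `put_metric_data`. A hedged sketch of sending timer stats with it is below. The `JaltolAPI` namespace, the metric and dimension names, and the helper names are all placeholders. Each distinct metric name and dimension combination counts as one custom metric, so the number of dimension values affects whether we stay inside the free tier.

```python
import time
from contextlib import contextmanager


def build_timer_datum(name, seconds, dimensions=None):
    """Convert an elapsed time in seconds into a put_metric_data entry."""
    return {
        "MetricName": name,
        "Dimensions": [
            {"Name": k, "Value": v} for k, v in (dimensions or {}).items()
        ],
        "Value": seconds * 1000.0,
        "Unit": "Milliseconds",
    }


@contextmanager
def timed(name, publish, dimensions=None):
    """Time the enclosed block and hand the resulting datum to publish()."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        publish(build_timer_datum(name, elapsed, dimensions))


def cloudwatch_publisher(namespace="JaltolAPI"):
    """Return a publish() function that sends one datum to CloudWatch."""
    import boto3  # imported lazily so the timer works without boto3 installed

    client = boto3.client("cloudwatch")

    def publish(datum):
        client.put_metric_data(Namespace=namespace, MetricData=[datum])

    return publish
```

Usage would look like `with timed("query_time", cloudwatch_publisher()): run_query()`, where `run_query` stands in for whatever we want to time. Publishing synchronously on every request adds an API call to the request path, so batching several datapoints per call is probably worth considering.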

anmolsingh0219 commented 2 months ago

@anmolsingh0219 @bprashanth to research setting up custom metrics on AWS for RAM and disk.

bprashanth commented 2 months ago

We want the same metrics we have for CPU (an alert if it crosses the 80% threshold) for RAM and disk.

There should be a pre-canned way to achieve this. If not, we'll just have to use custom metrics.

We also want a read latency RDS metric. We're assuming if the read latency spikes we can increase the RDS instance size to bring it back down. We don't think write latency matters much since our users will not be writing to RDS (most writes are when we ingest data).
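
For the RDS part, `ReadLatency` is a standard metric in the `AWS/RDS` namespace, reported in seconds per DB instance, so an alarm on it needs no agent. A sketch of the arguments for boto3's `put_metric_alarm` follows; the 20 ms threshold, the alarm name, and the evaluation settings are placeholders until we know our normal latency.

```python
def build_rds_read_latency_alarm(db_instance_id, sns_topic_arn,
                                 threshold_seconds=0.02):
    """Build kwargs for put_metric_alarm on RDS ReadLatency (no AWS call here)."""
    return {
        "AlarmName": f"{db_instance_id}-read-latency-high",
        "Namespace": "AWS/RDS",
        "MetricName": "ReadLatency",  # reported in seconds
        "Dimensions": [
            {"Name": "DBInstanceIdentifier", "Value": db_instance_id}
        ],
        "Statistic": "Average",
        "Period": 300,            # 5-minute datapoints
        "EvaluationPeriods": 3,   # sustained for 15 minutes, not a single spike
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_rds_read_latency_alarm(db_instance_id, sns_topic_arn):
    """Create the alarm in CloudWatch (calls AWS)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    boto3.client("cloudwatch").put_metric_alarm(
        **build_rds_read_latency_alarm(db_instance_id, sns_topic_arn)
    )
```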

WELLlabs / JaltolAPI

Utilization metrics #22
