OpenHistoricalMap / issues


Grafana Dashboard for monitoring resources and services #861

Rub21 opened this issue 3 months ago

Rub21 commented 3 months ago

Due to the recent increase in resource usage in https://github.com/OpenHistoricalMap/issues/issues/783 and https://github.com/OpenHistoricalMap/issues/issues/850, it would be convenient to monitor our resources and make this visible to everyone. As OSM does (https://github.com/OpenHistoricalMap/issues/issues/783#issuecomment-2214573538), I think it would be reasonable to add a Grafana dashboard backed by Prometheus so we can see how many resources each service is using. Currently we have Prometheus and some exporters in the staging and production clusters, but access to these statistics is still restricted. I could open up access to them.
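As a rough illustration of the kind of numbers such a dashboard would surface, the sketch below pulls per-node CPU and memory utilization from the Prometheus HTTP API. The endpoint URL is a placeholder, and it assumes node_exporter is among the exporters already deployed; neither is confirmed in this thread.

```python
# Minimal sketch: per-node CPU/memory utilization via the Prometheus HTTP API.
# PROM_URL is hypothetical; adjust once access to the statistics is opened.
import requests

PROM_URL = "https://prometheus.example.org"  # hypothetical endpoint

QUERIES = {
    # % CPU in use, averaged per node over the last 5 minutes
    "cpu_percent": '100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100',
    # % memory in use per node
    "mem_percent": '100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)',
}

for name, promql in QUERIES.items():
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    for sample in resp.json()["data"]["result"]:
        instance = sample["metric"].get("instance", "unknown")
        print(f"{name} {instance}: {float(sample['value'][1]):.1f}%")
```

The same PromQL expressions can be dropped directly into Grafana panels, which is essentially what the proposed dashboard would do.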

cc. @batpad @jeffreyameyer @danrademacher

danrademacher commented 1 month ago

@Rub21 do you have a sense of costs involved here?

I see they have a free tier, but I am not sure if we fit there: https://grafana.com/pricing/

danrademacher commented 1 month ago

If we fit in the free tier, then this is an easy yes. If it costs, we need to confer with Jeff.

Rub21 commented 1 month ago

If we fit in the free tier, then this is an easy yes. If it costs, we need to confer with Jeff.

This is free; it will just consume some resources in our own infrastructure, but it will help us see how much we are actually using.

Results in staging: https://monitoring.staging.openhistoricalmap.org

I've been using Prometheus and Grafana to monitor server performance, specifically tracking CPU and memory usage along with the resource consumption of the pods each node hosts. In the staging environment, I checked whether we have sufficient resources or more than we need.

[Screenshot, 2024-10-16: staging node and pod resource usage in Grafana]
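For reference, the pod-level numbers behind a panel like the one above can come from queries along these lines, assuming the cluster exposes the usual cAdvisor/kubelet metrics that kube-prometheus setups scrape; the endpoint is again a placeholder.

```python
# Sketch: CPU and memory per pod, from cAdvisor metrics scraped by Prometheus.
import requests

PROM_URL = "https://prometheus.example.org"  # hypothetical endpoint

queries = {
    # CPU cores consumed per pod (5-minute average)
    "pod_cpu_cores": 'sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))',
    # working-set memory per pod, in bytes
    "pod_mem_bytes": 'sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})',
}

for name, promql in queries.items():
    data = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10).json()
    for sample in data["data"]["result"]:
        m = sample["metric"]
        print(name, m.get("namespace"), m.get("pod"), sample["value"][1])
```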

I've adjusted the node resources in staging and reduced the cluster to two large machines, but it still looks like we have plenty of spare capacity.

[screenshot: staging node usage after the first resize]

So, I ended up with a large machine and a medium machine.

[screenshots: final staging node sizes and usage]

The dashboard is very helpful: it shows how much of our infrastructure our services are actually using, and where we can reduce node sizes to save costs.

We also have a dashboard for production at https://monitoring.openhistoricalmap.org/. Before reducing production resources, we might need to discuss it further.

Savings in staging for this month: I’ve already reduced the size of the machines. Running two m5.xlarge instances costs approximately $370.84 per month, while using one m5.large ($70.08) and one m1.medium ($33.60) totals $103.68 per month. This means we’d save approximately $267.16 per month compared to using two m5.xlarge instances.
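A quick arithmetic check of those figures (prices taken directly from this comment, not re-derived from AWS pricing, so they may vary by region or pricing model):

```python
# Sanity check of the monthly cost comparison quoted above.
two_m5_xlarge = 370.84        # two m5.xlarge instances, per month
new_setup = 70.08 + 33.60     # one m5.large + one m1.medium, per month

print(f"new setup: ${new_setup:.2f}/month")                    # 103.68
print(f"savings:   ${two_m5_xlarge - new_setup:.2f}/month")    # 267.16
```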

cc. @danrademacher @batpad @jeffreyameyer

Rub21 commented 3 weeks ago

Current status of the production environment, using Grafana

I have also set up the Prometheus and Grafana dashboard for production. It gives a better overview of how the nodes have used resources over the last 10 days and shows how much of that capacity we are actually utilizing.

The web API database currently runs alone on its node, but it doesn't seem to be consuming many resources. Last time we upgraded the node size due to an issue with cgimap (link), but the node has been using less than 10% of its capacity over the last 12 days. My recommendation is to reduce the node size by half.

[screenshot: web API database node usage]
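Before halving the node, the "less than 10% over the last 12 days" observation could be double-checked directly against Prometheus rather than a single dashboard glance. This is only a sketch: the endpoint and the instance label are placeholders for however the web API database node is actually labelled, and it assumes Prometheus retention covers the full 12-day window.

```python
# Sketch: average and peak CPU use on the API DB node over the last 12 days.
import requests

PROM_URL = "https://prometheus.example.org"   # hypothetical endpoint
NODE = "api-db-node:9100"                     # hypothetical node_exporter instance label

queries = {
    # mean CPU utilization over the whole 12-day window
    "avg_cpu_12d": f'100 - avg(rate(node_cpu_seconds_total{{mode="idle",instance="{NODE}"}}[12d])) * 100',
    # worst hourly CPU utilization in that window (Prometheus subquery)
    "peak_cpu_12d": f'max_over_time((100 - avg(rate(node_cpu_seconds_total{{mode="idle",instance="{NODE}"}}[1h])) * 100)[12d:1h])',
}

for name, promql in queries.items():
    r = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=30)
    print(name, r.json()["data"]["result"][0]["value"][1])
```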

For the tiler database we are running fine: the node size is more than enough. There are some peaks where it uses 90% of the CPU and RAM, but overall it looks fine.

[screenshot: tiler database node usage]

For the web container we are also running fine, but moving it to smaller nodes and leveraging the autoscaling we implemented in the past could help reduce node costs.

[screenshot: web container node usage]
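The thread doesn't say whether the earlier autoscaling work was pod- or node-level; purely as an illustration of the pod-level case, a Horizontal Pod Autoscaler for the web workload would look roughly like this. The deployment name, namespace, replica bounds, and CPU target are placeholders, not values from this cluster.

```python
# Illustrative only: shape of an autoscaling/v2 HPA for the web container,
# applied with the official kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside the cluster

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "web", "namespace": "default"},          # placeholder names
    "spec": {
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "web"},
        "minReplicas": 2,                                         # placeholder bounds
        "maxReplicas": 6,
        "metrics": [{
            "type": "Resource",
            "resource": {"name": "cpu", "target": {"type": "Utilization", "averageUtilization": 70}},
        }],
    },
}

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(namespace="default", body=hpa)
```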

cc. @batpad @danrademacher @jeffreyameyer

danrademacher commented 1 week ago

Based on discussion, we should reduce the web API database node to 4 CPUs and 16 GB RAM.

Rub21 commented 3 days ago

The node for the API DB has been updated; we are now using an xlarge machine!