2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

Investigate explosion of disk usage of prometheus #2930

Closed yuvipanda closed 1 year ago

yuvipanda commented 1 year ago
[screenshot: disk usage graph]

You can see that disk usage exploded around June 28. That coincides with https://github.com/2i2c-org/infrastructure/pull/2730, a prometheus version bump.

yuvipanda commented 1 year ago

No corresponding explosion in time series

[screenshot: prometheus_tsdb_head_series graph]

Using prometheus_tsdb_head_series
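
For reference, the head series count can also be checked directly against the Prometheus HTTP API from inside the server pod; this is a sketch using busybox wget (the port and API path are standard Prometheus, the rest is just an example):

$ wget -qO- 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_series'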

yuvipanda commented 1 year ago

The second inflection point was around when https://github.com/2i2c-org/infrastructure/pull/2621 was merged.

yuvipanda commented 1 year ago

Looks like this is actually the case for a lot of our prometheii

yuvipanda commented 1 year ago

https://github.com/2i2c-org/infrastructure/pull/2932 bumps everything up to a minimum of 200Gi.

yuvipanda commented 1 year ago

I made a simple prometheus dashboard showing remaining disk space in all our prometheuses: https://grafana.pilot.2i2c.cloud/d/d560fa89-38bc-4253-964e-82573ea91d57/global-prometheus-health?orgId=1&var-cluster=All

[screenshot: global prometheus health dashboard]
yuvipanda commented 1 year ago
[screenshot: disk space across prometheuses]

ok, most prometheuses have more disk space now! But this isn't a long term solution yet.

consideRatio commented 1 year ago

These relevant changes were made June 27-29.

I wonder if it was the prometheus version bump or a config change.

yuvipanda commented 1 year ago

https://github.com/2i2c-org/infrastructure/pull/2730 is also relevant, as it added long-running time series metrics (I think about 5 per user on each hub).

yuvipanda commented 1 year ago

https://last9.io/blog/how-to-manage-high-cardinality-metrics-in-prometheus/ has useful info on how to debug prometheus series counters. However, trying to use grafana for this was causing a lot of timeouts, so we need to get https://github.com/2i2c-org/infrastructure/issues/2934 fixed first.

yuvipanda commented 1 year ago

Ok, now by directly accessing the prometheus server, I can look at series that have exploding cardinality! Am looking at the openscapes hub.

topk(10, count by (__name__) ({__name__=~".+"})) shows:

| metric | count |
| ------ | ----- |
| apiserver_request_duration_seconds_bucket{} | 11792 |
| apiserver_request_slo_duration_seconds_bucket{} | 8712 |
| etcd_request_duration_seconds_bucket{} | 5028 |
| apiserver_response_sizes_bucket{} | 2544 |
| apiserver_storage_list_duration_seconds_bucket{} | 1596 |
| jupyterhub_request_duration_seconds_bucket{} | 1365 |
| apiserver_watch_events_sizes_bucket{} | 1170 |
| storage_operation_duration_seconds_bucket{} | 952 |
| container_tasks_state{} | 935 |
| kubelet_runtime_operations_duration_seconds_bucket{} | 840 |
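
For anyone trying to reproduce this: the query above can also be run from a laptop by port-forwarding to the prometheus server. A sketch; the namespace and service name below are guesses, not necessarily what our support chart actually uses:

$ kubectl -n support port-forward svc/support-prometheus-server 9090:80 &
$ curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=topk(10, count by (__name__) ({__name__=~".+"}))'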
yuvipanda commented 1 year ago

Looks like https://github.com/kubernetes/kubernetes/issues/105346 is part of the issue here.

yuvipanda commented 1 year ago

http://localhost:9090/tsdb-status also produces very helpful information.
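
That page is backed by the API endpoint below, which is handy when all you have in the container is wget:

$ wget -qO- http://localhost:9090/api/v1/status/tsdb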

yuvipanda commented 1 year ago

I wanted to understand what was using the most disk space. Unfortunately there's no easy mapping from series name to disk space.

I currently have two hypotheses:

  1. apiserver_request_duration_seconds_bucket is the source of the problem, because of https://github.com/kubernetes/kubernetes/issues/105346 and the fact that it has so many series.
  2. https://github.com/2i2c-org/infrastructure/pull/2621 is the source of our problem.
yuvipanda commented 1 year ago

Am following https://faun.pub/how-to-drop-and-delete-metrics-in-prometheus-7f5e6911fb33 to gather some more information.

First, here's the output of promtool tsdb analyze /data/, showing which metrics have the highest cardinality:

/prometheus $ promtool tsdb analyze /data/
Block ID: 01H78RMRJ355HMXCDN7B6V07WZ
Duration: 4m58.817s
Series: 79527
Label names: 297
Postings (unique label pairs): 5447
Postings entries (total label pairs): 1021265

Label pairs most involved in churning:
8951 job=kubernetes-apiservers
5103 component=apiserver
4808 instance=192.168.124.252:443
4608 version=v1
4142 instance=192.168.155.110:443
3048 topology_ebs_csi_aws_com_zone=us-west-2b
3048 k8s_io_cloud_provider_aws=4d6fc0cdc07fcc46278e3eae48a92044
3048 kubernetes_io_arch=amd64
3048 beta_kubernetes_io_arch=amd64
3048 failure_domain_beta_kubernetes_io_zone=us-west-2b
3048 kubernetes_io_os=linux
3048 failure_domain_beta_kubernetes_io_region=us-west-2
3048 topology_kubernetes_io_region=us-west-2
3048 alpha_eksctl_io_cluster_name=openscapeshub
3048 beta_kubernetes_io_os=linux
3048 node_lifecycle=on-demand
3048 topology_kubernetes_io_zone=us-west-2b
2930 job=kubernetes-service-endpoints
2874 app_kubernetes_io_managed_by=Helm
2848 node=ip-192-168-61-164.us-west-2.compute.internal

Label names most involved in churning:
15778 job
15778 __name__
15778 instance
8610 le
6152 component
5753 resource
5497 verb
5424 version
5286 namespace
5256 scope
3782 group
3553 node
3048 node_lifecycle
3048 alpha_eksctl_io_instance_id
3048 alpha_eksctl_io_cluster_name
3048 hub_jupyter_org_node_purpose
3048 kubernetes_io_os
3048 beta_kubernetes_io_arch
3048 topology_kubernetes_io_region
3048 topology_kubernetes_io_zone

Most common label pairs:
45479 job=kubernetes-apiservers
25927 component=apiserver
24429 instance=192.168.124.252:443
23412 version=v1
21050 instance=192.168.155.110:443
15414 kubernetes_io_arch=amd64
15414 failure_domain_beta_kubernetes_io_region=us-west-2
15414 kubernetes_io_os=linux
15414 beta_kubernetes_io_os=linux
15414 failure_domain_beta_kubernetes_io_zone=us-west-2b
15414 beta_kubernetes_io_arch=amd64
15414 topology_ebs_csi_aws_com_zone=us-west-2b
15414 alpha_eksctl_io_cluster_name=openscapeshub
15414 node_lifecycle=on-demand
15414 topology_kubernetes_io_zone=us-west-2b
15414 k8s_io_cloud_provider_aws=4d6fc0cdc07fcc46278e3eae48a92044
15414 topology_kubernetes_io_region=us-west-2
14720 job=kubernetes-service-endpoints
14435 app_kubernetes_io_managed_by=Helm
14423 node=ip-192-168-61-164.us-west-2.compute.internal

Label names with highest cumulative label value length:
43009 __name__
28847 id
22476 mountpoint
10136 name
6776 container_id
5211 directory
4210 type
4196 replicaset
3918 image_id
3718 resource
3561 secret
2700 uid
2098 image
1881 image_spec
1629 le
1506 pod
1489 handler
1352 owner_name
1337 device
1324 resource_prefix

Highest cardinality labels:
1207 __name__
546 directory
290 name
238 le
213 mountpoint
181 id
176 type
175 resource
145 replicaset
120 secret
88 container_id
81 device
76 method
75 uid
66 kind
65 cpu
64 resource_prefix
53 owner_name
52 address
52 pod

Highest cardinality metric names:
11792 apiserver_request_duration_seconds_bucket
8712 apiserver_request_slo_duration_seconds_bucket
5028 etcd_request_duration_seconds_bucket
2544 apiserver_response_sizes_bucket
1596 apiserver_storage_list_duration_seconds_bucket
1365 jupyterhub_request_duration_seconds_bucket
1170 apiserver_watch_events_sizes_bucket
952 storage_operation_duration_seconds_bucket
950 container_tasks_state
840 kubelet_runtime_operations_duration_seconds_bucket
760 container_memory_failures_total
714 apiserver_admission_controller_admission_duration_seconds_bucket
678 container_blkio_device_usage_total
672 apiserver_flowcontrol_priority_level_request_count_watermarks_bucket
580 apiserver_request_total
576 node_cpu_seconds_total
566 dirsize_latest_mtime
566 dirsize_last_updated_ns
566 dirsize_processing_time
566 dirsize_total_size_bytes

There's a lot of evidence for apiserver_request_duration_seconds_bucket being the problem here.

So, df -h shows the disk usage to be:

/dev/nvme1n1            196.7G    105.3G     91.3G  54% /data

I'm now going to delete that metric, run a compaction and see.

yuvipanda commented 1 year ago

I'm exec'd into the prometheus server container. There's no curl, so I'm running the following command with busybox's wget (the TSDB admin API endpoints used below are only available because the server is started with --web.enable-admin-api):

$ wget --post-data '' -S 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=apiserver_request_duration_seconds_bucket'

This leads to pod restarts, usually with liveness probe failures (as seen via kubectl describe pod):

  Warning  Unhealthy  51s                 kubelet            Liveness probe failed: Get "http://192.168.42.178:9090/-/healthy": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
yuvipanda commented 1 year ago

I removed all the probes by editing the deployment and setting them to null, to no effect; I think it just runs out of RAM. Let me bump up the memory limits as well.
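
For the record, this is roughly the equivalent with a patch instead of an interactive edit (a sketch; the namespace and deployment name are guesses):

$ kubectl -n support patch deployment support-prometheus-server --type=json -p='[
    {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"},
    {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}
  ]'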

yuvipanda commented 1 year ago

Temporarily bumping memory limit to 16G (from 8G) to see how this goes. I primarily wanna see if deleting this time series brings down disk usage.
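
As a one-off, that's something like the below (a sketch; in practice the limit should go through our helm chart values, and the deployment name is again a guess):

$ kubectl -n support set resources deployment/support-prometheus-server --limits=memory=16Gi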

yuvipanda commented 1 year ago

ok, that succeeded!

Connecting to localhost:9090 (127.0.0.1:9090)
  HTTP/1.1 204 No Content
  Date: Mon, 07 Aug 2023 20:33:40 GMT
  Connection: close

Now running compaction, with:

$ wget --post-data '' -S  http://localhost:9090/api/v1/admin/tsdb/clean_tombstones

btw, I have another terminal open exec'd into the same container, with top running to monitor memory & cpu usage.

yuvipanda commented 1 year ago

The first clean_tombstones API call failed with a wget timeout, and the df -h output was:

/dev/nvme1n1            196.7G    101.3G     95.3G  52% /data

So about a 4G drop. I'm running it again, let's see.

yuvipanda commented 1 year ago

Interestingly, I don't see CPU usage go over 25%. iotop isn't present on the container image, so I can't tell if it's bottlenecked on IO, but I wouldn't be surprised.

yuvipanda commented 1 year ago

Based on http://tinycorelinux.net/12.x/armv6/releases/RPi/src/busybox/busybox-1.27.1-wget-make-default-timeout-configurable.patch, the default wget timeout is 900s, and that's where the timeout is coming from. Cleaning tombstones is going to take more time than that, so am increasing it by passing -T 9000. It's still running.
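
i.e. the call is now:

$ wget -T 9000 --post-data '' -S 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'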

yuvipanda commented 1 year ago

ok, so clean-tombstones completed, and this is the size on disk now:

/dev/nvme1n1            196.7G     93.3G    103.3G  47% /data

That's about an 8G reduction, which is a far cry from the close to 60GB that have been used up since June, so I don't think this is the cause.

However, the output of promtool tsdb analyze is not very different:

/prometheus $ promtool tsdb analyze  /data/
Block ID: 01H7942W0GW468MZCC40T5GXNH
Duration: 4m59.828s
Series: 79930
Label names: 298
Postings (unique label pairs): 5486
Postings entries (total label pairs): 1029416

Label pairs most involved in churning:
9085 job=kubernetes-apiservers
5184 component=apiserver
4874 instance=192.168.124.252:443
4682 version=v1
4210 instance=192.168.155.110:443
3073 topology_ebs_csi_aws_com_zone=us-west-2b
3073 kubernetes_io_os=linux
3073 node_lifecycle=on-demand
3073 alpha_eksctl_io_cluster_name=openscapeshub
3073 topology_kubernetes_io_zone=us-west-2b
3073 beta_kubernetes_io_arch=amd64
3073 failure_domain_beta_kubernetes_io_region=us-west-2
3073 topology_kubernetes_io_region=us-west-2
3073 failure_domain_beta_kubernetes_io_zone=us-west-2b
3073 k8s_io_cloud_provider_aws=4d6fc0cdc07fcc46278e3eae48a92044
3073 beta_kubernetes_io_os=linux
3073 kubernetes_io_arch=amd64
2969 job=kubernetes-service-endpoints
2912 app_kubernetes_io_managed_by=Helm
2899 node=ip-192-168-61-164.us-west-2.compute.internal

Label names most involved in churning:
15919 instance
15919 job
15919 __name__
8689 le
6255 component
5843 resource
5584 verb
5509 version
5346 scope
5330 namespace
3835 group
3601 node
3073 topology_ebs_csi_aws_com_zone
3073 beta_kubernetes_io_arch
3073 failure_domain_beta_kubernetes_io_region
3073 node_kubernetes_io_instance_type
3073 hub_jupyter_org_node_purpose
3073 alpha_eksctl_io_cluster_name
3073 kubernetes_io_arch
3073 node_lifecycle

Most common label pairs:
45533 job=kubernetes-apiservers
25980 component=apiserver
24430 instance=192.168.124.252:443
23464 version=v1
21103 instance=192.168.155.110:443
15599 node_lifecycle=on-demand
15599 beta_kubernetes_io_os=linux
15599 failure_domain_beta_kubernetes_io_zone=us-west-2b
15599 topology_kubernetes_io_region=us-west-2
15599 beta_kubernetes_io_arch=amd64
15599 topology_kubernetes_io_zone=us-west-2b
15599 failure_domain_beta_kubernetes_io_region=us-west-2
15599 k8s_io_cloud_provider_aws=4d6fc0cdc07fcc46278e3eae48a92044
15599 kubernetes_io_arch=amd64
15599 topology_ebs_csi_aws_com_zone=us-west-2b
15599 kubernetes_io_os=linux
15599 alpha_eksctl_io_cluster_name=openscapeshub
14882 job=kubernetes-service-endpoints
14597 app_kubernetes_io_managed_by=Helm
14528 node=ip-192-168-61-164.us-west-2.compute.internal

Label names with highest cumulative label value length:
43081 __name__
29334 id
23708 mountpoint
10264 name
6930 container_id
5211 directory
4268 replicaset
4210 type
4017 image_id
3718 resource
3561 secret
2736 uid
2136 image
1909 image_spec
1629 le
1528 handler
1520 pod
1403 device
1352 owner_name
1324 resource_prefix

Highest cardinality labels:
1209 __name__
546 directory
292 name
238 le
225 mountpoint
184 id
176 type
175 resource
147 replicaset
120 secret
90 container_id
83 device
76 uid
76 method
66 kind
65 cpu
64 resource_prefix
53 address
53 owner_name
53 pod

Highest cardinality metric names:
11814 apiserver_request_duration_seconds_bucket
8734 apiserver_request_slo_duration_seconds_bucket
5028 etcd_request_duration_seconds_bucket
2544 apiserver_response_sizes_bucket
1596 apiserver_storage_list_duration_seconds_bucket
1395 jupyterhub_request_duration_seconds_bucket
1170 apiserver_watch_events_sizes_bucket
965 container_tasks_state
952 storage_operation_duration_seconds_bucket
840 kubelet_runtime_operations_duration_seconds_bucket
772 container_memory_failures_total
714 apiserver_admission_controller_admission_duration_seconds_bucket
702 container_blkio_device_usage_total
672 apiserver_flowcontrol_priority_level_request_count_watermarks_bucket
583 apiserver_request_total
576 node_cpu_seconds_total
566 dirsize_processing_time
566 dirsize_entries_count
566 dirsize_last_updated_ns
566 dirsize_latest_mtime

So am not entirely sure what to make of this now.

yuvipanda commented 1 year ago

Exploring hypothesis 2: given that the total number of series stored did not explode (see https://github.com/2i2c-org/infrastructure/issues/2930#issuecomment-1666583674), it stands to reason that perhaps the number of bytes required to store particular series did. Given that the prometheus dirsize exporter was added right around the time the disk space problems started, I've implemented https://github.com/yuvipanda/prometheus-dirsize-exporter/pull/3 and will attempt a deployment of that shortly. The hypothesis here is that because the time series in question gets a new value at almost every scrape (it's 'amount of time it took to process this directory', which is essentially noise), compression is extremely poor. It's almost random data, which will compress very badly. It's also not a useful metric.

I'm going to stop providing that metric in the scrapes, then delete it, and see what happens to the disk space.
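
The delete itself will be the same kind of admin API call as before, just matched against the dirsize series instead; a sketch, using one of the metric names from the promtool output above:

$ wget -T 9000 --post-data '' -S 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=dirsize_processing_time'
$ wget -T 9000 --post-data '' -S 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'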

yuvipanda commented 1 year ago

Hmm, after this round of cleanup, I see:

/dev/nvme1n1            196.7G     93.6G    103.0G  48% /data

So that's not a lot. I'm not entirely sure now what caused the explosion.

consideRatio commented 1 year ago

@yuvipanda amazing work doing things and documenting what you do!!!

I've got nothing conclusive, but I've dumped some thoughts below; feel free to just skim, as I have no clear action point or similar.

About changing min/max-block-size

Have you ruled out #2731?

Can we make this number even smaller? 10min?

I mean, yes, maybe, but I'm not able to say whether it's okay at 1m, 10m, 30m, or 2h (the default for the max value, I think). I have no real idea about the implications of changing this, but from reading community issues I've understood it's a property the prometheus maintainers prefer users don't fiddle with much.

Looking back at this, I saw from https://www.percona.com/blog/prometheus-2-times-series-storage-performance-analyses/ the following claim without reference:

> While the configuration knob exists to change the head block size, tuning this by users is discouraged. So you’re limited to providing Prometheus 2 with as much memory as it needs for your workload.

Overall, I'm just worried that we have entered unknown territory by tuning this.
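
For context, the knobs in question map to these hidden upstream flags (the values below are only placeholders, not a recommendation):

$ prometheus \
    --storage.tsdb.min-block-duration=2h \
    --storage.tsdb.max-block-duration=2h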

About clearing WAL

We are clearing the WAL via #2706, and we did #2731 to avoid issues from doing that.

# Prometheus sometimes seems to run into bugs where the 'write ahead log'
# used for recovery from crashes and network failures gets massive

I read that the WAL is supposedly flushed every 10 seconds, which would imply to me that it's not important to have it around for so long. Based on https://github.com/prometheus/prometheus/issues/7955, it seems the issue may be that the WAL isn't being cleaned up.

Is there a setting to help clean up irrelevant WAL segments without clearing all of the WAL?
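
The closest upstream knobs I can find are the retention limits and WAL compression, though I'm not sure any of them address this; a sketch, with placeholder values:

$ prometheus \
    --storage.tsdb.retention.size=180GB \
    --storage.tsdb.wal-compression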

Understanding WAL's role

I found https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/ about WAL stuff, so I'm digging into it.

yuvipanda commented 1 year ago

Copying https://github.com/2i2c-org/infrastructure/pull/2967#issuecomment-1673737352 here:

I looked into adjusting the scrape interval for certain metrics - specifically for the dirsize metrics, which only update every 2 hours anyhow. If we scraped those on an hourly basis instead, it would be fine to keep fine-grained details about individual users, etc.

I've instead just temporarily disabled them completely in https://github.com/2i2c-org/infrastructure/pull/2968 to see if that's actually the cause.

yuvipanda commented 1 year ago

A bunch of prometheuses were down again. I increased the default disk space, and also the disk space for the 2i2c shared cluster, and they're back up again now.

ls config/clusters | xargs -I {} -L1 deployer use-cluster-credentials {} 'kubectl -n support get pod -l app.kubernetes.io/component=server'

This lets me validate that they're all running.

yuvipanda commented 1 year ago

This was basically solved by https://github.com/2i2c-org/infrastructure/pull/3093!!!!

[screenshot: disk usage graph]