DevOps-Nirvana / Kubernetes-Volume-Autoscaler

Autoscaling volumes for Kubernetes (with the help of Prometheus)
Apache License 2.0

Customer-reported issue: Is not detecting updated/resized max size #1

Closed AndrewFarley closed 2 years ago

AndrewFarley commented 2 years ago

There appears to be a bug in Prometheus Server that causes the kubelet_volume_stats_capacity_bytes metric to not be updated properly in Prometheus after a resize. Note: may need to go file a bug against metrics-server or Prometheus.

After further investigation, it appears the Prometheus metric kube_persistentvolume_capacity_bytes, which is tied to the PV and not the PVC, is fully updated after a resize. We could (in theory) look there for the updated value instead, but I believe this to be a bug that should be fixed in Prometheus.
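For anyone wanting to check this discrepancy themselves, here is a minimal sketch that compares the two metrics via Prometheus's HTTP API. The Prometheus URL and the PVC/PV names are placeholders; adjust them for your cluster:

```python
import requests

# Placeholder URL; point this at your own Prometheus service.
PROMETHEUS_URL = "http://prometheus-server.monitoring.svc:9090"

def instant_query(promql: str) -> list:
    """Run an instant query against the Prometheus HTTP API and return the result vector."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Capacity as reported per-PVC by the kubelet (the metric that appeared stale here).
pvc_capacity = instant_query(
    'kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="my-pvc", namespace="default"}'
)

# Capacity as reported per-PV by kube-state-metrics (updated correctly after the resize).
pv_capacity = instant_query(
    'kube_persistentvolume_capacity_bytes{persistentvolume="pvc-1234-abcd"}'
)

print("PVC-level capacity:", [s["value"][1] for s in pvc_capacity])
print("PV-level capacity: ", [s["value"][1] for s in pv_capacity])
```

If the two numbers disagree after a resize, you are seeing the same behavior described above.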

[Screenshot: Prometheus graph of the volume capacity metrics (2022-03-07)]

AndrewFarley commented 2 years ago

UPDATE: There was an issue with their Kubernetes cluster causing this; it wasn't a Prometheus issue or an issue in this codebase.

virtualb0x commented 2 years ago

@AndrewFarley Would you be so kind as to explain what the exact issue in the k8s cluster was? I have the same issue after resizing a PVC, and restarting the kubelet didn't help.

AndrewFarley commented 2 years ago

@virtualb0x Yeah, so I figured out this customer was using an older version of the AWS EBS CSI Controller, and it hit an edge case where it sent (or thought it sent) AWS the command to resize the disk upwards. However, either AWS never received this command, or it was unable to fulfill the request properly, leaving the volume in a bunk state. It was weird: the graph I linked to showed that the request was fulfilled and the disk resized, but there was an error (I was unable to find exactly which/where this came from) in some underlying AWS component, mentioned in the AWS Container Roadmap, where AWS was partially mis-reporting that it had resized the disk when it hadn't actually.
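If you suspect the same edge case, you can ask AWS directly whether the resize request was ever received and whether it completed. A sketch using boto3 (the region and volume ID are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # adjust region as needed

# Ask AWS for the modification history of the EBS volume backing the PV.
# If the resize request was received, there will be an entry here; its
# ModificationState tells you whether it completed ("completed"), is still
# underway ("modifying"/"optimizing"), or failed ("failed").
response = ec2.describe_volumes_modifications(
    VolumeIds=["vol-0123456789abcdef0"]  # placeholder volume ID
)

for mod in response["VolumesModifications"]:
    print(mod["VolumeId"], mod["ModificationState"],
          "original:", mod.get("OriginalSize"), "GiB",
          "target:", mod.get("TargetSize"), "GiB")
```

No entry at all means AWS never received the resize request; a "failed" or stuck state means it received the request but couldn't fulfill it, which matches what this customer hit.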

To fix this problem for this customer I did a few things...

That seemed to solve things. Sorry I didn't clarify that earlier. Hope this helps you, and anyone else who might run into this! If you weren't using AWS as your provider, I think a similar set of steps would still help: some combination of upgrading EKS (or your managed Kubernetes), upgrading your storage driver, stopping the pod using the volume, resizing it manually, and then starting the pod back up.
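For the "resize it manually" step, here is a minimal sketch using the official Kubernetes Python client to patch the PVC's requested storage directly. The PVC name, namespace, and size are placeholders, and this only works if the StorageClass has allowVolumeExpansion: true:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# Raise the PVC's requested size; the CSI driver then resizes the backing disk.
patch = {"spec": {"resources": {"requests": {"storage": "100Gi"}}}}  # placeholder size
v1.patch_namespaced_persistent_volume_claim(
    name="my-pvc",        # placeholder PVC name
    namespace="default",  # placeholder namespace
    body=patch,
)

# Re-read the PVC and check .status.capacity to confirm the resize landed.
pvc = v1.read_namespaced_persistent_volume_claim(name="my-pvc", namespace="default")
print(pvc.status.capacity)
```

If .status.capacity never catches up to the new request, check the PVC's status conditions and the CSI controller logs; that is where this customer's stuck state showed up.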