Closed: AndrewFarley closed this issue 2 years ago.
UPDATE: There was an issue with their Kubernetes cluster causing this; it wasn't a Prometheus problem or an issue in this codebase.
@AndrewFarley Would you be so kind as to explain what the exact issue in the k8s cluster was? I have the same issue after resizing a PVC, and restarting the kubelet didn't help.
@virtualb0x Yeah, so I figured out this customer was using an older version of the AWS EBS CSI Controller, and it hit an edge case where it sent (or thought it sent) AWS the command to resize the disk upwards. However, either AWS never received that command, or it was unable to fulfill the request properly, leaving the volume in a broken state. The odd part was that the graph I linked to showed the request had been fulfilled and the disk resized, but there was an error (I can't find exactly which/where this is from, somewhere in the AWS Container Roadmap) in some underlying AWS tooling that caused it to partially mis-report that the disk had been resized when it actually hadn't.
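If you suspect the same mismatch, a quick sanity check is to compare what Kubernetes thinks the volume size is against what EBS actually reports. This is just a rough sketch; the PVC name, namespace, and volume ID below are placeholders you'd substitute with your own:

```sh
# Size reported by the PVC and its bound PV inside Kubernetes (placeholder names)
kubectl get pvc my-data-pvc -n my-namespace \
  -o jsonpath='{.status.capacity.storage}{"\n"}'
kubectl get pv "$(kubectl get pvc my-data-pvc -n my-namespace -o jsonpath='{.spec.volumeName}')" \
  -o jsonpath='{.spec.capacity.storage}{"\n"}'

# Size AWS actually provisioned for the backing EBS volume (in GiB, placeholder volume ID)
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[0].Size' --output text
```

If the EBS size and the PV/PVC capacity disagree, you're likely in the same mis-reported-resize situation.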
To fix this problem for this customer I did a few things...
That seemed to solve things. Sorry I didn't clarify that earlier. Hope this helps you, and anyone else who might run into this! Even if you aren't using AWS as your provider, I think a similar set of steps would still help: some combination of upgrading EKS (or your managed Kubernetes), upgrading your storage driver, stopping the pod using the volume, resizing the volume manually, and then starting the pod back up, as sketched below.
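For reference, a manual recovery along those lines might look roughly like this on AWS. It's only a sketch, not the exact commands I ran: the deployment name, namespace, volume ID, and target size are placeholders, and on a current EBS CSI driver you'd normally just edit the PVC and let the driver handle the expansion:

```sh
# Stop the pod using the volume (placeholder deployment/namespace)
kubectl scale deployment my-app -n my-namespace --replicas=0

# Resize the backing EBS volume manually (placeholder volume ID, target size in GiB)
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 100

# Wait for the modification to leave the "modifying" state before restarting
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0 \
  --query 'VolumesModifications[0].ModificationState' --output text

# Start the pod back up; filesystem expansion is typically handled when the volume is remounted
kubectl scale deployment my-app -n my-namespace --replicas=1
```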
There appears to be a bug in Prometheus Server which causes kubelet_volume_stats_capacity_bytes to not be updated properly in Prometheus after a resize. Note: I may need to go file a bug against metrics-server or Prometheus. After further investigation, it appears the Prometheus metric kube_persistentvolume_capacity_bytes, which is tied to the "PV" and not the "PVC", is fully updated, and we could (in theory) look there for the updated value instead, but I believe this to be a bug which should be fixed in Prometheus.
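To see the discrepancy for yourself, you can query both metrics side by side. A rough sketch, assuming a Prometheus API reachable at localhost:9090 and placeholder PVC/PV names; the exact labels depend on your kubelet and kube-state-metrics scrape configs:

```sh
# Stale capacity as reported via the kubelet (keyed by the PVC)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="my-data-pvc"}'

# Updated capacity as reported for the PV (keyed by the PV name)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=kube_persistentvolume_capacity_bytes{persistentvolume="pvc-0123abcd"}'
```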