Closed: rkdutta closed this issue 1 year ago
@rkdutta Can you grab logs from the pod to show me? Or from your log aggregation system if your pod has since died/been restarted? Also, are you using the latest version, or what version are you running?
@AndrewFarley Thanks for responding. Hope the following information helps. If you need more inputs please let me know.
Restart: The pod has been running for more than 3 days now and has never been restarted.
Version: volume-autoscaler-1.0.6 (installed via Helm), just a default installation with the configuration mentioned in the ticket.
Image: devopsnirvana/kubernetes-volume-autoscaler:1.0.6
Repo: devops-nirvana https://devops-nirvana.s3.amazonaws.com/helm-charts/
`helm search repo devops-nirvana` output:

| NAME | CHART VERSION | APP VERSION | DESCRIPTION |
| --- | --- | --- | --- |
| devops-nirvana/argo-cronjob | 1.0.32 | | The Universal Argo Cronjob/CronWorkflow Helm Chart |
| devops-nirvana/cronjob | 1.0.32 | | The Universal Cronjob Helm Chart |
| devops-nirvana/cronjob-multi | 1.0.32 | | The Universal Cronjob Multi Helm Chart, to spin... |
| devops-nirvana/deployment | 1.0.32 | | The Universal Deployment Helm Chart |
| devops-nirvana/statefulset | 1.0.32 | | The Universal Statefulset Helm Chart |
| devops-nirvana/volume-autoscaler | 1.0.6 | 1.0.6 | Volume Autoscaler scales Kubernetes volumes up ... |
Logs related to the above alerts:
Volume test-claim1 is 100% in-use of the 3G available
BECAUSE it is above 80% used
ALERT has been for 1 period(s) which needs to at least 5 period(s) to scale
BUT need to wait for 5 intervals in alert before considering to scale
FYI this has desired_size 3G and current size 3G
Volume test-claim1 is 100% in-use of the 3G available
BECAUSE it is above 80% used
ALERT has been for 2 period(s) which needs to at least 5 period(s) to scale
BUT need to wait for 5 intervals in alert before considering to scale
FYI this has desired_size 3G and current size 3G
Querying and found 16 valid PVCs to assess in prometheus
Volume test-claim1 is 100% in-use of the 3G available
BECAUSE it is above 80% used
ALERT has been for 3 period(s) which needs to at least 5 period(s) to scale
BUT need to wait for 5 intervals in alert before considering to scale
FYI this has desired_size 3G and current size 3G
Volume test-claim1 is 100% in-use of the 3G available
BECAUSE it is above 80% used
ALERT has been for 4 period(s) which needs to at least 5 period(s) to scale
BUT need to wait for 5 intervals in alert before considering to scale
FYI this has desired_size 3G and current size 3G
Querying and found 16 valid PVCs to assess in prometheus
Volume test-claim1 is 100% in-use of the 3G available
BECAUSE it is above 80% used
ALERT has been for 5 period(s) which needs to at least 5 period(s) to scale
AND we need to scale it immediately, it has never been scaled previously
RESIZING disk from 3G to 4G
Desired New Size: 4000000000
Actual New Size: 4000000000
Successfully requested to scale up `test-claim1` by `20%` from `3G` to `4G`, it was using more than `80%` disk space over the last `300 seconds`
Volume test-claim1 is 100% in-use of the 3G available
BECAUSE it is above 80% used
ALERT has been for 6 period(s) which needs to at least 5 period(s) to scale
AND we need to scale it immediately, it has never been scaled previously
RESIZING disk from 3G to 4G
Desired New Size: 4000000000
Actual New Size: 4000000000
Successfully requested to scale up `test-claim1` by `20%` from `3G` to `4G`, it was using more than `80%` disk space over the last `360 seconds`
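For context, the decision those logs walk through boils down to roughly the following (a rough sketch only, not the project's actual code: the 80% threshold, 5-period requirement, and 20% scale-up step are read off the log lines above, and rounding the new size up to a whole gigabyte is my assumption based on the `Desired New Size: 4000000000` line):

```python
import math

# Rough sketch of the scaling decision described by the logs above
# (illustrative only, not the autoscaler's actual code).
USAGE_THRESHOLD = 0.80      # "above 80% used"
PERIODS_REQUIRED = 5        # "needs ... at least 5 period(s) to scale"
SCALE_UP_PERCENT = 0.20     # "scale up ... by 20%"
GIGABYTE = 1_000_000_000

def assess(current_size_bytes: int, usage_ratio: float, periods_in_alert: int):
    """Return a new desired size in bytes, or None if no resize is needed yet."""
    if usage_ratio < USAGE_THRESHOLD:
        return None                      # below the usage threshold: not in alert
    if periods_in_alert < PERIODS_REQUIRED:
        return None                      # "need to wait for 5 intervals in alert"
    grown = current_size_bytes * (1 + SCALE_UP_PERCENT)
    return math.ceil(grown / GIGABYTE) * GIGABYTE   # 3G * 1.2 = 3.6G, rounded up to 4G

print(assess(3 * GIGABYTE, 1.00, 5))     # 4000000000, matching the log above
```

Note that nothing in a loop like that, by itself, stops the very next interval (period 6) from requesting the same resize again if Kubernetes has not yet reflected the first change, which looks like what the second `RESIZING disk from 3G to 4G` entry above shows.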
Can anyone help or advise?
@rkdutta I've reviewed some of the code and nothing stands out as a change that I can make. I will try to improve some of the logging for a release/update I'm making for this service later today or tomorrow. If you can, try the new version and let me know whether the issue still persists. Thanks. I'll let you know when I release it...
I think I'm going to add some debounce logic to internally prevent it from trying to modify a volume more than once in quick succession. It seems like your volume somehow didn't update properly in Kubernetes, even though nothing reported that to you. Can you tell me which Kubernetes provider you're on (cloud or self-hosted) and which storage controller you're using? @rkdutta
I've added a debounce in https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/commit/51d18484304570d9ae36404ed1a8a7235c217fef and will be releasing this shortly and closing this bug. After I release the new version, please try it and report back if your issue still persists. It shouldn't happen any more if the cause is what I suspect, which is just your storage controller taking a while to fully update Kubernetes.
@rkdutta There's an improvement in 1.0.7, which was just released and has been published to the Helm chart repository. Please update your deployment and let me know if this happens again. There is now debounce logic inside which prevents the engine from re-trying the same volume resize for at least 10 intervals. I believe that may help your situation, and it's generally not going to be harmful for anyone else.
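In spirit, the debounce is just a per-volume cooldown counter. A minimal sketch of the idea (not the actual code from the commit linked above; the names here are invented for illustration):

```python
# Minimal sketch of a per-volume debounce/cooldown (illustrative only; these
# names are made up and this is not the code from the linked commit).
DEBOUNCE_INTERVALS = 10    # "at least 10 intervals", per the comment above

last_resize_interval = {}  # PVC name -> interval counter at the last resize request

def may_resize(pvc_name: str, current_interval: int) -> bool:
    """Skip any PVC that was resized within the last DEBOUNCE_INTERVALS intervals."""
    last = last_resize_interval.get(pvc_name)
    if last is not None and current_interval - last < DEBOUNCE_INTERVALS:
        return False       # still cooling down; let the storage controller catch up
    return True

def record_resize(pvc_name: str, current_interval: int) -> None:
    """Remember when we last asked Kubernetes to resize this PVC."""
    last_resize_interval[pvc_name] = current_interval
```

With something like that in place, the duplicate resize request at period 6 in the logs above would simply be skipped while the storage controller finishes applying the first one.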
Closing this issue as resolved. Please re-open it or open a new one if there are any further issues or more information. Thanks!
Hello,
I am trying out the autoscaler, and the solution is working as expected: the controller successfully increases the volume according to the settings defined below. However, I am receiving the related alerts all at the same timestamp. It seems that when the volume is finally increased, all the previous warnings arrive at once (reference: 2 alerts arrived at 1:30 PM in the snapshot below). Please note that the evidence snapshot below shows this for 2 alerts, but the same behaviour was observed on subsequent occurrences.
Is this a bug or the expected behaviour of the controller? If there are any settings that should be changed, please advise.
Settings: