DevOps-Nirvana / Kubernetes-Volume-Autoscaler

Autoscaling volumes for Kubernetes (with the help of Prometheus)
Apache License 2.0

All volume expansion alerts arriving at the same timestamp in Slack. WHY? #10

Closed: rkdutta closed this issue 1 year ago

rkdutta commented 1 year ago

Hello,

I am trying out the autoscaler and it is working as expected: the controller successfully increases the volume according to the settings below. However, I am receiving the related alerts at the same timestamp. It seems that once the volume is finally increased, all of the earlier warnings arrive at once (for reference, 2 alerts arrived at 1:30 PM in the snapshot below). Note that the snapshot shows this for 2 alerts, but the same behaviour occurred on subsequent occasions.

Is this a bug or expected behaviour of the controller? If there are any settings that should be changed, please advise.

[Screenshot: both Slack alerts arriving at 1:30 PM]

Settings:

    - name: PROMETHEUS_URL
      value: <PROMETHEUS_URL>
    - name: SLACK_WEBHOOK_URL
      value: <SLACK_WEBHOOK_URL>
    - name: SLACK_CHANNEL
      value: <SLACK_CHANNEL>
    - name: INTERVAL_TIME
      value: "60"
    - name: SCALE_AFTER_INTERVALS
      value: "5"
    - name: SCALE_ABOVE_PERCENT
      value: "80"
    - name: SCALE_UP_PERCENT
      value: "20"
    - name: SCALE_UP_MIN_INCREMENT
      value: "1000000000"
    - name: SCALE_UP_MAX_INCREMENT
      value: "16000000000000"
    - name: SCALE_UP_MAX_SIZE
      value: "16000000000000"
    - name: SCALE_COOLDOWN_TIME
      value: "22200"
    - name: DRY_RUN
    - name: PROMETHEUS_LABEL_MATCH
    - name: HTTP_TIMEOUT
      value: "15"
    - name: VERBOSE
      value: "false"
    - name: VICTORIAMETRICS_MODE
      value: "false"

AndrewFarley commented 1 year ago

@rkdutta Can you grab logs from the pod to show me? Or from your log aggregation system if your pod has since died/been restarted? Also, are you using the latest version, or what version are you running?

rkdutta commented 1 year ago

@AndrewFarley Thanks for responding. Hope the following information helps. If you need more details, please let me know.

Restart: The pod has been running for more than 3 days now and has never restarted.
Version: volume-autoscaler 1.0.6 (installed via Helm), just the default installation with the configuration listed in this ticket.
Image: devopsnirvana/kubernetes-volume-autoscaler:1.0.6
Repo:

devops-nirvana          https://devops-nirvana.s3.amazonaws.com/helm-charts/
➜  ~ helm search repo devops-nirvana
NAME                                CHART VERSION   APP VERSION DESCRIPTION
devops-nirvana/argo-cronjob         1.0.32                      The Universal Argo Cronjob/CronWorkflow Helm Chart
devops-nirvana/cronjob              1.0.32                      The Universal Cronjob Helm Chart
devops-nirvana/cronjob-multi        1.0.32                      The Universal Cronjob Multi Helm Chart, to spin...
devops-nirvana/deployment           1.0.32                      The Universal Deployment Helm Chart
devops-nirvana/statefulset          1.0.32                      The Universal Statefulset Helm Chart
devops-nirvana/volume-autoscaler    1.0.6           1.0.6       Volume Autoscaler scales Kubernetes volumes up ...

Logs related to the above alerts:

Volume test-claim1 is 100% in-use of the 3G available
  BECAUSE it is above 80% used
  ALERT has been for 1 period(s) which needs to at least 5 period(s) to scale
  BUT need to wait for 5 intervals in alert before considering to scale
  FYI this has desired_size 3G and current size 3G
Volume test-claim1 is 100% in-use of the 3G available
  BECAUSE it is above 80% used
  ALERT has been for 2 period(s) which needs to at least 5 period(s) to scale
  BUT need to wait for 5 intervals in alert before considering to scale
  FYI this has desired_size 3G and current size 3G
Querying and found 16 valid PVCs to assess in prometheus
Volume test-claim1 is 100% in-use of the 3G available
  BECAUSE it is above 80% used
  ALERT has been for 3 period(s) which needs to at least 5 period(s) to scale
  BUT need to wait for 5 intervals in alert before considering to scale
  FYI this has desired_size 3G and current size 3G
Volume test-claim1 is 100% in-use of the 3G available
  BECAUSE it is above 80% used
  ALERT has been for 4 period(s) which needs to at least 5 period(s) to scale
  BUT need to wait for 5 intervals in alert before considering to scale
  FYI this has desired_size 3G and current size 3G
Querying and found 16 valid PVCs to assess in prometheus
Volume test-claim1 is 100% in-use of the 3G available
  BECAUSE it is above 80% used
  ALERT has been for 5 period(s) which needs to at least 5 period(s) to scale
  AND we need to scale it immediately, it has never been scaled previously
  RESIZING disk from 3G to 4G
  Desired New Size: 4000000000
  Actual New Size: 4000000000
Successfully requested to scale up `test-claim1` by `20%` from `3G` to `4G`, it was using more than `80%` disk space over the last `300 seconds`
Volume test-claim1 is 100% in-use of the 3G available
  BECAUSE it is above 80% used
  ALERT has been for 6 period(s) which needs to at least 5 period(s) to scale
  AND we need to scale it immediately, it has never been scaled previously
  RESIZING disk from 3G to 4G
  Desired New Size: 4000000000
  Actual New Size: 4000000000
Successfully requested to scale up `test-claim1` by `20%` from `3G` to `4G`, it was using more than `80%` disk space over the last `360 seconds`
rkdutta commented 1 year ago

Can anyone help or advise?

AndrewFarley commented 1 year ago

@rkdutta I've reviewed some of the code and nothing stands out as a change I can make. I will try to improve some of the logging in a release/update I'm making for this service later today or tomorrow. If you can, please try the new version and let me know if the issue still persists. Thanks. I'll let you know when I release it...

AndrewFarley commented 1 year ago

I think I'm going to add some debounce logic to internally prevent it from trying to modify a volume more than once in quick succession. It seems like your volume may not have updated properly in Kubernetes somehow, even though nothing reported that. Can you tell me what Kubernetes provider you're on (cloud or self-hosted), and what storage controller you're using? @rkdutta

AndrewFarley commented 1 year ago

I've added a debounce in https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/commit/51d18484304570d9ae36404ed1a8a7235c217fef and will be releasing this shortly and closing this bug. After I release the new version, please try it and report back if your issue still persists. It shouldn't happen any more if the cause was what I suspect: your storage controller taking a while to fully update Kubernetes.

AndrewFarley commented 1 year ago

@rkdutta There's an improvement in 1.0.7, which was just released and has been published to the Helm chart repository. Please update your deployment and let me know if this happens again. There is now debounce logic inside which prevents the engine from re-trying the same volume resize for at least 10 intervals. I believe that may help your situation, and it's generally not harmful for anyone else.
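
For anyone reading later, a minimal sketch of what such a per-PVC debounce could look like (illustrative only; the real change is in the commit linked above, and these names are hypothetical):

    # Illustrative sketch of a per-PVC debounce window, not the actual 1.0.7 code.
    # After requesting a resize, skip that PVC for a number of main-loop intervals
    # so a slow storage controller cannot trigger a second resize of the same volume.
    DEBOUNCE_INTERVALS = 10

    recently_scaled = {}  # pvc name -> intervals remaining in the debounce window

    def mark_scaled(pvc_name):
        """Call immediately after requesting a resize for this PVC."""
        recently_scaled[pvc_name] = DEBOUNCE_INTERVALS

    def should_skip(pvc_name):
        """Return True while the PVC is still inside its debounce window."""
        remaining = recently_scaled.get(pvc_name, 0)
        if remaining > 0:
            recently_scaled[pvc_name] = remaining - 1
            return True
        return False

With INTERVAL_TIME set to 60, 10 intervals works out to roughly a 10-minute window per volume.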

Closing this issue as resolved. Please re-open it or open a new one if there are any further issues or additional information. Thanks!