cloudfoundry / prometheus-boshrelease

bosh release for prometheus ecosystem
Apache License 2.0

Error calling humanizeDuration: can't convert int to float #432

Status: Closed (closed by bg-govau 2 years ago)

bg-govau commented 2 years ago

After going from v26.5.0 to v26.6.0, we've noticed that the BOSH disk predict alerts that use humanizeDuration now fail to render with a template error:

<error expanding template: error executing template alert_BOSHJobEphemeralDiskPredictWillFill: template: __alert_BOSHJobEphemeralDiskPredictWillFill:1:297: executing "alert_BOSHJobEphemeralDiskPredictWillFill" at <humanizeDuration 14400>: error calling humanizeDuration: can't convert int to float>

We have alertmanager pushing alerts to slack, and the above is what was posted to slack.

In our environment, whenever we change an instance's stemcell from xenial to bionic, the BOSH disk predict alerts fire immediately and keep firing for a little while, so these were pretty obvious to us. It probably affects every alert that passes an integer to a humanize* function.

We were previously just using the default values for these properties, but as a workaround I found that setting them in an ops file like this fixes the issue:

- type: replace
  path: /instance_groups/name=prometheus2/jobs/name=bosh_alerts/properties?/bosh_alerts
  value:
    job_predict_system_disk_full:
      predict_time: "14400.0"
    job_predict_ephemeral_disk_full:
      predict_time: "14400.0"
    job_predict_persistent_disk_full:
      predict_time: "14400.0"

This ensures that the values in /var/vcap/jobs/bosh_alerts/bosh_system_predict.alerts.yml are written as floats instead of integers, i.e.:

...
        annotations:
          summary: "BOSH Job `{{$labels.environment}}/{{$labels.bosh_name}}/{{$labels.bosh_deployment}}/{{$labels.bosh_job_name}}/{{$labels.bosh_job_index}}` will run out of ephemeral disk in {{humanizeDuration 14400.0}}"
          description: "BOSH Job `{{$labels.environment}}/{{$labels.bosh_name}}/{{$labels.bosh_deployment}}/{{$labels.bosh_job_name}}/{{$labels.bosh_job_index}}` ephemeral disk will be used more than 80% in {{humanizeDuration 14400.0}}"
...

I found some upstream discussion in Prometheus about making their templating functions accept ints as well, so this might no longer be an issue after the next Prometheus bump? See https://github.com/prometheus/prometheus/issues/9679

benjaminguttmann-avtq commented 2 years ago

Hi @bg-govau,

With the next release we will bump to the latest Prometheus version and check whether that solves the issue. Otherwise we will need to adjust the default values used.

benjaminguttmann-avtq commented 2 years ago

Hi @bg-govau,

Please check whether v26.7.0 fixes the issue.

bg-govau commented 2 years ago

Thanks @benjaminguttmann-avtq, I'll give it a go, hopefully next week.

bg-govau commented 2 years ago

I updated to v26.7.0 and removed the ops file workaround. Alerts now look fine :tada: Thanks for that.