PureStorage-OpenConnect / pure-fa-openmetrics-exporter

Pure Storage OpenMetrics exporter for FlashArray
Apache License 2.0

500 Internal Server Error for Volume Metrics #5

Closed jmg011 closed 1 year ago

jmg011 commented 1 year ago

Problem: The server returns HTTP status 500 Internal Server Error when Prometheus scrapes the volumes endpoint below.

http://fa-mgnt-endpoint:port/metrics/volumes

The other endpoints work fine.

Things I tried to debug the issue but did not help:

1) Increased timeout and scrape interval from 2 minutes to 10 minutes.
2) Removed all other endpoints and set up only the volume metrics endpoint.
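When a scrape fails with a 500, the response body sometimes carries more detail than the status line that Prometheus reports. The following is a minimal sketch, not part of the exporter, that fetches an endpoint directly and returns both the status code and the body. The function name `probe` and its `opener` parameter are illustrative assumptions; the host and port in the usage comment are the placeholders from the report above.

```python
import urllib.error
import urllib.request
from typing import Callable


def probe(url: str, timeout: float = 30.0,
          opener: Callable = urllib.request.urlopen) -> tuple[int, str]:
    """Fetch url and return (status_code, body), also for HTTP error responses.

    `opener` is a hypothetical hook so the function can be exercised
    without a live exporter; by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the error object still carries the body.
        return err.code, err.read().decode("utf-8", errors="replace")


# Usage (replace the placeholder host and port with the real exporter address):
# status, body = probe("http://fa-mgnt-endpoint:port/metrics/volumes")
# print(status, body[:500])
```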

jmg011 commented 1 year ago

Hi @cmautner1 - Can you help debug this issue?

genegr commented 1 year ago

This should be solved by the new exporter, which has been completely refactored in Go. Please give it a try if/when you can and provide feedback. I am keeping the issue open for a while just for this purpose.

jmg011 commented 1 year ago

@genegr we are using the latest release. Can you provide the release name that has the fix?

jmg011 commented 1 year ago

@genegr it looks like a new version was released 10 days ago. Can you confirm this is the one that has the fix? https://quay.io/repository/purestorage/pure-fa-om-exporter?tab=tags&tag=latest

Also, were any metric names or label names changed between this release and the one we are using, quay.io/purestorage/pure-fa-ome latest?

I ask because the new Pure FlashBlade exporter has major changes in metric and label names, and we are having to update all of our alerting and dashboards. I just want to know up front whether that is the case with the block exporter as well.

genegr commented 1 year ago

@jmg011 the current release is 1.0.1. Starting from v1.0.0 the code has been completely refactored in Go and should not show the odd behaviors of the old code. It uses Resty to handle the communication with the FA API server, which makes the code much more adaptive to the different JSON responses returned by the API. About names and labels: we tried to keep the previous naming for the metrics, but we changed the labels that were too generic into more meaningful names. This mainly happened for the host metrics, in which we changed the generic "name" label into "host".
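For dashboards and alert rules that select host metrics by the old label, the "name" to "host" rename can be applied mechanically. The sketch below is one possible approach, assuming the affected metrics share a common name prefix; the function `rename_label`, the prefix `purefa_host_`, and the metric names in the usage comment are hypothetical. It only rewrites label matchers inside `{...}` selectors, so `by (name)` clauses and label values containing `}` would still need manual review.

```python
import re


def rename_label(expr: str, metric_prefix: str, old: str, new: str) -> str:
    """Rename label `old` to `new` in selectors of metrics starting with metric_prefix."""
    # Match "<metric>{<matchers>}" for metrics with the given prefix.
    selector = re.compile(r"\b(" + re.escape(metric_prefix) + r"\w*)\{([^}]*)\}")
    # Match the label name only at the start of the matcher list or after a comma.
    label = re.compile(r"(^|,)\s*" + re.escape(old) + r"\s*(=~|!~|!=|=)")

    def fix(match: re.Match) -> str:
        body = label.sub(lambda lm: f"{lm.group(1)}{new}{lm.group(2)}", match.group(2))
        return f"{match.group(1)}{{{body}}}"

    return selector.sub(fix, expr)


# Usage with hypothetical metric names:
# rename_label('rate(purefa_host_example{name="h1"}[5m])', "purefa_host_", "name", "host")
```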

jmg011 commented 1 year ago

I tested it. The volume metrics are working on the new version. Thank you for fixing it. However, many metric and label names have changed, and I had to map them to our current metrics and labels so that we can replace them in production later.

Maybe going forward, the better approach would be to provide backward compatibility in new exporters. For example, in the rare case where you need to change a metric name or its labels, the new metric could be published under the new name while the previous metric name is kept as well, so that the user can decide when to cut the alerts over to the new metric after upgrading the exporter. This would give users the flexibility to get the new metrics as part of a new exporter release without having to immediately change alerting on the current metrics/labels. What do you think?
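The dual-publication idea described above can be illustrated on the text exposition format. This is only a sketch of the concept, not something the exporter does: the function `with_legacy_aliases` and the metric names `example_new` and `example_old` are hypothetical. It appends a copy of every aliased sample under its previous name, grouped at the end so that the samples of each metric stay together. In a real deployment the same effect is more commonly achieved on the Prometheus side, for instance with recording rules.

```python
def with_legacy_aliases(exposition: str, aliases: dict[str, str]) -> str:
    """Return exposition text with aliased samples repeated under their old names.

    `aliases` maps a new metric name to the previous name it replaces.
    """
    lines = exposition.splitlines()
    legacy: dict[str, list[str]] = {}
    for line in lines:
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" or space, whichever comes first.
        cut = min((i for i in (line.find("{"), line.find(" ")) if i != -1),
                  default=len(line))
        old_name = aliases.get(line[:cut])
        if old_name is not None:
            legacy.setdefault(old_name, []).append(old_name + line[cut:])
    for group in legacy.values():
        lines.extend(group)
    return "\n".join(lines) + "\n"
```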

jmg011 commented 1 year ago

@genegr Hi, is there any documentation on what the values of the purefb_hardware_health metric mean?

The possible values I currently see are 0, 1, and 2.

genegr commented 1 year ago

In the new Go version the metric is now named purefa_hw_component_status and its value is always set to 1. There is now a specific status label that reflects the internal value returned by the REST API for the same parameter. The same change has also been applied to the FlashBlade exporter, for which the metric is named purefb_hw_component_status.
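Because the sample value is always 1, health checks against this metric have to inspect the status label rather than the value. The sketch below shows one way to do that on scraped exposition text. Several details are assumptions not confirmed in this thread: that the label is literally named `status`, that `ok` is the healthy value, and that a `component_name` label exists. The function `unhealthy_components` is likewise hypothetical.

```python
import re

# Sample lines of the status metric: name, label set in braces, then the value.
SAMPLE = re.compile(r'^purefa_hw_component_status\{(.*)\}\s')
# A single label pair, allowing escaped characters inside the quoted value.
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def unhealthy_components(exposition: str,
                         healthy: frozenset = frozenset({"ok"})) -> list[dict]:
    """Return the label sets of status samples whose status label is not healthy."""
    bad = []
    for line in exposition.splitlines():
        match = SAMPLE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and other metrics
        labels = dict(LABEL.findall(match.group(1)))
        # Assumed label name "status"; adjust to what the exporter actually emits.
        if labels.get("status") not in healthy:
            bad.append(labels)
    return bad
```

The equivalent idea in an alert rule would be a selector that excludes the healthy status value, again subject to the actual label name and values the exporter emits.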

jmg011 commented 1 year ago

@genegr Thank you.

jmg011 commented 1 year ago

@genegr I just tested the new exporter with this metric change. I no longer see the severity label on the purefa_hw_component_status metric. Is there any reason why severity was removed for this new metric?

Or should we use purefa_alerts_open for open alerts with severity, and purefa_hw_component_status separately for component health checks?

genegr commented 1 year ago

@jmg011 The metrics specific to hardware components have now changed into these:

purefa_hw_component_status
purefa_hw_component_temperature_celsius
purefa_hw_component_voltage_volt

jmg011 commented 1 year ago

@genegr Thanks. So if there are any hardware failures that are reflected in the purefa_hw_component* metrics, will those also show up under the purefa_alerts_open metric as open alerts?

sdodsley commented 1 year ago

@jmg011 the arrays metric had a bug. This has been resolved in the v1.0.3 release.

genegr commented 1 year ago

Going to close this, as it has been fixed by the introduction of release 1.0.0.