centreon / centreon-plugins

Collection of standard plugins to discover and gather cloud-to-edge metrics and status across your whole IT infrastructure.
https://www.centreon.com
Apache License 2.0

Prometheus API - Misinterpretation of returned data #4516

Closed StefThomas closed 1 year ago

StefThomas commented 1 year ago

Hi,

I have the following check, which is supposed to alert if an OpenShift pod has restarted in the last hour:

$ /usr/lib/centreon/plugins//centreon_prometheus_api.pl --plugin=cloud::prometheus::restapi::plugin --mode=expression --hostname=prometheus-k8s-openshift-monitoring.apps.ocp37hpd1.hm.dm.ad --url-path='/api/v1' --port='443' --proto='https' --verbose --header="Authorization: Bearer e***3Q"  --query='restarts,max_over_time(kube_pod_container_status_restarts_total{namespace="beryl-rec1"}[1h]) or on() absent(nonexistent{pod="null"})' --instance='pod' --output='Restarts du pod %{instance} (%{restarts})' --multiple-output='Aucun redémarrage pour le namespace "beryl-rec1"' --warning-status='%{restarts} > 0 and %{instance} !~ /^null$/' --critical-status='%{restarts} > 5 and %{instance} !~ /^null$/'
OK: Aucun redémarrage pour le namespace "beryl-rec1" | 'restarts_beryl-api-64d7fdb854-k8w76'=0;;;; 'restarts_beryl-batch-8b74cc97b-hqgrf'=0;;;; 'restarts_beryl-front-759b9b94d4-5jqpg'=0;;;;
Restarts du pod beryl-api-64d7fdb854-k8w76 (0)
Restarts du pod beryl-batch-8b74cc97b-hqgrf (0)
Restarts du pod beryl-front-759b9b94d4-5jqpg (0)

It returns an OK status although the data returned by Prometheus is the following:

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "restarts",
          "container": "nginx",
          "endpoint": "https-main",
          "job": "kube-state-metrics",
          "namespace": "beryl-rec1",
          "pod": "beryl-front-759b9b94d4-5jqpg",
          "service": "kube-state-metrics",
          "uid": "a11797ce-6963-4f78-9b9a-76c381f37cae"
        },
        "value": [
          1687960012.192,
          "0"
        ]
      },
      {
        "metric": {
          "__name__": "restarts",
          "container": "springboot",
          "endpoint": "https-main",
          "job": "kube-state-metrics",
          "namespace": "beryl-rec1",
          "pod": "beryl-api-64d7fdb854-k8w76",
          "service": "kube-state-metrics",
          "uid": "0c57cc8a-183e-44ce-bf16-64df989c72be"
        },
        "value": [
          1687960012.192,
          "67"
        ]
      },
      {
        "metric": {
          "__name__": "restarts",
          "container": "springboot",
          "endpoint": "https-main",
          "job": "kube-state-metrics",
          "namespace": "beryl-rec1",
          "pod": "beryl-batch-8b74cc97b-hqgrf",
          "service": "kube-state-metrics",
          "uid": "5c0b5059-6442-4a9e-8baf-d36013f3b54a"
        },
        "value": [
          1687960012.192,
          "0"
        ]
      },
      {
        "metric": {
          "__name__": "restarts",
          "container": "undertow-logs",
          "endpoint": "https-main",
          "job": "kube-state-metrics",
          "namespace": "beryl-rec1",
          "pod": "beryl-api-64d7fdb854-k8w76",
          "service": "kube-state-metrics",
          "uid": "0c57cc8a-183e-44ce-bf16-64df989c72be"
        },
        "value": [
          1687960012.192,
          "0"
        ]
      },
      {
        "metric": {
          "__name__": "restarts",
          "container": "undertow-logs",
          "endpoint": "https-main",
          "job": "kube-state-metrics",
          "namespace": "beryl-rec1",
          "pod": "beryl-batch-8b74cc97b-hqgrf",
          "service": "kube-state-metrics",
          "uid": "5c0b5059-6442-4a9e-8baf-d36013f3b54a"
        },
        "value": [
          1687960012.192,
          "0"
        ]
      }
    ]
  }
}

As there is a value of 67 for pod “beryl-api-64d7fdb854-k8w76” it shouldn’t return an OK status.

The plugin version used comes from the package centreon-plugin-Cloud-Prometheus-Api-20230117-074217.el8.noarch.

With the version from the package centreon-plugin-Cloud-Prometheus-Api-20230608-122119.el8.noarch, this check works, but then another of our checks no longer does:

$ /usr/lib/centreon/plugins//centreon_prometheus_api.pl --plugin=cloud::prometheus::restapi::plugin --mode=expression --hostname=prometheus-k8s-openshift-monitoring.apps.ocp37hpd1.hm.dm.ad --url-path='/api/v1' --port='443' --proto='https'  --verbose --header="Authorization: Bearer eyJhb***b3Q"  --query='desired,kube_deployment_spec_replicas{deployment="beryl-api",namespace="beryl-rec1"}' --query='available,kube_deployment_status_replicas_available{deployment="beryl-api",namespace="beryl-rec1"}' --instance='deployment' --output='Réplicats pour beryl-rec1/%{instance}: désirés %{desired}, disponibles %{available}' --multiple-output='' --warning-status='%{available} < %{desired}' --critical-status='%{available} == 0' --verbose
OK:
status : skipped (no value(s))
status : skipped (no value(s))

while data returned by the API is the following:

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "desired",
          "container": "kube-rbac-proxy-main",
          "deployment": "beryl-api",
          "endpoint": "https-main",
          "job": "kube-state-metrics",
          "namespace": "beryl-rec1",
          "service": "kube-state-metrics"
        },
        "value": [
          1687961378.685,
          "1"
        ]
      }
    ]
  }
}

Since I obviously can’t use a different plugin version for one particular check, I need both checks to work with the same version (preferably the latest one, of course).

garnier-quentin commented 1 year ago

I think it's already fixed. The new version is due for release on 6 July. Otherwise, you can use the unstable plugin.

StefThomas commented 1 year ago

That would be nice. I also tested the version Stéphane Duret sent me on June 19th, and the problem is still present. I’ll try the unstable version.

StefThomas commented 1 year ago

Sadly the issue is still present in centreon-plugin-Cloud-Prometheus-Api-20230628-083750.el8.noarch

To be clear: the issue which led to a "skipped (no value)" result is fixed, but not the first one I reported (i.e. the pod restart not being detected).

The strangest thing is that the latest stable version (20230608) fixes the first issue (the restart one), but the latest unstable version (20230628) does not. In other words, I can’t find a version with both issues fixed.

Also, why tag the issue as "question"?

garnier-quentin commented 1 year ago

I have checked and the mode works as expected. As you can see, two entries share the same pod value, which is why you get 0 and not 67: for a given instance, the mode uses the last value in the JSON.

    {
        "metric": {
            "__name__": "restarts",
            "container": "springboot",
            "endpoint": "https-main",
            "job": "kube-state-metrics",
            "namespace": "beryl-rec1",
            "pod": "beryl-api-64d7fdb854-k8w76",
            "service": "kube-state-metrics",
            "uid": "0c57cc8a-183e-44ce-bf16-64df989c72be"
        },
        ....
        "value": [
            1687960012.192,
            "67"
        ]
    },
    {
        "metric": {
            "__name__": "restarts",
            "container": "undertow-logs",
            "endpoint": "https-main",
            "job": "kube-state-metrics",
            "namespace": "beryl-rec1",
            "pod": "beryl-api-64d7fdb854-k8w76",
            "service": "kube-state-metrics",
            "uid": "0c57cc8a-183e-44ce-bf16-64df989c72be"
        },
        "value": [
            1687960012.192,
            "0"
        ]
    },

You should use the following options: --instance='pod' --instance='container' --output='Restarts du pod %{instance} container %{container} (%{restarts})'
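
To illustrate why the instance key matters, here is a minimal Python sketch. It is not the plugin's actual (Perl) code: the helper `group_last_wins` is hypothetical and only assumes the "last value wins per instance" behaviour described above. The sample data is a trimmed copy of the Prometheus response from the first comment.

```python
# Trimmed copy of the "result" entries from the Prometheus response,
# as (pod, container, value) tuples in the order they were returned.
samples = [
    ("beryl-front-759b9b94d4-5jqpg", "nginx", 0),
    ("beryl-api-64d7fdb854-k8w76", "springboot", 67),
    ("beryl-batch-8b74cc97b-hqgrf", "springboot", 0),
    ("beryl-api-64d7fdb854-k8w76", "undertow-logs", 0),
    ("beryl-batch-8b74cc97b-hqgrf", "undertow-logs", 0),
]


def group_last_wins(samples, key):
    """Hypothetical helper: keep one value per instance key.

    A later entry with the same key overwrites the earlier one,
    which is the assumed behaviour being illustrated here.
    """
    grouped = {}
    for pod, container, value in samples:
        grouped[key(pod, container)] = value
    return grouped


# Keyed on the pod label only: both containers of a pod collide.
by_pod = group_last_wins(samples, lambda pod, container: pod)

# Keyed on pod and container: every series keeps its own value.
by_pod_container = group_last_wins(
    samples, lambda pod, container: (pod, container)
)

print(by_pod["beryl-api-64d7fdb854-k8w76"])
print(by_pod_container[("beryl-api-64d7fdb854-k8w76", "springboot")])
```

With the pod as the only key, the "undertow-logs" entry (0) overwrites the "springboot" entry (67) for beryl-api-64d7fdb854-k8w76, so the restart count is lost. Adding the container to the key keeps the two series apart.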

StefThomas commented 1 year ago

OK. I’m closing the issue.