DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

[prometheus-check] [RFE] Prometheus check fails for Spring Boot Actuator endpoint due to lack of configurable expected format #3946

Closed onelapahead closed 5 years ago

onelapahead commented 5 years ago

Output of the info page (if this is a bug)

kubectl exec -it datadog-agent-t982j s6-svstat /var/run/s6/services/agent/
up (pid 335) 7780 seconds

Describe what happened: Our agents are failing to scrape the /actuator/prometheus endpoint of our Spring Boot apps because they receive 406s, caused by an Accept header the endpoint does not support:

2019-07-26 19:41:23 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:294 in work) | Error running check prometheus: [{"message": "406 Client Error:  for url: http://172.23.1.147:8888/actuator/prometheus ", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/base/checks/base.py\", line 503, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/base/checks/prometheus/base_check.py\", line 109, in check\n    ignore_unmapped=True,\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/base/checks/prometheus/mixins.py\", line 398, in process\n    for metric in self.scrape_metrics(endpoint):\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/base/checks/prometheus/mixins.py\", line 362, in scrape_metrics\n    response = self.poll(endpoint)\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/base/checks/prometheus/mixins.py\", line 547, in poll\n    response.raise_for_status()\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/requests/models.py\", line 940, in raise_for_status\n    raise HTTPError(http_error_msg, response=self)\nHTTPError: 406 Client Error:  for url: http://172.23.1.147:8888/actuator/prometheus\n "}]

The reason is that Actuator/Micrometer does not support the protobuf format the check prefers for Prometheus metrics, and instead only supports the text exposition format, i.e. text/plain, for example:

# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="ZHeap",} 2.47463936E8
jvm_memory_used_bytes{area="nonheap",id="CodeHeap 'profiled nmethods'",} 2.880448E7
jvm_memory_used_bytes{area="nonheap",id="Metaspace",} 8.2347872E7
jvm_memory_used_bytes{area="nonheap",id="CodeHeap 'non-nmethods'",} 1304576.0
jvm_memory_used_bytes{area="nonheap",id="CodeHeap 'non-profiled nmethods'",} 7324032.0

From what I can tell, this is because the agent uses the datadog_checks_base Python lib, which does not expose the expected format/Accept header as a check configuration option (even though the underlying code can change it). The check method in there eventually calls poll, which does have a preferred format option that could be set to None.
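Not the check's actual code, just a minimal illustration with requests (against the same pod IP as in the logs above) of why the header matters; the first request mimics what the check sends today:

import requests

# Hypothetical target: one of our Spring Boot pods (same IP as in the logs above).
url = "http://172.23.1.147:8888/actuator/prometheus"

# The Accept header the prometheus check sends today (protobuf exposition format).
protobuf_accept = ("application/vnd.google.protobuf; "
                   "proto=io.prometheus.client.MetricFamily; encoding=delimited")

# Actuator/Micrometer only serves the text format, so this comes back as a 406.
print(requests.get(url, headers={"Accept": protobuf_accept}).status_code)  # -> 406

# Asking for text/plain (or sending no Accept header at all) succeeds.
print(requests.get(url, headers={"Accept": "text/plain"}).status_code)     # -> 200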

Describe what you expected:

The Datadog agent should accept an additional annotation such as ad.datadoghq.com/<container-name>.prometheus-format: <text|protobuf>, which would tell the agent to configure the check to use whichever format was provided. It should default to protobuf for backwards compatibility.
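For concreteness, a sketch of how the proposal might look alongside the existing autodiscovery annotations (the prometheus-format key is hypothetical and does not exist today):

        ad.datadoghq.com/container.check_names: '["prometheus"]'
        ad.datadoghq.com/container.init_configs: '[{}]'
        ad.datadoghq.com/container.instances: '[{"prometheus_url": "http://%%host%%:8888/actuator/prometheus", "namespace": "a-namespace", "metrics": ["*"]}]'
        ad.datadoghq.com/container.prometheus-format: 'text'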

This will require a slight refactoring of the check within datadog_checks_base so that pFormat in poll is actually set by scrape_metrics and so on.

Steps to reproduce the issue:

Run a Spring Boot app on a Kubernetes/OpenShift cluster and enable Prometheus metrics via Micrometer and Actuator.
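(For reference, a minimal sketch of what that enablement usually looks like in a Spring Boot 2.x app; the dependency names and property below are the standard Actuator/Micrometer ones, adjust to your build.)

# Build dependencies (Maven/Gradle): spring-boot-starter-actuator and micrometer-registry-prometheus.
# application.properties: expose the prometheus endpoint over HTTP alongside the defaults.
management.endpoints.web.exposure.include=health,info,prometheus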

Configure the annotations to scrape the /actuator/prometheus endpoint, and check the agent logs to see that it fails with a 406.

If you use a debug pod to hit your Spring Boot pod directly you can reproduce the 406 with the same headers the agent uses:

kubectl run -i -t --image=brix4dayz/swiss-army-knife --restart=Never debug

$ curl -H 'Accept: application/vnd.google.protobuf; proto=io.prometheus.client.MetricFamily; encoding=delimited' http://172.23.1.147:8888/actuator/prometheus -v
* Expire in 0 ms for 6 (transfer 0x56147cfe46c0)
*   Trying 172.23.1.147...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x56147cfe46c0)
* Connected to 172.23.1.147 (172.23.1.147) port 8888 (#0)
> GET /actuator/prometheus HTTP/1.1
> Host: 172.23.1.147:8888
> User-Agent: curl/7.64.0
> Accept: application/vnd.google.protobuf; proto=io.prometheus.client.MetricFamily; encoding=delimited
> 
< HTTP/1.1 406 
< Content-Length: 0
< Date: Fri, 26 Jul 2019 21:11:03 GMT
< 
* Connection #0 to host 172.23.1.147 left intact

Additional environment details (Operating System, Cloud provider, etc):

Spring Boot 2.1.4
OpenShift 3.11.98
docker.io/datadog/agent@sha256:904c18135ec534fc81c29a42537725946b23ba19ac3c0a8b2e942fe39981af20 (v6)

onelapahead commented 5 years ago

Duplicate of https://github.com/DataDog/integrations-core/issues/1144 and solved by https://github.com/DataDog/integrations-core/pull/1976.

My datadog annotations were:

        ad.datadoghq.com/container.check_names: '["prometheus"]'
        ad.datadoghq.com/container.init_configs: '[{}]'
        ad.datadoghq.com/container.instances: '[{"prometheus_url": "http://%%host%%:8888/actuator/prometheus","namespace": "a-namespace","metrics": ["*"],"type_overrides": {}}]'

and this was fixed by changing the check name:

        ad.datadoghq.com/container.check_names: '["openmetrics"]'
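
Putting the two together, the full working set of annotations (assuming, as above, that only the check name changed, since the openmetrics check scrapes the text exposition format) would be:

        ad.datadoghq.com/container.check_names: '["openmetrics"]'
        ad.datadoghq.com/container.init_configs: '[{}]'
        ad.datadoghq.com/container.instances: '[{"prometheus_url": "http://%%host%%:8888/actuator/prometheus","namespace": "a-namespace","metrics": ["*"],"type_overrides": {}}]'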