DataDog / integrations-core

Core integrations of the Datadog Agent
BSD 3-Clause "New" or "Revised" License

Inconsistent Istio Integration #9636

Closed jschwartzy closed 3 years ago

jschwartzy commented 3 years ago

Output of the info page

istio (3.12.0)
    --------------
      Instance ID: istio:4a5c8d4df89a782b [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/istio.yaml
      Total Runs: 113
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 113
      Average Execution Time : 5ms
      Last Execution Date : 2021-07-02 15:48:55 UTC (1625240935000)
      Last Successful Execution Date : Never
      Error: HTTPConnectionPool(host='istio-pilot.istio-system', port=15014): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f34b067c9d0>: Failed to establish a new connection: [Errno -2] Name or service not known'))
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 159, in _new_conn
          conn = connection.create_connection(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/connection.py", line 61, in create_connection
          for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        File "/opt/datadog-agent/embedded/lib/python3.8/socket.py", line 918, in getaddrinfo
          for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
      socket.gaierror: [Errno -2] Name or service not known

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
          httplib_response = self._make_request(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 392, in _make_request
          conn.request(method, url, **httplib_request_kw)
        File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1252, in request
          self._send_request(method, url, body, headers, encode_chunked)
        File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1298, in _send_request
          self.endheaders(body, encode_chunked=encode_chunked)
        File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1247, in endheaders
          self._send_output(message_body, encode_chunked=encode_chunked)
        File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 1007, in _send_output
          self.send(msg)
        File "/opt/datadog-agent/embedded/lib/python3.8/http/client.py", line 947, in send
          self.connect()
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 187, in connect
          conn = self._new_conn()
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 171, in _new_conn
          raise NewConnectionError(
      urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f34b067c9d0>: Failed to establish a new connection: [Errno -2] Name or service not known

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
          resp = conn.urlopen(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
          retries = retries.increment(
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/retry.py", line 446, in increment
          raise MaxRetryError(_pool, url, error or ResponseError(cause))
      urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='istio-pilot.istio-system', port=15014): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f34b067c9d0>: Failed to establish a new connection: [Errno -2] Name or service not known'))

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 999, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/istio/legacy_1_4.py", line 66, in check
          self.process(process_pilot_config)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 533, in process
          for metric in self.scrape_metrics(scraper_config):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 470, in scrape_metrics
          response = self.poll(scraper_config)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 780, in poll
          response = self.send_request(endpoint, scraper_config, headers)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py", line 806, in send_request
          return http_handler.get(endpoint, stream=True, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 304, in get
          return self._request('get', url, options)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 368, in _request
          response = self.make_request_aia_chasing(request_method, method, url, new_options, persist)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 373, in make_request_aia_chasing
          response = request_method(url, **new_options)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 75, in get
          return request('get', url, params=params, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 60, in request
          return session.request(method=method, url=url, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 533, in request
          resp = self.send(prep, **send_kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 646, in send
          r = adapter.send(request, **kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 516, in send
          raise ConnectionError(e, request=request)
      requests.exceptions.ConnectionError: HTTPConnectionPool(host='istio-pilot.istio-system', port=15014): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f34b067c9d0>: Failed to establish a new connection: [Errno -2] Name or service not known'))
      Instance ID: istio:887807b8d0941c15 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/istio.d/auto_conf.yaml
      Total Runs: 114
      Metric Samples: Last Run: 242, Total: 27,588
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 101ms
      Last Execution Date : 2021-07-02 15:49:03 UTC (1625240943000)
      Last Successful Execution Date : 2021-07-02 15:49:03 UTC (1625240943000)
...
=============
Autodiscovery
=============

  Errors
  ======

    istio-blue/istiod-d5986795f-9nsqr
    ---------------------------------
        annotation ad.datadoghq.com/endpoints.check_names is invalid: endpoints doesn't match a container identifier [discovery]
        annotation ad.datadoghq.com/endpoints.init_configs is invalid: endpoints doesn't match a container identifier [discovery]
        annotation ad.datadoghq.com/endpoints.instances is invalid: endpoints doesn't match a container identifier [discovery]

Output of /etc/datadog-agent/conf.d/istio.yaml:

init_config:

instances:
  - galley_endpoint: http://istio-galley.istio-system:15014/metrics
    pilot_endpoint: http://istio-pilot.istio-system:15014/metrics
    citadel_endpoint: http://istio-citadel.istio-system:15014/metrics
    send_histograms_buckets: true

Istio Deployment Info:

> kubectl  describe deploy istiod
Name:                   istiod
Namespace:              istio-blue
CreationTimestamp:      Wed, 29 Jul 2020 20:10:11 +0000
Labels:                 app=istiod
                        istio=pilot
                        release=istio
Annotations:            deployment.kubernetes.io/revision: 1
                        kubectl.kubernetes.io/last-applied-configuration:
                          {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"app":"istiod","istio":"pilot","release":"istio"},"name...
Selector:               istio=pilot
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 1 max surge
Pod Template:
  Labels:           app=istiod
                    istio=pilot
  Annotations:      ad.datadoghq.com/endpoints.check_names: ["istio"]
                    ad.datadoghq.com/endpoints.init_configs: [{}]
                    ad.datadoghq.com/endpoints.instances:
                      [
                        {
                          "istiod_endpoint": "http://%%host%%::8080/metrics",
                          "send_histograms_buckets": true
                        }
                      ]
                    sidecar.istio.io/inject: false
  Service Account:  istio-pilot-service-account
  Containers:
   discovery:
    Image:       [redacted].dkr.ecr.us-west-2.amazonaws.com/docker.io/istio/pilot:1.5.2
    Ports:       8080/TCP, 15010/TCP, 15017/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP
    Args:
      discovery
      --monitoringAddr=:15014
      --log_output_level=default:info
      --domain
      cluster.local
      --secureGrpcAddr=:15011
      --trust-domain=cluster.local
      --keepaliveMaxServerConnectionAge
      30m
      --disable-install-crds=true
    Requests:
      cpu:      500m
      memory:   2Gi
    Readiness:  http-get http://:8080/ready delay=5s timeout=5s period=5s #success=1 #failure=3
    Environment Variables from:
      istiod  ConfigMap  Optional: true
    Environment:
      JWT_POLICY:                                   third-party-jwt
      PILOT_CERT_PROVIDER:                          istiod
      POD_NAME:                                      (v1:metadata.name)
      POD_NAMESPACE:                                 (v1:metadata.namespace)
      SERVICE_ACCOUNT:                               (v1:spec.serviceAccountName)
      PILOT_TRACE_SAMPLING:                         1
      CONFIG_NAMESPACE:                             istio-config
      PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_OUTBOUND:  true
      PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_INBOUND:   false
      INJECTION_WEBHOOK_CONFIG_NAME:                istio-sidecar-injector
      ISTIOD_ADDR:                                  istiod.istio-blue.svc:15012
      PILOT_EXTERNAL_GALLEY:                        false
      CLUSTER_ID:                                   Kubernetes
    Mounts:
      /etc/cacerts from cacerts (ro)
      /etc/istio/config from config-volume (rw)
      /var/lib/istio/inject from inject (ro)
      /var/lib/istio/local from istiod (ro)
      /var/run/secrets/istio-dns from local-certs (rw)
      /var/run/secrets/tokens from istio-token (ro)
  Volumes:
   local-certs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
   istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
   istiod:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istiod
    Optional:  true
   cacerts:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cacerts
    Optional:    true
   inject:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-sidecar-injector
    Optional:  true
   config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio
    Optional:  false
   pilot-envoy-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      pilot-envoy-config
    Optional:  false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  istiod-d5986795f (2/2 replicas created)
NewReplicaSet:   <none>
Events:          <none>

Additional environment details (Operating System, Cloud provider, etc):

Steps to reproduce the issue:

  1. Deploy Istio to a unique namespace
  2. Deploy DataDog Agent via Helm Chart
  3. Run agent status on the DataDog Agent pods

Describe the results you received: As shown above, the Istio integration is looking for Istio in the wrong namespace (istio-system instead of istio-blue).

Describe the results you expected: The integration should look for Istio in the namespace where it is actually deployed (istio-blue).

Additional information you deem important (e.g. issue happens only occasionally): This happens inconsistently. On some of our cluster nodes, the integration works as expected with the same version of Istio and DataDog Agent.

On a node that is working properly, agent status shows the following output:

istio (3.12.0)
    --------------
      Instance ID: istio:4779e6085db738f1 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/istio.yaml
      Total Runs: 1,325
      Metric Samples: Last Run: 668, Total: 885,072
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 1,325
      Average Execution Time : 34ms
      Last Execution Date : 2021-07-02 16:05:22 UTC (1625241922000)
      Last Successful Execution Date : 2021-07-02 16:05:22 UTC (1625241922000)
      metadata:
        version.major: 1
        version.minor: 5
        version.patch: 2
        version.raw: 1.5.2
        version.scheme: semver

      Instance ID: istio:ee6e2a3683a36879 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/istio.d/auto_conf.yaml
      Total Runs: 1,324
      Metric Samples: Last Run: 404, Total: 415,788
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 444ms
      Last Execution Date : 2021-07-02 16:05:22 UTC (1625241922000)
      Last Successful Execution Date : 2021-07-02 16:05:22 UTC (1625241922000)

The output of /etc/datadog-agent/conf.d/istio.yaml:

init_config:

instances:
  - istiod_endpoint: http://istio-pilot.istio-blue:8080/metrics

FlorianVeaux commented 3 years ago

Hi,

  1. /etc/datadog-agent/conf.d/istio.yaml is a manually written file; the agent doesn't generate it. If it contains wrong values, you have to fix them yourself.
  2. You're using autodiscovery on your istio deployment so that the agent monitors it automatically. You should not need both autodiscovery and direct configuration through an /etc/datadog-agent/conf.d/istio.yaml file. If you want to use autodiscovery, though, note the following error in what you've pasted:
    annotation ad.datadoghq.com/endpoints.check_names is invalid: endpoints doesn't match a container identifier [discovery]

Indeed, your istio container is called discovery, so you should make the following replacement:

-  Annotations:      ad.datadoghq.com/endpoints.check_names: ["istio"]
-                    ad.datadoghq.com/endpoints.init_configs: [{}]
-                    ad.datadoghq.com/endpoints.instances:
+  Annotations:      ad.datadoghq.com/discovery.check_names: ["istio"]
+                    ad.datadoghq.com/discovery.init_configs: [{}]
+                    ad.datadoghq.com/discovery.instances:
                       [
                         {
                           "istiod_endpoint": "http://%%host%%::8080/metrics",
  3. You have an extra : in istiod_endpoint; it should read http://%%host%%:8080/metrics.
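
For points 2 and 3, here is a rough sketch of the two checks in Python. It is not part of the agent: `unmatched_identifiers` and `has_doubled_colon` are hypothetical helper names, and the logic only assumes what the error message above says, namely that the identifier in a pod annotation key must match a container name.

```python
AD_PREFIX = "ad.datadoghq.com/"
AD_SUFFIXES = (".check_names", ".init_configs", ".instances")


def unmatched_identifiers(annotations, container_names):
    """Return the autodiscovery annotation keys whose identifier is not a container name.

    Hypothetical helper: mirrors the "doesn't match a container identifier"
    error, not the agent's actual implementation.
    """
    bad = []
    for key in annotations:
        if not key.startswith(AD_PREFIX):
            continue  # not an autodiscovery annotation
        rest = key[len(AD_PREFIX):]
        for suffix in AD_SUFFIXES:
            if rest.endswith(suffix):
                identifier = rest[: -len(suffix)]
                if identifier not in container_names:
                    bad.append(key)
                break
    return bad


def has_doubled_colon(endpoint):
    """Return True if the host:port part of the URL contains '::'.

    Simplification: a bracketed IPv6 literal such as [::1] would also be
    flagged, so this only suits hostname or %%host%% style endpoints.
    """
    authority = endpoint.split("://", 1)[-1].split("/", 1)[0]
    return "::" in authority


pod_annotations = {
    "ad.datadoghq.com/endpoints.check_names": '["istio"]',
    "ad.datadoghq.com/endpoints.init_configs": "[{}]",
    "ad.datadoghq.com/endpoints.instances": '[{"istiod_endpoint": "http://%%host%%::8080/metrics"}]',
    "sidecar.istio.io/inject": "false",
}
print(unmatched_identifiers(pod_annotations, {"discovery"}))
print(has_doubled_colon("http://%%host%%::8080/metrics"))
```

With the annotations renamed to use the `discovery` identifier and the endpoint written as `http://%%host%%:8080/metrics`, both checks come back clean.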

Please open a support ticket if you need assistance with configuring the agent and/or the integration. Closing the issue as it doesn't appear to be a bug.
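
For point 1, a quick way to tell whether a hand-written endpoint in istio.yaml points at a resolvable service is to attempt the same name lookup that fails in the traceback (`socket.getaddrinfo`). This is a minimal sketch; `endpoint_host_resolves` is a hypothetical helper, and the `resolver` parameter exists only so the lookup can be swapped out for testing. Run it from inside the agent pod so the cluster DNS is used.

```python
import socket
from urllib.parse import urlsplit


def endpoint_host_resolves(endpoint, resolver=socket.getaddrinfo):
    """Return True if the endpoint's hostname resolves, False on a DNS failure.

    Hypothetical helper: it only checks name resolution, not that the
    port is open or that /metrics answers.
    """
    parts = urlsplit(endpoint)
    host = parts.hostname
    port = parts.port or 80
    try:
        resolver(host, port)
    except socket.gaierror:
        # Same failure class as "[Errno -2] Name or service not known"
        return False
    return True


if __name__ == "__main__":
    for url in (
        "http://istio-pilot.istio-system:15014/metrics",
        "http://istio-pilot.istio-blue:8080/metrics",
    ):
        print(url, endpoint_host_resolves(url))
```

An endpoint that returns False here names a service or namespace that doesn't exist from the pod's point of view, which is what the failing instance in the report shows for istio-pilot.istio-system.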