DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

kubelet and coredns agent checks not respecting proxy settings #6051

Open LykinsN opened 4 years ago

LykinsN commented 4 years ago

Output of the info page (if this is a bug)

# kubectl exec -it dd-agent-59p4n s6-svstat /var/run/s6/services/agent/ -n monitoring system-probe
Defaulting container name to agent.
Use 'kubectl describe pod/dd-agent-59p4n -n monitoring' to see all of the containers in this pod.
s6-svstat: fatal: unable to read status for /var/run/s6/services/agent/: s6-supervise not running
command terminated with exit code 1

Describe what happened:

I've been working to set up the full suite of Datadog Kubernetes containers in order to enable trace and process monitoring within our cluster. The containers are deployed and running, but when checking the agent logs I'm seeing the errors below repeating every few seconds:

2020-07-24 13:23:15 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:292 in work) | Error running check coredns: [{"message": "403 Client Error: Target service not allowed for url: http://172.30.16.2:9153/metrics", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 822, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/base_check.py\", line 91, in check\n    self.process(scraper_config)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 480, in process\n    for metric in self.scrape_metrics(scraper_config):\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 420, in scrape_metrics\n    response = self.poll(scraper_config)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 690, in poll\n    response.raise_for_status()\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/models.py\", line 940, in raise_for_status\n    raise HTTPError(http_error_msg, response=self)\nrequests.exceptions.HTTPError: 403 Client Error: Target service not allowed for url: http://172.30.16.2:9153/metrics\n"}]

2020-07-24 13:23:27 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:292 in work) | Error running check kubelet: [{"message": "403 Client Error: Target service not allowed for url: http://<private ip address>:10255/metrics/cadvisor", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 822, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py\", line 355, in check\n    self.process(self.cadvisor_scraper_config, metric_transformers=self.transformers)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 480, in process\n    for metric in self.scrape_metrics(scraper_config):\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 420, in scrape_metrics\n    response = self.poll(scraper_config)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 690, in poll\n    response.raise_for_status()\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/models.py\", line 940, in raise_for_status\n    raise HTTPError(http_error_msg, response=self)\nrequests.exceptions.HTTPError: 403 Client Error: Target service not allowed for url: http://<private ip address>:10255/metrics/cadvisor\n"}]

Our cloud infrastructure runs behind a proxy, and the "Target service not allowed" message appears to be a response returned by Sophos, our proxy solution. It seems the Python checks are not inheriting the proxy settings defined for the container. To be clear, I've added every combination of proxy configuration I can think of to each Kubernetes manifest, without any success:

- name: HTTP_PROXY
  value: {{ aws_proxy }}
- name: http_proxy
  value: {{ aws_proxy }}
- name: HTTPS_PROXY
  value: {{ aws_proxy }}
- name: https_proxy
  value: {{ aws_proxy }}
- name: DD_PROXY_HTTP
  value: {{ aws_proxy }}
- name: DD_PROXY_HTTPS
  value: {{ aws_proxy }}
- name: DD_PROXY_NO_PROXY
  value: {{ no_proxy }}
- name: NO_PROXY
  value: {{ no_proxy }}
- name: no_proxy
  value: {{ no_proxy }}

If I add each individual IP address to the no_proxy fields within the container, I can successfully curl the endpoints, but even then the checks above still fail. It seems the Python checks within Datadog are not inheriting the values from DD_PROXY_NO_PROXY, and the subnet CIDR ranges in particular don't appear to be honored. I've manually added each IP address from our cloud subnets to the no_proxy fields, still without success. In any case, that approach isn't practical for the Kubernetes cluster CIDR, since the range spans an extremely large number of addresses.

I don't see any other network- or proxy-related errors in the logs, and all other connectivity native to the Go runtime seems to be working. Only the Python-based checks appear to be affected by this issue.

Describe what you expected:

When running the containers with the above proxy configuration, I'd expect all network checks both within the Kubernetes cluster CIDR and within the allowed cloud subnet to communicate successfully.

Steps to reproduce the issue:

Deploy the Datadog agent containers to a cluster running within a proxied environment, and confirm that the no_proxy settings being declared in the manifests are not sufficient for enabling traffic to flow.

Additional environment details (Operating System, Cloud provider, etc):

Running within AWS, in a proxy-controlled network space. The above behavior was encountered on datadog/agent:7.20.2.

khewonc commented 3 years ago

Hi @LykinsN, an alternative to specifying IP addresses in the DD_PROXY_NO_PROXY or NO_PROXY env vars is to set skip_proxy: true in the agent check's configuration, as in this example: https://github.com/DataDog/integrations-core/blob/5d992e4cabad3f5141ebe31b8c778a0aaf459e79/kyototycoon/datadog_checks/kyototycoon/data/conf.yaml.example#L28.
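
For the kubelet check specifically, a minimal sketch of such an override might look like the following. The file name and mount location are assumptions for illustration (the containerized agent typically reads check configs mounted under its conf.d directory), not something confirmed in this issue:

# conf.d/kubelet.d/conf.yaml, mounted into the agent container (path assumed)
init_config:

instances:
  # Bypass the configured proxy for this check's HTTP requests
  - skip_proxy: true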

If you are still running into issues configuring the proxy settings for those checks, please open a ticket with the support team: support@datadoghq.com.

meyerbro commented 1 year ago

This still seems to be happening.

krish7919 commented 1 year ago

I was blocked on this too; here are a couple of things that might unblock you:

  1. The DD_PROXY_NO_PROXY variable can be replaced with DD_NO_PROXY. This was suggested by support; however, I have not tried it.
  2. Each Datadog integration has a skip_proxy flag that can be set in its YAML config file to ignore the configured proxy settings. I ended up using this for almost all of my integrations and it has worked for me (see the sketch below).
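
For checks scheduled through autodiscovery, such as coredns, the same flag can be passed in the instance template. The snippet below is only a sketch: the annotation keys follow Datadog's autodiscovery convention, while the container name and metrics URL are assumptions based on the errors earlier in this issue.

# Pod template annotations on the coredns deployment (sketch; adjust to your setup)
annotations:
  ad.datadoghq.com/coredns.check_names: '["coredns"]'
  ad.datadoghq.com/coredns.init_configs: '[{}]'
  ad.datadoghq.com/coredns.instances: '[{"prometheus_url": "http://%%host%%:9153/metrics", "skip_proxy": true}]'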