DataDog / integrations-core

Core integrations of the Datadog Agent
BSD 3-Clause "New" or "Revised" License

[BUG] datadog_checks http RequestsWrapper looks for intermediate certificates only on port 443 #14734

Closed ziz closed 1 year ago

ziz commented 1 year ago

Output of the info page

`sudo datadog-agent status` output ```text Getting the status from the agent. =============== Agent (v7.45.0) =============== Status date: 2023-06-11 19:15:26.58 UTC (1686510926580) Agent start: 2023-06-11 19:08:45.735 UTC (1686510525735) Pid: 4671 Go Version: go1.19.9 Python Version: 3.8.16 Build arch: amd64 Agent flavor: agent Check Runners: 4 Log Level: info Paths ===== Config File: /etc/datadog-agent/datadog.yaml conf.d: /etc/datadog-agent/conf.d checks.d: /etc/datadog-agent/checks.d Clocks ====== NTP offset: 52.850731s System time: 2023-06-11 19:15:26.58 UTC (1686510926580) Host Info ========= bootTime: 2021-10-19 16:39:37 UTC (1634661577000) hostId: 4a1fa099-0726-4010-92eb-a6c5169d705f kernelArch: x86_64 kernelVersion: 3.10.0-1160.11.1.el7.x86_64 os: linux platform: centos platformFamily: rhel platformVersion: 7.9.2009 procs: 460 uptime: 14402h29m9s Hostnames ========= ec2-hostname: ip-10-100-10-17.us-east-2.compute.internal host_aliases: [i-XXXXXXXXXXXXXXXXX] hostname: server1 instance-id: i-XXXXXXXXXXXXXXXXX socket-fqdn: server1.local. 
socket-hostname: server1 hostname provider: os unused hostname providers: 'hostname' configuration/environment: hostname is empty 'hostname_file' configuration/environment: 'hostname_file' configuration is not enabled aws: not retrieving hostname from AWS: the host is not an ECS instance and other providers already retrieve non-default hostnames azure: azure_hostname_style is set to 'os' container: the agent is not containerized fargate: agent is not runnning on Fargate fqdn: 'hostname_fqdn' configuration is not enabled gce: unable to retrieve hostname from GCE: GCE metadata API error: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname Metadata ======== agent_version: 7.45.0 cloud_provider: AWS config_apm_dd_url: config_dd_url: config_logs_dd_url: config_logs_socks5_proxy_address: config_no_proxy: [169.254.169.254 100.100.100.200] config_process_dd_url: config_proxy_http: config_proxy_https: config_site: feature_apm_enabled: true feature_cspm_enabled: false feature_cws_enabled: false feature_dynamic_instrumentation_enabled: false feature_enable_http_stats_by_status_code: false feature_logs_enabled: true feature_networks_enabled: false feature_networks_http_enabled: false feature_networks_https_enabled: false feature_otlp_enabled: false feature_process_enabled: true feature_processes_container_enabled: true feature_remote_configuration_enabled: false feature_usm_go_tls_enabled: false feature_usm_http2_enabled: false feature_usm_java_tls_enabled: false feature_usm_kafka_enabled: false flavor: agent hostname_source: os install_method_installer_version: datadog_formula-3.5 install_method_tool: saltstack install_method_tool_version: saltstack-3005 logs_transport: HTTP ========= Collector ========= Running Checks ============== consul (2.2.0) -------------- Instance ID: consul:dd70a33f647dcf20 [OK] Configuration Source: file:/etc/datadog-agent/conf.d/consul.d/conf.yaml Total Runs: 27 Metric Samples: Last Run: 1, Total: 27 Events: 
Last Run: 0, Total: 0 Service Checks: Last Run: 2, Total: 56 Average Execution Time : 11ms Last Execution Date : 2023-06-11 19:15:17 UTC (1686510917000) Last Successful Execution Date : 2023-06-11 19:15:17 UTC (1686510917000) metadata: version.major: 1 version.minor: 9 version.patch: 1 version.raw: 1.9.1 version.scheme: semver cpu --- Instance ID: cpu [OK] Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default Total Runs: 27 Metric Samples: Last Run: 9, Total: 236 Events: Last Run: 0, Total: 0 Service Checks: Last Run: 0, Total: 0 Average Execution Time : 0s Last Execution Date : 2023-06-11 19:15:24 UTC (1686510924000) Last Successful Execution Date : 2023-06-11 19:15:24 UTC (1686510924000) disk (4.9.0) ------------ Instance ID: disk:67cc0574430a16ba [OK] Configuration Source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml.default Total Runs: 26 Metric Samples: Last Run: 276, Total: 7,176 Events: Last Run: 0, Total: 0 Service Checks: Last Run: 0, Total: 0 Average Execution Time : 30ms Last Execution Date : 2023-06-11 19:15:16 UTC (1686510916000) Last Successful Execution Date : 2023-06-11 19:15:16 UTC (1686510916000) file_handle ----------- Instance ID: file_handle [OK] Configuration Source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default Total Runs: 27 Metric Samples: Last Run: 5, Total: 135 Events: Last Run: 0, Total: 0 Service Checks: Last Run: 0, Total: 0 Average Execution Time : 0s Last Execution Date : 2023-06-11 19:15:23 UTC (1686510923000) Last Successful Execution Date : 2023-06-11 19:15:23 UTC (1686510923000) io -- Instance ID: io [OK] Configuration Source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default Total Runs: 26 Metric Samples: Last Run: 197, Total: 4,987 Events: Last Run: 0, Total: 0 Service Checks: Last Run: 0, Total: 0 Average Execution Time : 0s Last Execution Date : 2023-06-11 19:15:15 UTC (1686510915000) Last Successful Execution Date : 2023-06-11 19:15:15 UTC (1686510915000) load ---- Instance ID: 
load [OK] Configuration Source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default Total Runs: 27 Metric Samples: Last Run: 6, Total: 162 Events: Last Run: 0, Total: 0 Service Checks: Last Run: 0, Total: 0 Average Execution Time : 0s Last Execution Date : 2023-06-11 19:15:22 UTC (1686510922000) Last Successful Execution Date : 2023-06-11 19:15:22 UTC (1686510922000) memory ------ Instance ID: memory [OK] Configuration Source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default Total Runs: 26 Metric Samples: Last Run: 20, Total: 520 Events: Last Run: 0, Total: 0 Service Checks: Last Run: 0, Total: 0 Average Execution Time : 0s Last Execution Date : 2023-06-11 19:15:14 UTC (1686510914000) Last Successful Execution Date : 2023-06-11 19:15:14 UTC (1686510914000) network (2.9.4) --------------- Instance ID: network:4b0649b7e11f0772 [OK] Configuration Source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default Total Runs: 27 Metric Samples: Last Run: 81, Total: 2,187 Events: Last Run: 0, Total: 0 Service Checks: Last Run: 0, Total: 0 Average Execution Time : 2ms Last Execution Date : 2023-06-11 19:15:21 UTC (1686510921000) Last Successful Execution Date : 2023-06-11 19:15:21 UTC (1686510921000) ntp --- Instance ID: ntp:3c427a42a70bbf8 [OK] Configuration Source: file:/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default Total Runs: 1 Metric Samples: Last Run: 1, Total: 1 Events: Last Run: 0, Total: 0 Service Checks: Last Run: 1, Total: 1 Average Execution Time : 0s Last Execution Date : 2023-06-11 19:08:47 UTC (1686510527000) Last Successful Execution Date : 2023-06-11 19:08:47 UTC (1686510527000) uptime ------ Instance ID: uptime [OK] Configuration Source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default Total Runs: 26 Metric Samples: Last Run: 1, Total: 26 Events: Last Run: 0, Total: 0 Service Checks: Last Run: 0, Total: 0 Average Execution Time : 0s Last Execution Date : 2023-06-11 19:15:13 UTC (1686510913000) Last Successful Execution Date : 
2023-06-11 19:15:13 UTC (1686510913000) ======== JMXFetch ======== Information ================== Initialized checks ================== no checks Failed checks ============= no checks ========= Forwarder ========= Transactions ============ Cluster: 0 ClusterRole: 0 ClusterRoleBinding: 0 CronJob: 0 CustomResource: 0 CustomResourceDefinition: 0 DaemonSet: 0 Deployment: 0 Dropped: 0 HighPriorityQueueFull: 0 Ingress: 0 Job: 0 Namespace: 0 Node: 0 OrchestratorManifest: 0 PersistentVolume: 0 PersistentVolumeClaim: 0 Pod: 0 ReplicaSet: 0 Requeued: 0 Retried: 0 RetryQueueSize: 0 Role: 0 RoleBinding: 0 Service: 0 ServiceAccount: 0 StatefulSet: 0 VerticalPodAutoscaler: 0 Transaction Successes ===================== Total number: 56 Successes By Endpoint: check_run_v1: 26 intake: 3 metadata_v1: 1 series_v2: 26 On-disk storage =============== On-disk storage is disabled. Configure `forwarder_storage_max_size_in_bytes` to enable it. API Keys status =============== API key ending with d7289: API Key valid ========== Endpoints ========== https://app.datadoghq.com - API Key ending with: - d7289 ========== Logs Agent ========== Reliable: Sending compressed logs in HTTPS to agent-http-intake.logs.datadoghq.com on port 443 BytesSent: 357 EncodedBytesSent: 282 LogsProcessed: 1 LogsSent: 1 CoreAgentProcessOpenFiles: 24 OSFileLimit: 4096 consul ------ - Type: file Path: /var/log/consul/*.log Service: consul Source: consul Status: OK 1 files tailed out of 1 files matching Inputs: /var/log/consul/consul-1686503969895345311.log Bytes Read: 160 Pipeline Latency: Average Latency (ms): 0 24h Average Latency (ms): 0 Peak Latency (ms): 0 24h Peak Latency (ms): 0 ============= Process Agent ============= Version: 7.45.0 Status date: 2023-06-11 19:15:26.893 UTC (1686510926893) Process Agent Start: 2023-06-11 19:08:45.856 UTC (1686510525856) Pid: 4673 Go Version: go1.19.9 Build arch: amd64 Log Level: info Enabled Checks: [process rtprocess] Allocated Memory: 20,552,296 bytes Hostname: server1 
System Probe Process Module Status: Not running ================= Process Endpoints ================= https://process.datadoghq.com - API Key ending with: - d7289 ========= Collector ========= Last collection time: 2023-06-11 19:15:17 Docker socket: Number of processes: 325 Number of containers: 0 Process Queue length: 0 RTProcess Queue length: 0 Connections Queue length: 0 Event Queue length: 0 Pod Queue length: 0 Process Bytes enqueued: 0 RTProcess Bytes enqueued: 0 Connections Bytes enqueued: 0 Event Bytes enqueued: 0 Pod Bytes enqueued: 0 Drop Check Payloads: [] ========= APM Agent ========= Status: Running Pid: 4674 Uptime: 401 seconds Mem alloc: 9,314,488 bytes Hostname: server1 Receiver: localhost:8126 Endpoints: https://trace.agent.datadoghq.com Receiver (previous minute) ========================== No traces received in the previous minute. Writer (previous minute) ======================== Traces: 0 payloads, 0 traces, 0 events, 0 bytes Stats: 0 payloads, 0 stats buckets, 0 bytes ========== Aggregator ========== Checks Metric Sample: 15,963 Dogstatsd Metric Sample: 4,253 Event: 1 Events Flushed: 1 Number Of Flushes: 26 Series Flushed: 14,020 Service Check: 297 Service Checks Flushed: 315 ========= DogStatsD ========= Event Packets: 0 Event Parse Errors: 0 Metric Packets: 4,252 Metric Parse Errors: 0 Service Check Packets: 0 Service Check Parse Errors: 0 Udp Bytes: 376,444 Udp Packet Reading Errors: 0 Udp Packets: 2,470 Uds Bytes: 0 Uds Origin Detection Errors: 0 Uds Packet Reading Errors: 0 Uds Packets: 0 Unterminated Metric Errors: 0 ==== OTLP ==== Status: Not enabled Collector status: Not running ```

Additional environment details (Operating System, Cloud provider, etc): This is reproduced on a CentOS 7 box in AWS with a Consul cluster. The actual behavior described is likely not specific to any of those details.

Steps to reproduce the issue:

  1. Point a check at an HTTPS server that (a) is not on port 443 and (b) does not have an SSL certificate that immediately validates when using system-default SSL CAs
    • In this case, the check is the consul check, which connects over HTTPS to port 8501, where Consul uses its internal CA and automatic certificate distribution
  2. POSSIBLY OPTIONAL: specify a (not useful) client certificate
    • In this case, the client certificate is set in conf.d/consul.d/conf.yaml, with tls_cert pointing at the consul service mesh CA certificate (this was a configuration error, but it was not itself the cause of the issue; see the additional information below)
  3. Restart the agent and observe logs

Describe the results you received:

The datadog agent log contains the error:

2023-06-11 18:21:08 UTC | CORE | ERROR | (pkg/collector/python/datadog_agent.go:123 in LogMessage) | consul:7cfafeeca0bbeaaa | (http.py:464) | Error occurred while connecting to socket to discover intermediate certificates: [Errno 111] Connection refused

which, upon investigation, appears to be because fetch_intermediate_certs always connects to port 443:

7.45.x currently at 044247efccff3bcdf0ae19b5481879c151f87814 https://github.com/DataDog/integrations-core/blob/044247efccff3bcdf0ae19b5481879c151f87814/datadog_checks_base/datadog_checks/base/utils/http.py#L457-L466

In a situation where fetching intermediate certificates would have been effective but the server in question is on a nonstandard HTTPS port, this would fail.

Describe the results you expected:

No `Error occurred while connecting to socket` message should show up in the logs when intermediate certificate discovery is not relevant to the situation.

In this situation, it looks to me like fetch_intermediate_certs should support an optional port, and at least its immediate use in make_request_aia_chasing in the same file:

7.45.x currently at 044247efccff3bcdf0ae19b5481879c151f87814 https://github.com/DataDog/integrations-core/blob/044247efccff3bcdf0ae19b5481879c151f87814/datadog_checks_base/datadog_checks/base/utils/http.py#L423-L430

should pass in the port for the URL in question.
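A minimal sketch of that suggestion, assuming the port can be read from the request URL and should fall back to 443 only when the URL does not specify one. `host_and_port` and `DEFAULT_HTTPS_PORT` are hypothetical names used for illustration; this is not the code from the linked file, nor necessarily the fix that was eventually merged.

```python
from urllib.parse import urlparse

# Assumption: 443 is only a fallback for URLs that carry no explicit port.
DEFAULT_HTTPS_PORT = 443


def host_and_port(url):
    """Return (hostname, port) for a URL, defaulting the port to 443."""
    parsed = urlparse(url)
    port = parsed.port if parsed.port is not None else DEFAULT_HTTPS_PORT
    return parsed.hostname, port


# The consul check in this report talks to port 8501, not 443.
print(host_and_port("https://localhost:8501/v1/agent/self"))  # ('localhost', 8501)
print(host_and_port("https://app.datadoghq.com"))  # ('app.datadoghq.com', 443)
```

A pair like this could then be handed to `socket.create_connection((host, port))` so that the certificate discovery connection targets the same port as the original request.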

Additional information you deem important (e.g. issue happens only occasionally):

This bug was discovered due to a config file typo; we had intended to specify `tls_ca_cert` but had specified `tls_cert` instead. This typo is only relevant inasmuch as it revealed the bug; once we changed the config file to specify `tls_ca_cert`, everything functioned as expected. However, the typo made our configuration issue more confusing to track down, since the errors were

ssl.SSLError: [SSL] PEM lib (_ssl.c:4067)

```text
2023-06-11 18:21:08 UTC | CORE | ERROR | (pkg/collector/python/datadog_agent.go:123 in LogMessage) | consul:7cfafeeca0bbeaaa | (consul.py:154) | Consul request to https://localhost:8501/v1/agent/self failed
Traceback (most recent call last):
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 414, in connect
    self.sock = ssl_wrap_socket(
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/ssl_.py", line 418, in ssl_wrap_socket
    context.load_cert_chain(certfile, keyfile)
ssl.SSLError: [SSL] PEM lib (_ssl.c:4067)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='localhost', port=8501): Max retries exceeded with url: /v1/agent/self (Caused by SSLError(SSLError(9, '[SSL] PEM lib (_ssl.c:4067)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/consul/consul.py", line 141, in consul_request
    resp = self.http.get(url)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 355, in get
    return self._request('get', url, options)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 419, in _request
    response = self.make_request_aia_chasing(request_method, method, url, new_options, persist)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 432, in make_request_aia_chasing
    raise e
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 425, in make_request_aia_chasing
    response = request_method(url, **new_options)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 563, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='localhost', port=8501): Max retries exceeded with url: /v1/agent/self (Caused by SSLError(SSLError(9, '[SSL] PEM lib (_ssl.c:4067)')))
```

as opposed to the much more transparent error message if tls_cert is not specified:

ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)

```text
2023-06-11 19:26:37 UTC | CORE | ERROR | (pkg/collector/python/datadog_agent.go:130 in LogMessage) | consul:c4992031b651f3c8 | (http.py:464) | Error occurred while connecting to socket to discover intermediate certificates: [Errno 111] Connection refused
2023-06-11 19:26:37 UTC | CORE | ERROR | (pkg/collector/python/datadog_agent.go:130 in LogMessage) | consul:c4992031b651f3c8 | (consul.py:154) | Consul request to https://127.0.0.1:8501/v1/agent/self failed
Traceback (most recent call last):
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 403, in _make_request
    self._validate_conn(conn)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1053, in _validate_conn
    conn.connect()
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/ssl_.py", line 453, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/ssl_.py", line 495, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock)
  File "/opt/datadog-agent/embedded/lib/python3.8/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/opt/datadog-agent/embedded/lib/python3.8/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/opt/datadog-agent/embedded/lib/python3.8/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/connectionpool.py", line 798, in urlopen
    retries = retries.increment(
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='127.0.0.1', port=8501): Max retries exceeded with url: /v1/agent/self (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/consul/consul.py", line 141, in consul_request
    resp = self.http.get(url)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 355, in get
    return self._request('get', url, options)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 419, in _request
    response = self.make_request_aia_chasing(request_method, method, url, new_options, persist)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 432, in make_request_aia_chasing
    raise e
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/utils/http.py", line 425, in make_request_aia_chasing
    response = request_method(url, **new_options)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/requests/adapters.py", line 563, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='127.0.0.1', port=8501): Max retries exceeded with url: /v1/agent/self (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))
```
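To illustrate why the two option names produce such different errors, here is a rough sketch of how options like these might map onto `requests` keyword arguments. In `requests`, `verify` names a CA bundle used to validate the server, while `cert` names a client certificate that gets loaded via `load_cert_chain` (the call that fails in the first traceback). `requests_tls_kwargs` is a hypothetical helper, and the mapping and file paths are assumptions for illustration, not the actual RequestsWrapper logic.

```python
def requests_tls_kwargs(instance):
    """Hypothetical mapping from check options to `requests` keyword arguments."""
    kwargs = {}

    # A CA certificate validates the *server*; requests takes it as `verify`.
    ca_cert = instance.get("tls_ca_cert")
    if ca_cert:
        kwargs["verify"] = ca_cert

    # A client certificate identifies the *agent*; requests takes it as `cert`,
    # either as a single path or as a (certificate, private key) pair.
    client_cert = instance.get("tls_cert")
    if client_cert:
        private_key = instance.get("tls_private_key")
        kwargs["cert"] = (client_cert, private_key) if private_key else client_cert

    return kwargs


# Intended configuration: the CA certificate is used for server verification.
print(requests_tls_kwargs({"tls_ca_cert": "/etc/consul/ca.pem"}))

# The typo from this report: the same file ends up treated as a client certificate.
print(requests_tls_kwargs({"tls_cert": "/etc/consul/ca.pem"}))
```

With the typo, the CA file would be passed to the client certificate loader, which presumably explains the `PEM lib` error, since a CA certificate file normally contains no private key.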
yzhan289 commented 1 year ago

Hey @ziz, thanks for bringing this up and for writing such a detailed bug report. I am working on a fix for this now.

alopezz commented 1 year ago

Closing this as it's supposed to be solved by #14817.