pdecat opened 5 years ago
The anticipated work-around with `hostname_fqdn: true` does not work, as it relies on `hostname -f`, which in kubernetes containers returns the pod name, which is even less stable, e.g.:

```
root@dd-agent-ggsrq:/opt/datadog-agent# /bin/hostname -f
dd-agent-ggsrq
```

https://github.com/DataDog/datadog-agent/blame/6.15.0/pkg/util/hostname.go#L159-L161
https://github.com/DataDog/datadog-agent/blob/6.15.0/pkg/util/hostname_nix.go#L21
When the GCE metadata API call times out, the short (non-FQDN) hostname comes from the kubernetes apiserver hostname provider:
https://github.com/DataDog/datadog-agent/blame/6.15.0/pkg/util/hostname.go#L173
https://github.com/DataDog/datadog-agent/blame/6.15.0/pkg/util/hostname/kube_apiserver.go#L18
https://github.com/DataDog/datadog-agent/blame/6.15.0/pkg/util/kubernetes/apiserver/hostname.go#L33
To get stable hostnames, maybe the agent could trim the hostname retrieved from GCE metadata when `hostname_fqdn` is `false`.
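A minimal sketch of what that trimming could look like, in Go (the helper name and signature are mine, not the agent's actual code):

```go
// Hypothetical sketch, not actual agent code: trim an FQDN returned by
// the GCE metadata API down to its short form when hostname_fqdn is false.
package main

import (
	"fmt"
	"strings"
)

// trimToShortHostname returns everything before the first dot, so
// "gke-node-1.c.myproject.internal" becomes "gke-node-1".
func trimToShortHostname(fqdn string, hostnameFQDN bool) string {
	if hostnameFQDN {
		return fqdn
	}
	if i := strings.IndexByte(fqdn, '.'); i > 0 {
		return fqdn[:i]
	}
	return fqdn
}

func main() {
	fmt.Println(trimToShortHostname("gke-myproject-europe-mycluster-pool-b-ca79cf07-skfd.c.myproject.internal", false))
}
```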
A work-around is to put the node name into the `DD_HOSTNAME` environment variable and not rely on the agent's hostname auto-detection:
YAML:

```yaml
env:
  - name: DD_HOSTNAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
```
Terraform:

```hcl
env {
  name = "DD_HOSTNAME"

  value_from {
    field_ref {
      field_path = "spec.nodeName"
    }
  }
}
```
Note: on GKE, this produces short, non-FQDN hostnames for all agents, so no-data alerts are triggered, but this should be the last such occurrence as the hostnames are now stable.
This short timeout also causes issues when collecting instance attributes with the GKE Metadata Server (needed for GKE Workload Identity):
```
root@dd-agent-6rh7z:/# agent check kubernetes_apiserver -l trace
[...]
2020-01-30 17:50:29 UTC | CORE | DEBUG | (pkg/metadata/host/host_tags.go:82 in getHostTags) | No GCE host tags Get http://169.254.169.254/computeMetadata/v1/?recursive=true: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
[...]
=========
Collector
=========

Running Checks
==============

  kubernetes_apiserver
  --------------------
    Instance ID: kubernetes_apiserver [OK]
    Total Runs: 1
    Metric Samples: Last Run: 0, Total: 0
    Events: Last Run: 0, Total: 0
    Service Checks: Last Run: 0, Total: 0
    Average Execution Time : 601ms

Check has run only once, if some metrics are missing you can try again with --check-rate to see any other metric if available
```
Invoking the same endpoint with curl takes more than 300ms from time to time:
```
root@dd-agent-6rh7z:/# time curl -i -s 'http://169.254.169.254/computeMetadata/v1/?recursive=true' -H 'Metadata-Flavor: Google'
HTTP/1.1 200 OK
Content-Type: application/json
Metadata-Flavor: Google
Server: GKE Metadata Server
Date: Thu, 30 Jan 2020 18:17:45 GMT
Content-Length: 744

{"instance":{"attributes":{"clusterLocation":"europe-west1","clusterName":"myproject-preprod-europe-west1-gke1","clusterUid":"123456789"},"hostname":"gke-myproject-preprod-eur-gke1-pool-b-e9fd4057-lbwc.c.myproject-preprod.internal","id":123456789,"serviceAccounts":{"default":{"aliases":["default"],"email":"myproject-preprod.svc.id.goog","scopes":"https://www.googleapis.com/auth/cloud-platform"},"myprojectPreprod.svc.id.goog":{"aliases":["default"],"email":"myproject-preprod.svc.id.goog","scopes":"https://www.googleapis.com/auth/cloud-platform"}},"zone":"projects/123456789/zones/europe-west1-d"},"project":{"numericProjectId":123456789,"projectId":"myproject-preprod"}}

real	0m0.345s
user	0m0.005s
sys	0m0.001s
```
A second error, shown right above in the logs, reveals another incompatibility with the GKE Metadata Server:

```
2020-01-30 17:50:28 UTC | CORE | DEBUG | (pkg/metadata/host/host.go:104 in getHostAliases) | no GCE Host Alias: unable to retrieve instance name from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/name
```

```
root@dd-agent-6rh7z:/# curl -i -s 'http://169.254.169.254/computeMetadata/v1/instance/name' -H 'Metadata-Flavor: Google'
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Thu, 30 Jan 2020 17:51:42 GMT
Content-Length: 52

GKE Metadata Server encountered an error: Not Found
```
Running the first curl command directly from the GCE host returns `cluster-name` instead of `clusterName`:
```
patrick@gke-myproject-preprod-eur-gke1-pool-b-9f645bf0-gvc8 ~ $ time curl -i -s 'http://169.254.169.254/computeMetadata/v1/?recursive=true' -H 'Metadata-Flavor: Google'
HTTP/1.1 200 OK
Metadata-Flavor: Google
Content-Type: application/json
ETag: 123456789
Date: Thu, 30 Jan 2020 18:17:28 GMT
Server: Metadata Server for VM
Content-Length: 42245
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN

{"instance":{"attributes":{"cluster-location":"europe-west1","cluster-name":"myproject-preprod-europe-west1-gke1","cluster-uid":"123456789","configure-sh":
[...]
```
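Given that the two metadata servers disagree on attribute key spelling, one defensive option is for a consumer to accept both variants. A small Go sketch under that assumption (the function and struct are hypothetical, not the agent's actual parsing code):

```go
// Sketch: read the cluster name from a recursive metadata response,
// accepting both the GCE VM spelling ("cluster-name") and the GKE
// Metadata Server spelling ("clusterName").
package main

import (
	"encoding/json"
	"fmt"
)

func clusterNameFromMetadata(body []byte) (string, error) {
	var doc struct {
		Instance struct {
			Attributes map[string]string `json:"attributes"`
		} `json:"instance"`
	}
	if err := json.Unmarshal(body, &doc); err != nil {
		return "", err
	}
	for _, key := range []string{"cluster-name", "clusterName"} {
		if v, ok := doc.Instance.Attributes[key]; ok {
			return v, nil
		}
	}
	return "", fmt.Errorf("no cluster name attribute found")
}

func main() {
	// Response shape as returned by the GKE Metadata Server above.
	gke := []byte(`{"instance":{"attributes":{"clusterName":"myproject-preprod-europe-west1-gke1"}}}`)
	fmt.Println(clusterNameFromMetadata(gke))
}
```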
Finally, when GKE Metadata Concealment is enabled, the recursive call is simply forbidden:
```
root@dd-agent-2scln:/# time curl -i -s 'http://169.254.169.254/computeMetadata/v1/?recursive=true' -H 'Metadata-Flavor: Google'
HTTP/1.1 403 Forbidden
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Thu, 30 Jan 2020 18:27:21 GMT
Content-Length: 58

This metadata endpoint is concealed for ?recursive calls.

real	0m0.045s
user	0m0.004s
sys	0m0.005s
```
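A possible mitigation is to fall back to individual, non-recursive lookups when the recursive call is rejected, assuming single keys such as instance/hostname remain reachable under concealment. A hedged Go sketch (helper names are mine):

```go
// Sketch: fall back to individual (non-recursive) metadata lookups when
// GKE Metadata Concealment returns 403 for ?recursive=true. Assumes
// single keys such as instance/hostname are not concealed.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

const metadataBase = "http://169.254.169.254/computeMetadata/v1/"

// getMetadata performs a single metadata query with the mandatory
// Metadata-Flavor header and returns the body on HTTP 200.
func getMetadata(client *http.Client, path string) (string, error) {
	req, err := http.NewRequest("GET", metadataBase+path, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Metadata-Flavor", "Google")
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("status code %d trying to GET %s", resp.StatusCode, metadataBase+path)
	}
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	// Try the recursive call first; on failure (e.g. 403 when concealed),
	// query the keys we actually need one by one.
	if _, err := getMetadata(client, "?recursive=true"); err != nil {
		fmt.Println(getMetadata(client, "instance/hostname"))
	}
}
```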
Having the same issue with hostnames changing between the short and long variants.
Also having the issue with the GKE Metadata Server returning renamed metadata keys when a recursive query is used (`containerName` vs `container-name`). I have a bug filed with Google regarding this.
> Note: on GKE, this produces short, non-FQDN hostnames for all agents, so no-data alerts are triggered, but this should be the last such occurrence as the hostnames are now stable.

:warning: Having short, non-FQDN hostnames for agents seems to affect Datadog billing, as these differ from the ones reported by the GCE integration.
> Having the same issue with hostnames changing between the short and long variants.
> Also having the issue with the GKE Metadata Server returning renamed metadata keys when a recursive query is used (`containerName` vs `container-name`). I have a bug filed with Google regarding this.

The bug should be fixed in an upcoming GKE release.
> Note: on GKE, this produces short, non-FQDN hostnames for all agents, so no-data alerts are triggered, but this should be the last such occurrence as the hostnames are now stable.
> ⚠️ Having short, non-FQDN hostnames for agents seems to affect Datadog billing, as these differ from the ones reported by the GCE integration.

I have a ticket open with Datadog about this.
Opened one on the Datadog side too today.
~What's the expected resolution on the GKE side? Lower response times?~
Edit: re-read your message, nevermind.
> The anticipated work-around with `hostname_fqdn: true` does not work, as it relies on `hostname -f`, which in kubernetes containers returns the pod name, which is even less stable, e.g.:
> ```
> root@dd-agent-ggsrq:/opt/datadog-agent# /bin/hostname -f
> dd-agent-ggsrq
> ```
> https://github.com/DataDog/datadog-agent/blame/6.15.0/pkg/util/hostname.go#L159-L161 https://github.com/DataDog/datadog-agent/blob/6.15.0/pkg/util/hostname_nix.go#L21

FTR, there was a recent change to actually prevent using `hostname_fqdn: true` in containers when they are not using the host network: https://github.com/DataDog/datadog-agent/pull/4503
Containers using the host network do actually share the host's name:

```
# kubectl --context gke_myproject_europe-west1_myproject-europe-west1-mycluster exec -ti -n test testpod-hostnetwork-bqn79 -- hostname -f
gke-myproject-europe-mycluster-pool-b-f358f407-9l2w.c.myproject.internal
```
FWIW, SignalFx just made their previously hard-coded 1s metadata HTTP request timeout configurable, defaulting to 2s for AWS, Azure and GCP: https://github.com/signalfx/signalfx-agent/pull/1296
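For illustration, a sketch of what an analogous change could look like in Go; the GCE_METADATA_TIMEOUT environment variable is hypothetical, not an existing agent setting:

```go
// Sketch of a SignalFx-style fix: make the metadata HTTP timeout
// configurable instead of hard-coding 300ms. GCE_METADATA_TIMEOUT is a
// hypothetical variable, not an existing agent setting.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func metadataClient() *http.Client {
	timeout := 2 * time.Second // SignalFx's new default for AWS, Azure and GCP
	if v := os.Getenv("GCE_METADATA_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			timeout = d
		}
	}
	return &http.Client{Timeout: timeout}
}

func main() {
	fmt.Println("metadata timeout:", metadataClient().Timeout)
}
```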
I'm curious how others are handling their GKE node names with the datadog agent. Are people using the downward API to set `DD_HOSTNAME` from `nodeName`? I find that by default the node name (the short name of the host) is not an alias, as DD concatenates the `DD_ENV` to the node name. This makes associating metrics/logs/traces that emit a node name difficult. I've configured the agent to use the downward API, which required a fork of the helm chart. Doing this, though, loses the metadata call to get the FQDN. Seems janky. Are we losing anything by not having the FQDN sent to DD for GKE instances?
**Output of the info page (if this is a bug)**

Note the

```
gce: unable to retrieve hostname from GCE: Get http://169.254.169.254/computeMetadata/v1/instance/hostname: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
```

message.

**Describe what happened:**
During replacement of datadog agent pods on 12 GKE nodes, 2 of them had their name changed:

- `gke-myproject-europe-mycluster-pool-b-ca79cf07-skfd.c.myproject.internal` => `gke-myproject-europe-mycluster-pool-b-ca79cf07-skfd`
- `gke-myproject-europe-mycluster-pool-b-119bb72d-69pc` => `gke-myproject-europe-mycluster-pool-b-119bb72d-69pc.c.myproject.internal`
This caused `No data` alerts for the old hostnames.

That's not the first time we've faced hostname changes on datadog agent replacement for upgrades or configuration changes; it's just that we never investigated the issue further.

On the agent status for the first host, which went from FQDN to short name, the call to the http://169.254.169.254/computeMetadata/v1/instance/hostname endpoint is reported to have failed because of a timeout.

On the other host, no error is reported, so it must have failed on the previous datadog agent initialization and went back to OK this time:
**Describe what you expected:**

The datadog agent should keep stable hostnames on pod replacement.
I can see at least two options to fix this:

- increase the timeout on the GCE metadata API call, or make it configurable (note: the timeout is currently hard-coded to 300ms);
- trim the hostname to its short form when `hostname_fqdn` is set to its default (`false`).
**Steps to reproduce the issue:**

- run the datadog agent on GKE with `hostname_fqdn` set to its default (`false`),
- replace the agent pods; when the GCE metadata API call times out, the reported hostname changes.

**Additional environment details (Operating System, Cloud provider, etc):**
I did some quick tests querying the metadata API:

- From a datadog agent container (concealed by GKE), over a 30s period, the response time is near or over 300ms a few times.
- Directly from the GCE host (unconcealed), the response time is constantly below 30ms (10x less).
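For reference, a small Go program along these lines can reproduce the timing test (a reconstruction, not the exact commands I used):

```go
// Reconstruction of the timing test: query the metadata endpoint once per
// second for 30 seconds and print each request's latency, to spot
// responses near or over the agent's 300ms timeout.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	for i := 0; i < 30; i++ {
		req, _ := http.NewRequest("GET",
			"http://169.254.169.254/computeMetadata/v1/?recursive=true", nil)
		req.Header.Set("Metadata-Flavor", "Google")
		start := time.Now()
		resp, err := client.Do(req)
		if err != nil {
			fmt.Println(err)
		} else {
			resp.Body.Close()
			fmt.Println(time.Since(start))
		}
		time.Sleep(time.Second)
	}
}
```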
**Work-around**

Do not rely on the agent's hostname auto-detection: put the node name into the `DD_HOSTNAME` environment variable, as described in https://github.com/DataDog/datadog-agent/issues/4429#issuecomment-557541936
~~Set `hostname_fqdn: true` in the agent's configuration to always get FQDN hostnames. That's the recommended value, by the way.~~