DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

GCE/GKE: agent hostname sometimes changes when instance metadata retrieval times out #4429

Open · pdecat opened this issue 5 years ago

pdecat commented 5 years ago

Output of the info page (if this is a bug)

patrick@gke-myproject-europe-mycluster-pool-b-ca79cf07-skfd ~ $ docker exec -ti k8s_dd-agent_dd-agent-8mkzv_monitoring_023f5fe5-0538-11ea-a181-42010a98000a_0 agent status
Getting the status from the agent.

===============
Agent (v6.11.1)
===============

  Status date: 2019-11-12 15:49:58.951261 UTC
  Agent start: 2019-11-12 10:34:32.758761 UTC
  Pid: 349
  Python Version: 2.7.16
  Check Runners: 16
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -27µs
    System UTC time: 2019-11-12 15:49:58.951261 UTC

  Host Info
  =========
    bootTime: 2019-08-21 07:55:36.000000 UTC
    kernelVersion: 4.14.127+
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: buster/sid
    procs: 65
    uptime: 1994h39m4s

  Hostnames
  =========
    host_aliases: [gke-myproject-europe-mycluster-pool-b-ca79cf07-skfd.myproject gke-myproject-europe-mycluster-pool-b-ca79cf07-skfd-myproject-europe-west1-mycluster]
    hostname: gke-myproject-europe-mycluster-pool-b-ca79cf07-skfd
    socket-fqdn: dd-agent-8mkzv
    socket-hostname: dd-agent-8mkzv
    hostname provider: container
    unused hostname providers:
      aws: not retrieving hostname from AWS: the host is not an ECS instance, and other providers already retrieve non-default hostnames
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: Get http://169.254.169.254/computeMetadata/v1/instance/hostname: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

=========
Collector
=========
...

Note the gce: unable to retrieve hostname from GCE: Get http://169.254.169.254/computeMetadata/v1/instance/hostname: net/http: request canceled (Client.Timeout exceeded while awaiting headers) message.

Describe what happened:

During a replacement of the datadog agent pods on 12 GKE nodes, 2 of them had their hostname changed.

This caused "No data" alerts for the old hostnames.

This is not the first time we have seen hostname changes when replacing datadog agent pods for upgrades or configuration changes; we just never investigated the issue further before.

On the agent status for the first host, which went from FQDN to short name, the call to the http://169.254.169.254/computeMetadata/v1/instance/hostname endpoint is reported to have failed because of a timeout.

On the other host, no error is reported, so it must have failed during the previous datadog agent initialization and succeeded this time:

patrick@gke-myproject-eur-mycluster-pool-a-3c90d53b-pgzs ~ $ docker exec -ti k8s_dd-agent_dd-agent-plbn4_monitoring_519a7b11-0253-11ea-98fb-42010a880008_0 agent status
Getting the status from the agent.

===============
Agent (v6.11.1)
===============

  Status date: 2019-11-12 16:27:35.051126 UTC
  Agent start: 2019-11-08 18:12:28.827761 UTC
  Pid: 344
  Python Version: 2.7.16
  Check Runners: 16
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -26µs
    System UTC time: 2019-11-12 16:27:35.051126 UTC

  Host Info
  =========
    bootTime: 2019-08-20 14:25:56.000000 UTC
    kernelVersion: 4.14.127+
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: buster/sid
    procs: 65
    uptime: 1923h46m40s

  Hostnames
  =========
    host_aliases: [gke-myproject-eur-mycluster-pool-a-3c90d53b-pgzs.myproject gke-myproject-eur-mycluster-pool-a-3c90d53b-pgzs-myproject-europe-west1-mycluster]
    hostname: gke-myproject-eur-mycluster-pool-a-3c90d53b-pgzs.c.myproject.internal
    socket-fqdn: dd-agent-plbn4
    socket-hostname: dd-agent-plbn4
    hostname provider: gce
    unused hostname providers:
      configuration/environment: hostname is empty

Describe what you expected:

The datadog agent should keep stable hostnames on pod replacement.

I can see at least two options to fix this:

Note: the timeout is currently hard-coded to 300ms.
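
For illustration, here is a minimal Go sketch of the kind of metadata query involved, with the timeout passed in as a parameter instead of being hard-coded to 300ms. The function and parameter names are illustrative only and do not reflect the agent's actual internals.

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// gceMetadataURL is the well-known GCE/GKE metadata endpoint shown in the agent status above.
const gceMetadataURL = "http://169.254.169.254/computeMetadata/v1/instance/hostname"

// getGCEHostname queries the metadata server with a caller-supplied timeout.
// In the agent the timeout is currently hard-coded to 300ms; making it a
// parameter (or a configuration option) is the kind of change suggested here.
func getGCEHostname(timeout time.Duration) (string, error) {
	client := http.Client{Timeout: timeout}

	req, err := http.NewRequest("GET", gceMetadataURL, nil)
	if err != nil {
		return "", err
	}
	// The metadata server requires this header, as in the curl examples below.
	req.Header.Set("Metadata-Flavor", "Google")

	resp, err := client.Do(req)
	if err != nil {
		return "", fmt.Errorf("unable to retrieve hostname from GCE: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("status code %d trying to GET %s", resp.StatusCode, gceMetadataURL)
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	// A longer timeout (e.g. 1s) would absorb the >300ms spikes measured below.
	hostname, err := getGCEHostname(1 * time.Second)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(hostname)
}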

Steps to reproduce the issue:

  1. deploy the datadog agent as a DaemonSet on GKE with metadata concealment enabled and hostname_fqdn set to its default (false),
  2. replace the datadog agent pods several times,
  3. at some point, some agents will fail to query the metadata endpoint and see their hostname change.

Additional environment details (Operating System, Cloud provider, etc):

I did some quick tests querying the metadata API:

From a datadog agent container (with GKE metadata concealment enabled), over a 30s period, the response time is near or over 300ms several times:

patrick@gke-myproject-eur-mycluster-pool-a-3c90d53b-pgzs ~ $ while true; do docker exec -ti k8s_dd-agent_dd-agent-plbn4_monitoring_519a7b11-0253-11ea-98fb-42010a880008_0 curl --write-out "%{http_code},%{time_total},%{time_connect},%{time_appconnect},%{time_starttransfer}\n" --silent --output /dev/null -H "Metadata-Flavor: Google" http://169.254.169.254/computeMetadata/v1/instance/hostname | grep -v 200,0.0 ; done
200,0.200194,0.000296,0.000000,0.200134
200,0.291702,0.000480,0.000000,0.291608
200,0.201483,0.000257,0.000000,0.201418
200,0.182730,0.000244,0.000000,0.182647
200,0.191352,0.000272,0.000000,0.191270
200,0.398903,0.000263,0.000000,0.398823
^C

Directly from the GCE host (unconcealed), the response time is consistently below 30ms (roughly 10x less):

patrick@gke-myproject-eur-mycluster-pool-a-3c90d53b-pgzs ~ $ while true; do curl --write-out "%{http_code},%{time_total},%{time_connect},%{time_appconnect},%{time_starttransfer}\n" --silent --output /dev/null -H "Metadata-Flavor: Google" http://169.254.169.254/computeMetadata/v1/instance/hostname | grep -v 200,0.00 ; done
200,0.010937,0.009941,0.000000,0.010894
200,0.011541,0.001175,0.000000,0.011499
200,0.028939,0.002542,0.000000,0.028881
200,0.011865,0.000261,0.000000,0.011793
200,0.016981,0.013078,0.000000,0.016908
200,0.015131,0.008786,0.000000,0.015087
200,0.016189,0.015270,0.000000,0.016146
200,0.015605,0.014615,0.000000,0.015556
200,0.016237,0.015152,0.000000,0.016196
^C

Work-around

Do not rely on the agent's hostname auto-detection; instead, put the node name into the DD_HOSTNAME environment variable: https://github.com/DataDog/datadog-agent/issues/4429#issuecomment-557541936

~Set hostname_fqdn: true in the agent's configuration to always get FQDN hostnames. That's the recommended value by the way.~

pdecat commented 4 years ago

The anticipated work-around with hostname_fqdn: true does not work, as it relies on hostname -f, which in kubernetes containers returns the pod name, which is even less stable, e.g.:

root@dd-agent-ggsrq:/opt/datadog-agent# /bin/hostname -f
dd-agent-ggsrq

https://github.com/DataDog/datadog-agent/blame/6.15.0/pkg/util/hostname.go#L159-L161 https://github.com/DataDog/datadog-agent/blob/6.15.0/pkg/util/hostname_nix.go#L21

pdecat commented 4 years ago

When the GCE metadata API call times out, the short (non-FQDN) hostname comes from the kubernetes apiserver hostname provider:

https://github.com/DataDog/datadog-agent/blame/6.15.0/pkg/util/hostname.go#L173 https://github.com/DataDog/datadog-agent/blame/6.15.0/pkg/util/hostname/kube_apiserver.go#L18 https://github.com/DataDog/datadog-agent/blame/6.15.0/pkg/util/kubernetes/apiserver/hostname.go#L33
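
As a rough sketch of that fallback behavior (not the agent's actual code; see the links above for the real implementation), hostname resolution can be thought of as a chain of providers where a provider that errors out, such as GCE on a metadata timeout, is silently skipped in favor of another one:

package main

import (
	"errors"
	"fmt"
)

// hostnameProvider is a simplified stand-in for the agent's hostname
// providers (configuration, kube apiserver, GCE metadata, ...). Names and
// ordering here are illustrative; the real logic is in pkg/util/hostname.go
// linked above.
type hostnameProvider struct {
	name string
	get  func() (string, error)
}

// resolveHostname walks the providers in order of increasing precedence and
// keeps the last successful answer. If the GCE provider times out, the
// earlier kube apiserver answer (the short node name) is what remains.
func resolveHostname(providers []hostnameProvider) (hostname, from string) {
	for _, p := range providers {
		if h, err := p.get(); err == nil && h != "" {
			hostname, from = h, p.name
		}
	}
	return hostname, from
}

func main() {
	providers := []hostnameProvider{
		{"kube_apiserver", func() (string, error) {
			return "gke-node-shortname", nil
		}},
		{"gce", func() (string, error) {
			// Simulate the 300ms metadata timeout seen in the issue.
			return "", errors.New("Client.Timeout exceeded while awaiting headers")
		}},
	}
	h, from := resolveHostname(providers)
	fmt.Printf("hostname=%s provider=%s\n", h, from) // hostname=gke-node-shortname provider=kube_apiserver
}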

pdecat commented 4 years ago

To get stable hostnames, maybe the agent could trim the hostname retrieved from GCE metadata if hostname_fqdn is false.
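
A minimal sketch of what that trimming could look like, assuming it simply means keeping only the first DNS label of the metadata hostname when hostname_fqdn is false (a suggestion, not the agent's code):

package main

import (
	"fmt"
	"strings"
)

// shortHostname keeps only the first DNS label of a GCE-style FQDN such as
// "gke-...-skfd.c.myproject.internal", matching what the kube apiserver
// provider reports, so the hostname stays the same whether or not the
// metadata call succeeds.
func shortHostname(fqdn string) string {
	if i := strings.IndexByte(fqdn, '.'); i > 0 {
		return fqdn[:i]
	}
	return fqdn
}

func main() {
	fmt.Println(shortHostname("gke-myproject-europe-mycluster-pool-b-ca79cf07-skfd.c.myproject.internal"))
	// Output: gke-myproject-europe-mycluster-pool-b-ca79cf07-skfd
}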

pdecat commented 4 years ago

A work-around is to put the node name into the DD_HOSTNAME environment variable and not rely on the agent's hostname auto-detection:

YAML:

        env:
          - name: DD_HOSTNAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName

Terraform:

          env {
            name = "DD_HOSTNAME"

            value_from {
              field_ref {
                field_path = "spec.nodeName"
              }
            }
          }

Note: on GKE, this produces short, non-FQDN hostnames for all agents, so no-data alerts are triggered one more time, but this should be the last occurrence since the hostnames are now stable.

pdecat commented 4 years ago

This short timeout also causes issues when collecting instance attributes with the GKE Metadata Server (needed for GKE Workload Identity):

root@dd-agent-6rh7z:/# agent check kubernetes_apiserver -l trace
[...]
2020-01-30 17:50:29 UTC | CORE | DEBUG | (pkg/metadata/host/host_tags.go:82 in getHostTags) | No GCE host tags Get http://169.254.169.254/computeMetadata/v1/?recursive=true: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
[...]
=========
Collector
=========
  Running Checks
  ==============
    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Total Runs: 1
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 601ms
Check has run only once, if some metrics are missing you can try again with --check-rate to see any other metric if available.

Invoking that endpoint with curl takes more than 300ms from time to time:

root@dd-agent-6rh7z:/# time curl -i -s 'http://169.254.169.254/computeMetadata/v1/?recursive=true' -H 'Metadata-Flavor: Google'
HTTP/1.1 200 OK
Content-Type: application/json
Metadata-Flavor: Google
Server: GKE Metadata Server
Date: Thu, 30 Jan 2020 18:17:45 GMT
Content-Length: 744

{"instance":{"attributes":{"clusterLocation":"europe-west1","clusterName":"myproject-preprod-europe-west1-gke1","clusterUid":"123456789"},"hostname":"gke-myproject-preprod-eur-gke1-pool-b-e9fd4057-lbwc.c.myproject-preprod.internal","id":123456789,"serviceAccounts":{"default":{"aliases":["default"],"email":"myproject-preprod.svc.id.goog","scopes":"https://www.googleapis.com/auth/cloud-platform"},"myprojectPreprod.svc.id.goog":{"aliases":["default"],"email":"myproject-preprod.svc.id.goog","scopes":"https://www.googleapis.com/auth/cloud-platform"}},"zone":"projects/123456789/zones/europe-west1-d"},"project":{"numericProjectId":123456789,"projectId":"myproject-preprod"}}
real    0m0.345s
user    0m0.005s
sys     0m0.001s

A second error, shown right above, reveals another incompatibility with the GKE Metadata Server:

2020-01-30 17:50:28 UTC | CORE | DEBUG | (pkg/metadata/host/host.go:104 in getHostAliases) | no GCE Host Alias: unable to retrieve instance name from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/name
root@dd-agent-6rh7z:/# curl -i -s 'http://169.254.169.254/computeMetadata/v1/instance/name' -H 'Metadata-Flavor: Google'
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Thu, 30 Jan 2020 17:51:42 GMT
Content-Length: 52
GKE Metadata Server encountered an error: Not Found

pdecat commented 4 years ago

Running the first curl command directly from the GCE host returns cluster-name instead of clusterName:

patrick@gke-myproject-preprod-eur-gke1-pool-b-9f645bf0-gvc8 ~ $ time curl -i -s 'http://169.254.169.254/computeMetadata/v1/?recursive=true' -H 'Metadata-Flavor: Google'
HTTP/1.1 200 OK
Metadata-Flavor: Google
Content-Type: application/json
ETag: 123456789
Date: Thu, 30 Jan 2020 18:17:28 GMT
Server: Metadata Server for VM
Content-Length: 42245
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN

{"instance":{"attributes":{"cluster-location":"europe-west1","cluster-name":"myproject-preprod-europe-west1-gke1","cluster-uid":"123456789","configure-sh":
[...]
pdecat commented 4 years ago

Finally, when GKE Metadata Concealment is enabled, the recursive call is simply forbidden:

root@dd-agent-2scln:/# time curl -i -s 'http://169.254.169.254/computeMetadata/v1/?recursive=true' -H 'Metadata-Flavor: Google'
HTTP/1.1 403 Forbidden
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Thu, 30 Jan 2020 18:27:21 GMT
Content-Length: 58

This metadata endpoint is concealed for ?recursive calls.

real    0m0.045s
user    0m0.004s
sys     0m0.005s

byronmccollum commented 4 years ago

Having the same issue with hostnames changing between the short and long variants.

I also have the issue with the GKE Metadata Server returning renamed metadata keys when a recursive query is used (containerName vs container-name). I have a bug filed with Google regarding this.

pdecat commented 4 years ago

> Note: on GKE, this produces short, non-FQDN hostnames for all agents, so no-data alerts are triggered one more time, but this should be the last occurrence since the hostnames are now stable.

⚠️ Having short, non-FQDN hostnames for agents seems to affect Datadog billing, as these differ from the ones reported by the GCE integration.

byronmccollum commented 4 years ago

> Having the same issue with hostnames changing between the short and long variants.
>
> I also have the issue with the GKE Metadata Server returning renamed metadata keys when a recursive query is used (containerName vs container-name). I have a bug filed with Google regarding this.

The bug should be fixed in an upcoming GKE release.

byronmccollum commented 4 years ago

> Note: on GKE, this produces short, non-FQDN hostnames for all agents, so no-data alerts are triggered one more time, but this should be the last occurrence since the hostnames are now stable.
>
> ⚠️ Having short, non-FQDN hostnames for agents seems to affect Datadog billing, as these differ from the ones reported by the GCE integration.

I have a ticket open with Datadog about this.

pdecat commented 4 years ago

Opened one on Datadog side too today.

~What's the expected resolution on the GKE side? Lower response times?~

Edit: re-read your message, nevermind.

pdecat commented 4 years ago

> The anticipated work-around with hostname_fqdn: true does not work, as it relies on hostname -f, which in kubernetes containers returns the pod name, which is even less stable, e.g.:
>
> root@dd-agent-ggsrq:/opt/datadog-agent# /bin/hostname -f
> dd-agent-ggsrq
>
> https://github.com/DataDog/datadog-agent/blame/6.15.0/pkg/util/hostname.go#L159-L161 https://github.com/DataDog/datadog-agent/blob/6.15.0/pkg/util/hostname_nix.go#L21

FTR, there was a recent change to actually prevent using hostname_fqdn: true in containers when they are not using the host network: https://github.com/DataDog/datadog-agent/pull/4503

Containers using the host network do actually share the host name:

# kubectl --context gke_myproject_europe-west1_myproject-europe-west1-mycluster exec -ti -n test  testpod-hostnetwork-bqn79 -- hostname -f
gke-myproject-europe-mycluster-pool-b-f358f407-9l2w.c.myproject.internal

pdecat commented 4 years ago

FWIW, SignalFx just made their previously hard-coded 1s metadata HTTP request timeout into a configurable setting, defaulting to 2s for AWS, Azure and GCP: https://github.com/signalfx/signalfx-agent/pull/1296

byronmccollum commented 4 years ago

https://github.com/DataDog/datadog-agent/pull/5419/

mhamrah commented 3 years ago

I'm curious how others are handling their GKE node names with the datadog agent. Are people using the downward API to set DD_HOSTNAME from nodeName? I find that the node name (the short name of the host) is not an alias by default, as DD concatenates DD_ENV to the node name. This makes it difficult to associate metrics/logs/traces that emit a node name. I've configured the agent to use the downward API, which required a fork of the helm chart. Doing this, though, loses the metadata call that gets the FQDN. Seems janky. Are we losing anything by not having the FQDN sent to DD for GKE instances?