GoogleCloudPlatform / opentelemetry-operations-go


onGKE returns true when a kops-deployed k8s cluster is deployed on GCE. #906

Closed atoulme closed 4 weeks ago

atoulme commented 1 month ago

I have created a cluster using kops on GCE.

kops create cluster data-collection.k8s.local --zones us-central1-a --state ${KOPS_STATE_STORE}/ --project=${PROJECT}

I then ran the OpenTelemetry Collector with the resourcedetectionprocessor with the gcp detector on the cluster. I saw the following error with the latest contrib 0.112.0 release:

2024-10-30T00:38:20.518Z        warn    internal/resourcedetection.go:130       failed to detect resource       {"kind": "processor", "name": "resourcedetection", "pipeline": "traces", "error": "metadata: GCE metadata \"instance/attributes/cluster-location\" not defined"}

This traces back to the resourcedetectionprocessor code, which requests cluster-location whenever the gcp detector's onGKE check returns true.

onGKE changed recently; it now performs this logic:

    // Check if we are on k8s first
    _, found := d.os.LookupEnv(k8sServiceHostEnv)
    if !found {
        return false
    }
    // If we are on k8s, make sure that we are actually on GKE, and not a
    // different managed k8s platform.
    _, err := d.GKEClusterName()
    return err == nil

d.GKEClusterName() is equivalent to requesting http://169.254.169.254/computeMetadata/v1/instance/attributes/cluster-name with the header "Metadata-Flavor: Google". From inside the container, that request returns data-collection.k8s.local.

However, calling http://169.254.169.254/computeMetadata/v1/instance/attributes/cluster-location returns:

'<!DOCTYPE html>\n<html lang=en>\n  <meta charset=utf-8>\n  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">\n  <title>Error 404 (Not Found)!!1</title>\n  <style>\n    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}\n  </style>\n  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>\n  <p><b>404.</b> <ins>That’s an error.</ins>\n  <p>The requested URL <code>/computeMetadata/v1/instance/attributes/cluster-location</code> was not found on this server.  <ins>That’s all we know.</ins>\n'

My proposal is to change onGKE to check for cluster-location instead of cluster-name. It is unclear to me why the cluster name of this kops-created cluster is set in the compute metadata service.
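A rough sketch of the proposed check, written against stand-in functions rather than the detector's real types. `attributeLookup`, `fakeMetadata`, and `errNotDefined` are hypothetical helpers for illustration, and I am assuming `k8sServiceHostEnv` is `KUBERNETES_SERVICE_HOST`.

```go
package main

import (
	"errors"
	"fmt"
)

// Assumption: this is the value behind k8sServiceHostEnv.
const k8sServiceHostEnv = "KUBERNETES_SERVICE_HOST"

// errNotDefined stands in for the "not defined" error from the metadata client.
var errNotDefined = errors.New("metadata attribute not defined")

// attributeLookup is a hypothetical stand-in for an instance attribute lookup.
type attributeLookup func(name string) (string, error)

// onGKE sketches the proposal: after confirming we are on k8s, require
// cluster-location (instead of cluster-name) to be defined.
func onGKE(lookupEnv func(string) (string, bool), attr attributeLookup) bool {
	if _, found := lookupEnv(k8sServiceHostEnv); !found {
		return false
	}
	_, err := attr("cluster-location")
	return err == nil
}

// fakeMetadata builds a lookup backed by a fixed set of attributes.
func fakeMetadata(attrs map[string]string) attributeLookup {
	return func(name string) (string, error) {
		if v, ok := attrs[name]; ok {
			return v, nil
		}
		return "", errNotDefined
	}
}

func main() {
	inPod := func(string) (string, bool) { return "10.0.0.1", true }

	// kops on GCE: cluster-name is set, cluster-location is not.
	kops := fakeMetadata(map[string]string{"cluster-name": "data-collection.k8s.local"})
	// Illustrative GKE-like instance with both attributes set (values made up).
	gke := fakeMetadata(map[string]string{"cluster-name": "example", "cluster-location": "us-central1"})

	fmt.Println("kops:", onGKE(inPod, kops))
	fmt.Println("gke:", onGKE(inPod, gke))
}
```

With this check the kops cluster is no longer treated as GKE, so the detector would not go on to request cluster-location and fail.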

dashpole commented 1 month ago

Looks like kops adds that to instances it creates: https://github.com/kubernetes/kops/blob/d3554048b827db8a085cc55cde8fcf221a16ae03/pkg/model/gcemodel/autoscalinggroup.go#L108

It doesn't currently add cluster-location, so that should be safe. But it also means that if cluster-location is added in the future, this will break again.