DataDog / datadog-operator

Kubernetes Operator for Datadog Resources
Apache License 2.0
302 stars 104 forks

DatadogAgent service missing selector #912

Closed nicgrayson closed 1 year ago

nicgrayson commented 1 year ago

Describe what happened: I upgraded from 0.8.x to 1.1.0 and APM stopped working in my cluster.

Describe what you expected: I expected to be able to send APM traffic to http://datadog-agent-agent.datadog.svc.cluster.local:8126, but even a curl to that address doesn't work.

Steps to reproduce the issue: Create a DatadogAgent with the following config:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog-agent
  namespace: datadog
spec:
  features:
    logCollection:
      enabled: true
      containerCollectAll: true
    apm:
      enabled: true
    kubeStateMetricsCore:
      enabled: true
    admissionController:
      enabled: true
    externalMetricsServer:
      enabled: true
      useDatadogMetrics: true
      endpoint:
        credentials:
          apiSecret:
            secretName: datadog-secret
            keyName: api-key
          appSecret:
            secretName: datadog-secret
            keyName: app-key
    clusterChecks:
      enabled: true
      useClusterChecksRunners: true
  global:
    clusterAgentTokenSecret:
      keyName: api-key
      secretName: datadog-secret
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key
      appSecret:
        secretName: datadog-secret
        keyName: app-key
  override:
    clusterAgent:
      replicas: 2

Here is the output of kubectl get svc datadog-agent-agent -o yaml (note that spec has no selector):

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2023-09-06T16:57:23Z"
  labels:
    app.kubernetes.io/instance: datadog-agent
    app.kubernetes.io/managed-by: datadog-operator
    app.kubernetes.io/name: datadog-agent-deployment
    app.kubernetes.io/part-of: datadog-datadog--agent
    app.kubernetes.io/version: ""
    operator.datadoghq.com/managed-by-store: "true"
  name: datadog-agent-agent
  namespace: datadog
  ownerReferences:
  - apiVersion: datadoghq.com/v2alpha1
    blockOwnerDeletion: true
    controller: true
    kind: DatadogAgent
    name: datadog-agent
    uid: a5be1e58-d852-4ce4-a698-10f0e5c56e2e
  resourceVersion: "217564360"
  uid: 35edb98b-ac17-44d7-8e12-aa11f4877835
spec:
  clusterIP: 172.20.128.77
  clusterIPs:
  - 172.20.128.77
  internalTrafficPolicy: Local
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: traceport
    port: 8126
    protocol: TCP
    targetPort: 8126
  - name: dogstatsdport
    port: 8125
    protocol: UDP
    targetPort: 8125
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
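The Service above has no spec.selector, and Kubernetes does not automatically create endpoints for a Service without one, which would be consistent with nothing answering on port 8126. A minimal Python sketch of that check follows, assuming the manifest has already been parsed into a dict (for example from the JSON form of the same kubectl output). The helper name has_selector is hypothetical, and the dict is a trimmed copy of the output above.

```python
def has_selector(service: dict) -> bool:
    """Return True if the Service manifest defines a non-empty spec.selector."""
    spec = service.get("spec") or {}
    return bool(spec.get("selector"))


# Trimmed copy of the Service shown above (assumed already parsed into a dict).
broken_service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "datadog-agent-agent", "namespace": "datadog"},
    "spec": {
        "type": "ClusterIP",
        "internalTrafficPolicy": "Local",
        "ports": [
            {"name": "traceport", "port": 8126, "protocol": "TCP", "targetPort": 8126},
            {"name": "dogstatsdport", "port": 8125, "protocol": "UDP", "targetPort": 8125},
        ],
    },
}

if not has_selector(broken_service):
    print("spec.selector is missing: no pods will back this Service")
```

Running the same check against the Service after a rollback or fix should return True if the selector has been restored.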

Additional environment details (Operating System, Cloud provider, etc): EKS, Deployed with ArgoCD

kaitlavs commented 1 year ago

Hi there! Thanks for reaching out to Datadog. Could you open a support ticket here so we can best troubleshoot this? Within the support ticket, it would help if you could pass along a flare from the cluster agent, a flare from the node agent, and also a pod describe of your application pod you are trying to collect traces from. Thank you!

nicgrayson commented 1 year ago

Opened ticket 1332489. Once a solution is found, I'll post it here for anyone googling this error.

gunzy83 commented 1 year ago

This regression appeared for us with an upgrade of the operator from 1.0.3 to 1.1.0.

Prior to the upgrade, the selector was set correctly, like this:

spec:
  selector:
    agent.datadoghq.com/component: agent
    agent.datadoghq.com/name: datadog

and it is missing in 1.1.0. Rolling back to 1.0.3 resolved the issue for us.
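For context, a Service selector is an equality match on pod labels: a pod backs the Service only if it carries every key/value pair in the selector. A rough Python sketch of that rule, using the selector quoted above, is shown here. The function name selector_matches and the example pod labels are hypothetical and are not taken from a real cluster.

```python
def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """Equality-based match: every selector pair must appear in the pod's labels.

    An empty or missing selector is treated as matching nothing, since a
    Service without a selector gets no automatically managed endpoints.
    """
    if not selector:
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Selector reported on operator 1.0.3 in the comment above.
selector = {
    "agent.datadoghq.com/component": "agent",
    "agent.datadoghq.com/name": "datadog",
}

# Hypothetical labels for a node agent pod (illustration only).
agent_pod_labels = {
    "agent.datadoghq.com/component": "agent",
    "agent.datadoghq.com/name": "datadog",
    "app.kubernetes.io/managed-by": "datadog-operator",
}

print(selector_matches(selector, agent_pod_labels))  # True
print(selector_matches({}, agent_pod_labels))        # False
```

With the selector dropped, as in 1.1.0, the second case applies: no pod is selected, so the Service has no backends.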

nicgrayson commented 1 year ago

@gunzy83 Thanks for this. It's working on 1.0.3 for me.

aarashy commented 1 year ago

Thanks for raising the issue. It drove me nuts.