naemono closed this 4 months ago
Since the default stack version was updated to 8.12.0
(#7483), the e2e tests TestFleetKubernetesIntegrationRecipe
and TestFleetAPMIntegrationRecipe
consistently fail.
What is the ECK configuration that reproduces this so that we can try doing it locally?
> What is the ECK configuration that reproduces this so that we can try doing it locally?
I reproduce by applying config/recipes/elastic-agent/fleet-kubernetes-integration.yaml with the ECK operator installed with the default settings.
Steps to reproduce:
# https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-installing-eck.html
helm repo add elastic https://helm.elastic.co
helm repo update
helm install elastic-operator elastic/eck-operator -n elastic-system --create-namespace
# deploy es+kb+fleet+agent
kubectl apply -f https://raw.githubusercontent.com/elastic/cloud-on-k8s/main/config/recipes/elastic-agent/fleet-kubernetes-integration.yaml
# check the existence of the data streams
wget https://gist.githubusercontent.com/thbkrkr/0da65863799bbe415576afb78b74fde3/raw/a4f5ac0099a9bc61df81467a72cb4b08b9e7b782/eckurl && chmod +x ./eckurl
./eckurl elasticsearch /_data_stream/metrics-elastic_agent.metricbeat-default
{"data_streams":[{"name":"metrics-elastic_agent.metricbeat-default","timestamp_field":{"name":"@timestamp"},"indices":[{"index_name":".ds-metrics-elastic_agent.metricbeat-default-2024.01.23-000001","index_uuid":"fvSO8ppTTAmHWpZ4k-5p6A","prefer_ilm":true,"ilm_policy":"metrics","managed_by":"Index Lifecycle Management"}],"generation":1,"_meta":{"package":{"name":"elastic_agent"},"managed_by":"fleet","managed":true},"status":"GREEN","template":"metrics-elastic_agent.metricbeat","ilm_policy":"metrics","next_generation_managed_by":"Index Lifecycle Management","prefer_ilm":true,"hidden":false,"system":false,"allow_custom_routing":false,"replicated":false,"time_series":{"temporal_ranges":[{"start":"2024-01-23T07:20:56.000Z","end":"2024-01-23T11:33:51.000Z"}]}}]}
🟢
./eckurl elasticsearch /_data_stream/metrics-elastic_agent.filebeat-default
{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [metrics-elastic_agent.filebeat-default]","index_uuid":"_na_","resource.type":"index_or_alias","resource.id":"metrics-elastic_agent.filebeat-default","index":"metrics-elastic_agent.filebeat-default"}],"type":"index_not_found_exception","reason":"no such index [metrics-elastic_agent.filebeat-default]","index_uuid":"_na_","resource.type":"index_or_alias","resource.id":"metrics-elastic_agent.filebeat-default","index":"metrics-elastic_agent.filebeat-default"},"status":404}
🔴
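The e2e assertion effectively boils down to "does the data stream exist?". A minimal sketch of that check in shell (check_ds and the trimmed responses are illustrative, not the actual test code):

```shell
# check_ds takes the raw JSON from GET /_data_stream/<name> and reports
# whether the data stream exists: a successful response contains a
# "data_streams" array, while a missing stream returns an error document.
check_ds() {
  if printf '%s' "$1" | grep -q '"data_streams"'; then
    echo "OK"
  else
    echo "MISSING"
  fi
}

# Trimmed versions of the two responses above:
check_ds '{"data_streams":[{"name":"metrics-elastic_agent.metricbeat-default"}]}'  # OK
check_ds '{"error":{"type":"index_not_found_exception"},"status":404}'             # MISSING
```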
Thanks, I can reproduce the problem but I still can't tell why it's happening.
If I manually POST to Elasticsearch the index gets created. This tells me the agent must not be writing to this data stream. I did confirm that I can get these metrics from Filebeat manually by interacting with its unix socket. So that is a bit odd.
I used the following to do this:
Edit: never mind, I was looking at the wrong output.
This format difference doesn't appear for an 8.12.0 agent on my Mac:
sudo curl --unix-socket /Library/Elastic/Agent/data/tmp/PGwsYWcynGUYZEjD872Gs-npqbv-30jS.sock http://localhost/state | jq
{
"beat": {
"name": "Craigs-Macbook-Pro.local"
},
"host": {
"architecture": "arm64",
"hostname": "Craigs-Macbook-Pro.local",
"id": "48DA13D6-B83B-5C71-A4F3-494E674F9F37",
"os": {
"build": "23D56",
"family": "darwin",
"kernel": "23.3.0",
"name": "macOS",
"platform": "darwin",
"version": "14.3"
}
},
"input": {
"count": 2,
"names": [
"log"
]
},
"management": {
"enabled": true
},
"module": {
"count": 0,
"names": []
},
"output": {
"batch_size": 1600,
"clients": 1,
"name": "elasticsearch"
},
"outputs": {
"elasticsearch": {
"cluster_uuid": "JdqUYKmBQQq4IZpBjOaKNQ"
}
},
"queue": {
"name": "mem"
},
"service": {
"id": "0d2d05af-b3af-41ef-8d8e-a604f8666a6c",
"name": "filebeat",
"version": "8.12.0"
}
}
Looking again, I now see that the format for Filebeat is correct but it is missing the Elasticsearch cluster UUID which is a condition for metrics to be collected.
Filebeat:
{
"beat": {
"name": "kind-worker"
},
"host": {
"architecture": "aarch64",
"containerized": false,
"hostname": "kind-worker",
"os": {
"codename": "focal",
"family": "debian",
"kernel": "5.10.76-linuxkit",
"name": "Ubuntu",
"platform": "ubuntu",
"version": "20.04.6 LTS (Focal Fossa)"
}
},
"input": {
"count": 2,
"names": [
"log"
]
},
"management": {
"enabled": true
},
"module": {
"count": 0,
"names": []
},
"output": {
"batch_size": 1600,
"clients": 1,
"name": "elasticsearch"
},
"outputs": {
"elasticsearch": {
"cluster_uuid": ""
}
},
"queue": {
"name": "mem"
},
"service": {
"id": "8463dd7f-4023-49d4-bf69-663da75f6f67",
"name": "filebeat",
"version": "8.12.0"
}
}
Metricbeat:
{
"beat": {
"name": "kind-worker"
},
"host": {
"architecture": "aarch64",
"containerized": false,
"hostname": "kind-worker",
"os": {
"codename": "focal",
"family": "debian",
"kernel": "5.10.76-linuxkit",
"name": "Ubuntu",
"platform": "ubuntu",
"version": "20.04.6 LTS (Focal Fossa)"
}
},
"management": {
"enabled": true
},
"module": {
"count": 11,
"names": [
"system"
]
},
"output": {
"batch_size": 1600,
"clients": 1,
"name": "elasticsearch"
},
"outputs": {
"elasticsearch": {
"cluster_uuid": "hdBpRNasQTGstWKazFHmSw"
}
},
"queue": {
"name": "mem"
},
"service": {
"id": "0f919bbd-8e3d-40dd-b970-98bbfa827012",
"name": "metricbeat",
"version": "8.12.0"
}
}
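The difference between the two /state outputs above can be checked mechanically. A minimal sketch (the inline JSON is a trimmed stand-in for the real /state response; a proper tool would use jq):

```shell
# Extract outputs.elasticsearch.cluster_uuid from a Beat's /state output.
# An empty UUID means the Beat has never successfully published to
# Elasticsearch, so its self-monitoring metrics are not collected.
filebeat_state='{"outputs":{"elasticsearch":{"cluster_uuid":""}}}'

uuid=$(printf '%s' "$filebeat_state" \
  | sed -n 's/.*"cluster_uuid":"\([^"]*\)".*/\1/p')

if [ -z "$uuid" ]; then
  echo "empty cluster UUID: Beat has not shipped any events"
else
  echo "cluster UUID: $uuid"
fi
```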
The most likely reason for this is that Filebeat hasn't sent any data to Elasticsearch. In this case this makes sense, because I took the Kubernetes integration out and there are no system logs on the pod to collect.
I can confirm this by creating /var/log/syslog and putting a line in it:
touch /var/log/syslog
echo "hello!" > /var/log/syslog
Then Filebeat gets a cluster UUID and I see the metrics-elastic_agent.filebeat index created.
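The index doesn't appear instantly after the write, so it helps to poll for it. A sketch of a small retry helper (wait_for is hypothetical, and the eckurl command in the comment assumes the gist script from earlier):

```shell
# wait_for retries a command up to N times, pausing 1s between attempts.
wait_for() {
  tries=$1; shift
  i=1
  while [ "$i" -le "$tries" ]; do
    if "$@"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "gave up after $tries attempts"
  return 1
}

# Demo with a command that succeeds immediately:
wait_for 3 true

# Real usage would look something like:
#   wait_for 30 sh -c \
#     './eckurl elasticsearch /_data_stream/metrics-elastic_agent.filebeat-default | grep -q data_streams'
```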
So the root cause is likely that Filebeat has no data to send. In this case I removed the Kubernetes integration to simplify debugging, but that integration should be sending container logs by default so I need to take a look at why this might be the case.
Putting the k8s integration back in, I can confirm that the filestream input used for container logs is also missing the Elasticsearch cluster UUID in its /state output, because it hasn't sent any data to ES yet.
I can see that we are discovering the containers whose logs we should read. For example, in the computed-config.yaml in the diagnostics, after the expansion of /var/log/containers/*${kubernetes.container.id}.log, I see:
streams:
- data_stream:
dataset: kubernetes.container_logs
id: kubernetes-container-logs-elastic-agent-agent-q99vh-99ea050f475a9328eaf3a644f1f787168ffcb950864de9d7f2f6520ce99bab06
parsers:
- container:
format: auto
stream: all
paths:
- /var/log/containers/*99ea050f475a9328eaf3a644f1f787168ffcb950864de9d7f2f6520ce99bab06.log
If I exec into the agent pod and look to see whether /var/log/containers has been mounted, I don't see it:
root@kind-worker:/usr/share/elastic-agent# ls /var/log
alternatives.log apt bootstrap.log btmp dpkg.log faillog lastlog wtmp
There should be a host path volume mount for /var/log into the agent pod so it can read the container logs on the node:
- name: varlog
hostPath:
path: /var/log
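For reference, the volume and its matching mount in the Agent podTemplate would look roughly like this (a sketch; the container name and daemonSet layout follow the usual ECK Agent conventions, not the exact recipe contents):

```yaml
spec:
  daemonSet:
    podTemplate:
      spec:
        containers:
        - name: agent
          volumeMounts:
          - name: varlog
            mountPath: /var/log
            readOnly: true
        volumes:
        - name: varlog
          hostPath:
            path: /var/log
```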
I'm not sure where that would have been dropped if this was working before.
The only mounts I see in the agent pod are:
kubectl describe pod elastic-agent-agent-q99vh
...
Mounts:
/etc/agent.yml from config (ro,path="agent.yml")
/mnt/elastic-internal/elasticsearch-association/default/elasticsearch/certs from elasticsearch-certs (ro)
/mnt/elastic-internal/fleetserver-association/default/fleet-server/certs from fleetserver-certs-1 (ro)
/usr/share/elastic-agent/state from agent-data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c82zg (ro)
For the record there are a couple of other permissions errors that would affect the k8s integration in the logs. The agent can't list a few resources and kube-state-metrics is missing.
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.553Z","message":"W0125 16:44:04.552798 1054 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1.DaemonSet: daemonsets.apps is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"daemonsets\" in API group \"apps\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.553Z","message":"E0125 16:44:04.553010 1054 reflector.go:138] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: Failed to watch *v1.DaemonSet: failed to list *v1.DaemonSet: daemonsets.apps is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"daemonsets\" in API group \"apps\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.647Z","message":"error making http request: Get \"http://kube-state-metrics:8080/metrics\": lookup kube-state-metrics on 10.96.0.10:53: no such host","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes.state_namespace","log.origin":{"file.line":115,"file.name":"kubernetes/state_metricset.go","function":"github.com/elastic/beats/v7/metricbeat/helper/kubernetes.(*MetricSet).Fetch"},"service.name":"metricbeat","id":"kubernetes/metrics-kubernetes.state_namespace-a3b718bd-efec-54f4-b513-7711c744a8ec","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-01-25T16:44:04.751Z","message":"SSL/TLS verifications disabled.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"tls","log.origin":{"file.line":107,"file.name":"tlscommon/tls_config.go","function":"github.com/elastic/elastic-agent-libs/transport/tlscommon.(*TLSConfig).ToConfig"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.846Z","message":"W0125 16:44:04.846961 1054 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1.StorageClass: storageclasses.storage.k8s.io is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"storageclasses\" in API group \"storage.k8s.io\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.847Z","message":"E0125 16:44:04.847157 1054 reflector.go:138] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: Failed to watch *v1.StorageClass: failed to list *v1.StorageClass: storageclasses.storage.k8s.io is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"storageclasses\" in API group \"storage.k8s.io\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.847Z","message":"error making http request: Get \"http://kube-state-metrics:8080/metrics\": lookup kube-state-metrics on 10.96.0.10:53: no such host","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes.state_node","log.origin":{"file.line":115,"file.name":"kubernetes/state_metricset.go","function":"github.com/elastic/beats/v7/metricbeat/helper/kubernetes.(*MetricSet).Fetch"},"service.name":"metricbeat","id":"kubernetes/metrics-kubernetes.state_node-a3b718bd-efec-54f4-b513-7711c744a8ec","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.852Z","message":"W0125 16:44:04.852377 1054 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"persistentvolumes\" in API group \"\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.852Z","message":"E0125 16:44:04.852415 1054 reflector.go:138] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: Failed to watch *v1.PersistentVolume: failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"persistentvolumes\" in API group \"\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:05.050Z","message":"error making http request: Get \"http://kube-state-metrics:8080/metrics\": lookup kube-state-metrics on 10.96.0.10:53: no such host","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes.state_replicaset","log.origin":{"file.line":115,"file.name":"kubernetes/state_metricset.go","function":"github.com/elastic/beats/v7/metricbeat/helper/kubernetes.(*MetricSet).Fetch"},"service.name":"metricbeat","id":"kubernetes/metrics-kubernetes.state_replicaset-a3b718bd-efec-54f4-b513-7711c744a8ec","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:05.146Z","message":"W0125 16:44:05.146766 1054 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1.CronJob: cronjobs.batch is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"cronjobs\" in API group \"batch\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:05.146Z","message":"E0125 16:44:05.146931 1054 reflector.go:138] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: Failed to watch *v1.CronJob: failed to list *v1.CronJob: cronjobs.batch is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"cronjobs\" in API group \"batch\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
So the missing metrics-elastic_agent.filebeat data stream is a second order symptom of the agent not collecting any logs, neither system logs nor container logs.
I'm not sure what would have changed this; it could be that we got Filebeat metrics because we were collecting container logs and that has now stopped, or we used to collect system logs and that stopped.
Either way, the root cause is something in the k8s configuration of the agent, or possibly something like a container base image change that removed a system log source we used to be able to read (the Elastic Agent base image didn't change).
/var/log was never configured in the fleet-kubernetes-integration.yaml recipe used in the test. Using the same manifest, just changing the version to 8.11.0 with the same version of ECK, the metrics-elastic_agent.filebeat-default data stream is created.
The .ds-logs-elastic_agent.filebeat-default-2024.01.25-000001 index has documents. Doesn't that mean that Filebeat is collecting logs?
> GET /_cat/indices | grep filebeat
green open .ds-metrics-elastic_agent.filebeat_input-default-2024.01.25-000001 tkTKKY4VTGiZ98R6QeX-qw 1 1 4085 0 3.3mb 1.6mb 1.6mb
green open .ds-logs-elastic_agent.filebeat-default-2024.01.25-000001 HlIksAZiQieVaTj3sWAuGg 1 1 1462 0 3.3mb 1.3mb 1.3mb
> GET /.ds-logs-elastic_agent.filebeat-default-2024.01.25-000001/_search \
| jq '.hits.hits[]._source.log.file.path' -r
/usr/share/elastic-agent/state/data/logs/elastic-agent-20240125-1.ndjson
...
Also, the Filebeat filestream input seems to have started reading the last log lines from an elastic-agent pod:
> k logs elastic-agent-agent-kg8bt | grep filebeat | jq .message -r
Input 'filestream' starting
registering
Connecting to backoff(elasticsearch(https://elasticsearch-es-http.default.svc:9200))
Attempting to connect to Elasticsearch version 8.12.0 (default)
Connection to backoff(elasticsearch(https://elasticsearch-es-http.default.svc:9200)) established
Non-zero metrics in the last 30s
Non-zero metrics in the last 30s
Non-zero metrics in the last 30s
The logs-elastic_agent.filebeat-default data stream is written by the monitoring Filebeat, which is and always has been excluded from our internal metrics collection.
I'll take a look at what happens with 8.11.0 to see if I can spot the difference.
Looking at the contents of the metrics-elastic_agent.filebeat-default index with this setup in 8.11.0, it contains documents that have only metadata and no actual fields from the Beat:
Example:
{
"_index": ".ds-metrics-elastic_agent.filebeat-default-2024.01.25-000001",
"_id": "i0-SB2e42r7pkaXFAAABjUM82Ng",
"_score": 1,
"_source": {
"@timestamp": "2024-01-26T00:47:57.656Z",
"agent": {
"ephemeral_id": "75d58779-cf51-454a-9536-23ac30f05e9f",
"id": "00c20c83-0824-44fc-9267-871eb17c5093",
"name": "kind-worker",
"type": "metricbeat",
"version": "8.11.0"
},
"data_stream": {
"dataset": "elastic_agent.filebeat",
"namespace": "default",
"type": "metrics"
},
"ecs": {
"version": "8.0.0"
},
"elastic_agent": {
"id": "00c20c83-0824-44fc-9267-871eb17c5093",
"process": "filebeat",
"snapshot": false,
"version": "8.11.0"
},
"event": {
"agent_id_status": "verified",
"ingested": "2024-01-26T00:47:58Z"
},
"host": {
"architecture": "aarch64",
"containerized": false,
"hostname": "kind-worker",
"ip": [
"10.244.1.1",
"172.18.0.2",
"fc00:f853:ccd:e793::2",
"fe80::42:acff:fe12:2"
],
"mac": [
"02-42-AC-12-00-02",
"26-B9-06-C4-E9-C9",
"7A-B7-3C-13-E3-57",
"96-1E-B7-3F-9E-03",
"9E-09-96-29-B3-8D",
"CA-76-44-33-D5-25",
"EE-5E-F1-53-37-15"
],
"name": "kind-worker",
"os": {
"codename": "focal",
"family": "debian",
"kernel": "5.10.76-linuxkit",
"name": "Ubuntu",
"platform": "ubuntu",
"version": "20.04.6 LTS (Focal Fossa)"
}
},
"metricset": {
"name": "state"
}
}
}
The cluster UUID of each of the two filebeats is empty in 8.11.0 as well, so they aren't actually collecting data either.
I think the culprit is likely https://github.com/elastic/beats/pull/37419/files#diff-028b7407303d60646b0315786de2d77607b4d4f18d93a782d3b7558daf18344e
Specifically this change:
diff --git a/metricbeat/module/beat/state/data.go b/metricbeat/module/beat/state/data.go
index b555c84bd4..5d82310460 100644
--- a/metricbeat/module/beat/state/data.go
+++ b/metricbeat/module/beat/state/data.go
@@ -77,22 +77,19 @@ func eventMapping(r mb.ReporterV2, info beat.Info, content []byte, isXpack bool)
return fmt.Errorf("failure parsing Beat's State API response: %w", err)
}
- event.MetricSetFields, _ = schema.Apply(data)
-
clusterUUID := getMonitoringClusterUUID(data)
if clusterUUID == "" {
if isOutputES(data) {
clusterUUID = getClusterUUID(data)
- if clusterUUID != "" {
- event.ModuleFields.Put("elasticsearch.cluster.id", clusterUUID)
- if event.MetricSetFields != nil {
- event.MetricSetFields.Put("cluster.uuid", clusterUUID)
- }
+ if clusterUUID == "" {
+ return nil
}
}
}
+ event.ModuleFields.Put("elasticsearch.cluster.id", clusterUUID)
+
event.MetricSetFields, _ = schema.Apply(data)
if event.MetricSetFields != nil {
What that change does is skip sending the event entirely when the cluster UUID is empty, which is exactly what we observe happening.
The summary here is:
- We no longer publish Beat metrics if the Beat isn't actually ingesting data, because the event is dropped when the cluster UUID is empty.
- The check relies on the state metricset to begin with, which we should stop doing. https://github.com/elastic/elastic-agent/issues/4153

What I suggest you do here is re-enable this test but remove the check for the existence of metrics-elastic_agent.filebeat-default. Otherwise you won't be testing Fleet managed agents at all while you wait for https://github.com/elastic/elastic-agent/issues/4154.
Alternatively, I think if you mount /var/log/containers into the agent pods they'll start collecting pod logs and the metrics index will also get created. Then you can keep the test the way it is.
Great investigation, thank you very much.
> We no longer publish Beat metrics if the Beat isn't actually ingesting data. I don't think this is strictly a bug in that the metrics aren't interesting if we aren't publishing events, but it is not intuitive.
Makes sense, I agree, it doesn't look like a real bug.
> What I suggest you do here is re-enable this test but remove the check for the existence of metrics-elastic_agent.filebeat-default.
Already done in #7497 :)
See #6494 and #6331
TestFleet* seems to be flaky again:
The errors from the Fleet agent that seem relevant:
https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/396#018c79bc-99cf-48d8-af3e-17fb84b4176c