naemono closed this 4 months ago
Since the default stack version was updated to 8.12.0
(#7483), the e2e tests TestFleetKubernetesIntegrationRecipe
and TestFleetAPMIntegrationRecipe
consistently fail.
What is the ECK configuration that reproduces this so that we can try doing it locally?
> What is the ECK configuration that reproduces this so that we can try doing it locally?
I reproduce by applying config/recipes/elastic-agent/fleet-kubernetes-integration.yaml with the ECK operator installed with the default settings.
Steps to reproduce:
# https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-installing-eck.html
helm repo add elastic https://helm.elastic.co
helm repo update
helm install elastic-operator elastic/eck-operator -n elastic-system --create-namespace
# deploy es+kb+fleet+agent
kubectl apply -f https://raw.githubusercontent.com/elastic/cloud-on-k8s/main/config/recipes/elastic-agent/fleet-kubernetes-integration.yaml
# check the existence of the data streams
wget https://gist.githubusercontent.com/thbkrkr/0da65863799bbe415576afb78b74fde3/raw/a4f5ac0099a9bc61df81467a72cb4b08b9e7b782/eckurl && chmod +x ./eckurl
./eckurl elasticsearch /_data_stream/metrics-elastic_agent.metricbeat-default
{"data_streams":[{"name":"metrics-elastic_agent.metricbeat-default","timestamp_field":{"name":"@timestamp"},"indices":[{"index_name":".ds-metrics-elastic_agent.metricbeat-default-2024.01.23-000001","index_uuid":"fvSO8ppTTAmHWpZ4k-5p6A","prefer_ilm":true,"ilm_policy":"metrics","managed_by":"Index Lifecycle Management"}],"generation":1,"_meta":{"package":{"name":"elastic_agent"},"managed_by":"fleet","managed":true},"status":"GREEN","template":"metrics-elastic_agent.metricbeat","ilm_policy":"metrics","next_generation_managed_by":"Index Lifecycle Management","prefer_ilm":true,"hidden":false,"system":false,"allow_custom_routing":false,"replicated":false,"time_series":{"temporal_ranges":[{"start":"2024-01-23T07:20:56.000Z","end":"2024-01-23T11:33:51.000Z"}]}}]}
🟢
./eckurl elasticsearch /_data_stream/metrics-elastic_agent.filebeat-default
{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [metrics-elastic_agent.filebeat-default]","index_uuid":"_na_","resource.type":"index_or_alias","resource.id":"metrics-elastic_agent.filebeat-default","index":"metrics-elastic_agent.filebeat-default"}],"type":"index_not_found_exception","reason":"no such index [metrics-elastic_agent.filebeat-default]","index_uuid":"_na_","resource.type":"index_or_alias","resource.id":"metrics-elastic_agent.filebeat-default","index":"metrics-elastic_agent.filebeat-default"},"status":404}
🔴
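The e2e assertion effectively boils down to "does the data stream exist?". A minimal sketch of that check in shell (check_ds and the trimmed responses are illustrative, not the actual test code):

```shell
# check_ds takes the raw JSON from GET /_data_stream/<name> and reports
# whether the data stream exists: a successful response contains a
# "data_streams" array, while a missing stream returns an error document.
check_ds() {
  if printf '%s' "$1" | grep -q '"data_streams"'; then
    echo "OK"
  else
    echo "MISSING"
  fi
}

# Trimmed versions of the two responses above:
check_ds '{"data_streams":[{"name":"metrics-elastic_agent.metricbeat-default"}]}'  # OK
check_ds '{"error":{"type":"index_not_found_exception"},"status":404}'             # MISSING
```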
Thanks, I can reproduce the problem but I still can't tell why it's happening.
If I manually POST to Elasticsearch the index gets created. This tells me the agent must not be writing to this data stream. I did confirm that I can get these metrics from Filebeat manually by interacting with its unix socket. So that is a bit odd.
I used the following to do this:
Edit: never mind, I was looking at the wrong output.
This format difference doesn't appear for an 8.12.0 agent on my Mac:
sudo curl --unix-socket /Library/Elastic/Agent/data/tmp/PGwsYWcynGUYZEjD872Gs-npqbv-30jS.sock http://localhost/state | jq
{
"beat": {
"name": "Craigs-Macbook-Pro.local"
},
"host": {
"architecture": "arm64",
"hostname": "Craigs-Macbook-Pro.local",
"id": "48DA13D6-B83B-5C71-A4F3-494E674F9F37",
"os": {
"build": "23D56",
"family": "darwin",
"kernel": "23.3.0",
"name": "macOS",
"platform": "darwin",
"version": "14.3"
}
},
"input": {
"count": 2,
"names": [
"log"
]
},
"management": {
"enabled": true
},
"module": {
"count": 0,
"names": []
},
"output": {
"batch_size": 1600,
"clients": 1,
"name": "elasticsearch"
},
"outputs": {
"elasticsearch": {
"cluster_uuid": "JdqUYKmBQQq4IZpBjOaKNQ"
}
},
"queue": {
"name": "mem"
},
"service": {
"id": "0d2d05af-b3af-41ef-8d8e-a604f8666a6c",
"name": "filebeat",
"version": "8.12.0"
}
}
Looking again, I now see that the format for Filebeat is correct but it is missing the Elasticsearch cluster UUID which is a condition for metrics to be collected.
Filebeat:
{
"beat": {
"name": "kind-worker"
},
"host": {
"architecture": "aarch64",
"containerized": false,
"hostname": "kind-worker",
"os": {
"codename": "focal",
"family": "debian",
"kernel": "5.10.76-linuxkit",
"name": "Ubuntu",
"platform": "ubuntu",
"version": "20.04.6 LTS (Focal Fossa)"
}
},
"input": {
"count": 2,
"names": [
"log"
]
},
"management": {
"enabled": true
},
"module": {
"count": 0,
"names": []
},
"output": {
"batch_size": 1600,
"clients": 1,
"name": "elasticsearch"
},
"outputs": {
"elasticsearch": {
"cluster_uuid": ""
}
},
"queue": {
"name": "mem"
},
"service": {
"id": "8463dd7f-4023-49d4-bf69-663da75f6f67",
"name": "filebeat",
"version": "8.12.0"
}
}
Metricbeat:
{
"beat": {
"name": "kind-worker"
},
"host": {
"architecture": "aarch64",
"containerized": false,
"hostname": "kind-worker",
"os": {
"codename": "focal",
"family": "debian",
"kernel": "5.10.76-linuxkit",
"name": "Ubuntu",
"platform": "ubuntu",
"version": "20.04.6 LTS (Focal Fossa)"
}
},
"management": {
"enabled": true
},
"module": {
"count": 11,
"names": [
"system"
]
},
"output": {
"batch_size": 1600,
"clients": 1,
"name": "elasticsearch"
},
"outputs": {
"elasticsearch": {
"cluster_uuid": "hdBpRNasQTGstWKazFHmSw"
}
},
"queue": {
"name": "mem"
},
"service": {
"id": "0f919bbd-8e3d-40dd-b970-98bbfa827012",
"name": "metricbeat",
"version": "8.12.0"
}
}
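The difference between the two /state outputs above can be checked mechanically. A minimal sketch (the inline JSON is a trimmed stand-in for the real /state response; a proper tool would use jq):

```shell
# Extract outputs.elasticsearch.cluster_uuid from a Beat's /state output.
# An empty UUID means the Beat has never successfully published to
# Elasticsearch, so its self-monitoring metrics are not collected.
filebeat_state='{"outputs":{"elasticsearch":{"cluster_uuid":""}}}'

uuid=$(printf '%s' "$filebeat_state" \
  | sed -n 's/.*"cluster_uuid":"\([^"]*\)".*/\1/p')

if [ -z "$uuid" ]; then
  echo "empty cluster UUID: Beat has not shipped any events"
else
  echo "cluster UUID: $uuid"
fi
```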
The most likely reason for this is that Filebeat hasn't sent any data to Elasticsearch. In this case this makes sense, because I took the Kubernetes integration out and there are no system logs on the pod to collect.
I can confirm this by creating /var/log/syslog and putting a line in it:
touch /var/log/syslog
echo "hello!" > /var/log/syslog
Then Filebeat gets a cluster UUID and I see the metrics-elastic_agent.filebeat index created.
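The index doesn't appear instantly after the write, so it helps to poll for it. A sketch of a small retry helper (wait_for is hypothetical, and the eckurl command in the comment assumes the gist script from earlier):

```shell
# wait_for retries a command up to N times, pausing 1s between attempts.
wait_for() {
  tries=$1; shift
  i=1
  while [ "$i" -le "$tries" ]; do
    if "$@"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "gave up after $tries attempts"
  return 1
}

# Demo with a command that succeeds immediately:
wait_for 3 true

# Real usage would look something like:
#   wait_for 30 sh -c \
#     './eckurl elasticsearch /_data_stream/metrics-elastic_agent.filebeat-default | grep -q data_streams'
```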
So the root cause is likely that Filebeat has no data to send. In this case I removed the Kubernetes integration to simplify debugging, but that integration should be sending container logs by default so I need to take a look at why this might be the case.
Putting the k8s integration back in, I can confirm that the filestream input used for container logs is also missing the Elasticsearch cluster UUID in its /state output, because it hasn't sent any data to ES yet.
I can see that we are discovering the containers whose logs we should read. For example, in the computed-config.yaml in the diagnostics, after the expansion of /var/log/containers/*${kubernetes.container.id}.log, I see:
streams:
- data_stream:
dataset: kubernetes.container_logs
id: kubernetes-container-logs-elastic-agent-agent-q99vh-99ea050f475a9328eaf3a644f1f787168ffcb950864de9d7f2f6520ce99bab06
parsers:
- container:
format: auto
stream: all
paths:
- /var/log/containers/*99ea050f475a9328eaf3a644f1f787168ffcb950864de9d7f2f6520ce99bab06.log
If I exec into the agent pod and look to see whether /var/log/containers has been mounted, I don't see it:
root@kind-worker:/usr/share/elastic-agent# ls /var/log
alternatives.log apt bootstrap.log btmp dpkg.log faillog lastlog wtmp
There should be a host path volume mount for /var/log into the agent pod so it can read the container logs on the node:
- name: varlog
hostPath:
path: /var/log
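For reference, the volume and its matching mount in the Agent podTemplate would look roughly like this (a sketch; the container name and daemonSet layout follow the usual ECK Agent conventions, not the exact recipe contents):

```yaml
spec:
  daemonSet:
    podTemplate:
      spec:
        containers:
        - name: agent
          volumeMounts:
          - name: varlog
            mountPath: /var/log
            readOnly: true
        volumes:
        - name: varlog
          hostPath:
            path: /var/log
```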
I'm not sure where that would have been dropped if this was working before.
The only mounts I see in the agent pod are:
kubectl describe pod elastic-agent-agent-q99vh
...
Mounts:
/etc/agent.yml from config (ro,path="agent.yml")
/mnt/elastic-internal/elasticsearch-association/default/elasticsearch/certs from elasticsearch-certs (ro)
/mnt/elastic-internal/fleetserver-association/default/fleet-server/certs from fleetserver-certs-1 (ro)
/usr/share/elastic-agent/state from agent-data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c82zg (ro)
For the record there are a couple of other permissions errors that would affect the k8s integration in the logs. The agent can't list a few resources and kube-state-metrics is missing.
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.553Z","message":"W0125 16:44:04.552798 1054 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1.DaemonSet: daemonsets.apps is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"daemonsets\" in API group \"apps\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.553Z","message":"E0125 16:44:04.553010 1054 reflector.go:138] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: Failed to watch *v1.DaemonSet: failed to list *v1.DaemonSet: daemonsets.apps is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"daemonsets\" in API group \"apps\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.647Z","message":"error making http request: Get \"http://kube-state-metrics:8080/metrics\": lookup kube-state-metrics on 10.96.0.10:53: no such host","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes.state_namespace","log.origin":{"file.line":115,"file.name":"kubernetes/state_metricset.go","function":"github.com/elastic/beats/v7/metricbeat/helper/kubernetes.(*MetricSet).Fetch"},"service.name":"metricbeat","id":"kubernetes/metrics-kubernetes.state_namespace-a3b718bd-efec-54f4-b513-7711c744a8ec","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-01-25T16:44:04.751Z","message":"SSL/TLS verifications disabled.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"tls","log.origin":{"file.line":107,"file.name":"tlscommon/tls_config.go","function":"github.com/elastic/elastic-agent-libs/transport/tlscommon.(*TLSConfig).ToConfig"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.846Z","message":"W0125 16:44:04.846961 1054 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1.StorageClass: storageclasses.storage.k8s.io is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"storageclasses\" in API group \"storage.k8s.io\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.847Z","message":"E0125 16:44:04.847157 1054 reflector.go:138] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: Failed to watch *v1.StorageClass: failed to list *v1.StorageClass: storageclasses.storage.k8s.io is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"storageclasses\" in API group \"storage.k8s.io\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.847Z","message":"error making http request: Get \"http://kube-state-metrics:8080/metrics\": lookup kube-state-metrics on 10.96.0.10:53: no such host","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes.state_node","log.origin":{"file.line":115,"file.name":"kubernetes/state_metricset.go","function":"github.com/elastic/beats/v7/metricbeat/helper/kubernetes.(*MetricSet).Fetch"},"service.name":"metricbeat","id":"kubernetes/metrics-kubernetes.state_node-a3b718bd-efec-54f4-b513-7711c744a8ec","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.852Z","message":"W0125 16:44:04.852377 1054 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"persistentvolumes\" in API group \"\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:04.852Z","message":"E0125 16:44:04.852415 1054 reflector.go:138] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: Failed to watch *v1.PersistentVolume: failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"persistentvolumes\" in API group \"\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:05.050Z","message":"error making http request: Get \"http://kube-state-metrics:8080/metrics\": lookup kube-state-metrics on 10.96.0.10:53: no such host","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes.state_replicaset","log.origin":{"file.line":115,"file.name":"kubernetes/state_metricset.go","function":"github.com/elastic/beats/v7/metricbeat/helper/kubernetes.(*MetricSet).Fetch"},"service.name":"metricbeat","id":"kubernetes/metrics-kubernetes.state_replicaset-a3b718bd-efec-54f4-b513-7711c744a8ec","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:05.146Z","message":"W0125 16:44:05.146766 1054 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1.CronJob: cronjobs.batch is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"cronjobs\" in API group \"batch\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-25T16:44:05.146Z","message":"E0125 16:44:05.146931 1054 reflector.go:138] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: Failed to watch *v1.CronJob: failed to list *v1.CronJob: cronjobs.batch is forbidden: User \"system:serviceaccount:default:elastic-agent\" cannot list resource \"cronjobs\" in API group \"batch\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0"}
So the missing metrics-elastic_agent.filebeat data stream is a second order symptom of the agent not collecting any logs, neither system logs nor container logs.
I'm not sure what would have changed this; it could be that we got Filebeat metrics because we were collecting container logs and that has now stopped, or we used to collect system logs and that stopped.
Either way, the root cause is something in the k8s configuration of the agent, or possibly something like a container base image change that removed a system log source we used to be able to read (the Elastic Agent base image didn't change).
/var/log was never configured in the fleet-kubernetes-integration.yaml recipe used in the test. Using the same manifest, just changing the version to 8.11.0 with the same version of ECK, the metrics-elastic_agent.filebeat-default data stream is created.
The .ds-logs-elastic_agent.filebeat-default-2024.01.25-000001 index has documents. Doesn't that mean that Filebeat is collecting logs?
> GET /_cat/indices | grep filebeat
green open .ds-metrics-elastic_agent.filebeat_input-default-2024.01.25-000001 tkTKKY4VTGiZ98R6QeX-qw 1 1 4085 0 3.3mb 1.6mb 1.6mb
green open .ds-logs-elastic_agent.filebeat-default-2024.01.25-000001 HlIksAZiQieVaTj3sWAuGg 1 1 1462 0 3.3mb 1.3mb 1.3mb
> GET /.ds-logs-elastic_agent.filebeat-default-2024.01.25-000001/_search \
| jq '.hits.hits[]._source.log.file.path' -r
/usr/share/elastic-agent/state/data/logs/elastic-agent-20240125-1.ndjson
...
Also, the Filebeat filestream input seems to have started reading the last log lines from an elastic-agent pod:
> k logs elastic-agent-agent-kg8bt | grep filebeat | jq .message -r
Input 'filestream' starting
registering
Connecting to backoff(elasticsearch(https://elasticsearch-es-http.default.svc:9200))
Attempting to connect to Elasticsearch version 8.12.0 (default)
Connection to backoff(elasticsearch(https://elasticsearch-es-http.default.svc:9200)) established
Non-zero metrics in the last 30s
Non-zero metrics in the last 30s
Non-zero metrics in the last 30s
The logs-elastic_agent.filebeat-default data stream is written by the monitoring Filebeat, which is and always has been excluded from our internal metrics collection.
I'll take a look at what happens with 8.11.0 to see if I can spot the difference.
Looking at the contents of the metrics-elastic_agent.filebeat-default index with this setup in 8.11.0, it contains documents that have only metadata and no actual fields from the Beat:
Example:
{
"_index": ".ds-metrics-elastic_agent.filebeat-default-2024.01.25-000001",
"_id": "i0-SB2e42r7pkaXFAAABjUM82Ng",
"_score": 1,
"_source": {
"@timestamp": "2024-01-26T00:47:57.656Z",
"agent": {
"ephemeral_id": "75d58779-cf51-454a-9536-23ac30f05e9f",
"id": "00c20c83-0824-44fc-9267-871eb17c5093",
"name": "kind-worker",
"type": "metricbeat",
"version": "8.11.0"
},
"data_stream": {
"dataset": "elastic_agent.filebeat",
"namespace": "default",
"type": "metrics"
},
"ecs": {
"version": "8.0.0"
},
"elastic_agent": {
"id": "00c20c83-0824-44fc-9267-871eb17c5093",
"process": "filebeat",
"snapshot": false,
"version": "8.11.0"
},
"event": {
"agent_id_status": "verified",
"ingested": "2024-01-26T00:47:58Z"
},
"host": {
"architecture": "aarch64",
"containerized": false,
"hostname": "kind-worker",
"ip": [
"10.244.1.1",
"172.18.0.2",
"fc00:f853:ccd:e793::2",
"fe80::42:acff:fe12:2"
],
"mac": [
"02-42-AC-12-00-02",
"26-B9-06-C4-E9-C9",
"7A-B7-3C-13-E3-57",
"96-1E-B7-3F-9E-03",
"9E-09-96-29-B3-8D",
"CA-76-44-33-D5-25",
"EE-5E-F1-53-37-15"
],
"name": "kind-worker",
"os": {
"codename": "focal",
"family": "debian",
"kernel": "5.10.76-linuxkit",
"name": "Ubuntu",
"platform": "ubuntu",
"version": "20.04.6 LTS (Focal Fossa)"
}
},
"metricset": {
"name": "state"
}
}
}
The cluster UUID of each of the two filebeats is empty in 8.11.0 as well, so they aren't actually collecting data either.
I think the culprit is likely https://github.com/elastic/beats/pull/37419/files#diff-028b7407303d60646b0315786de2d77607b4d4f18d93a782d3b7558daf18344e
Specifically this change:
diff --git a/metricbeat/module/beat/state/data.go b/metricbeat/module/beat/state/data.go
index b555c84bd4..5d82310460 100644
--- a/metricbeat/module/beat/state/data.go
+++ b/metricbeat/module/beat/state/data.go
@@ -77,22 +77,19 @@ func eventMapping(r mb.ReporterV2, info beat.Info, content []byte, isXpack bool)
return fmt.Errorf("failure parsing Beat's State API response: %w", err)
}
- event.MetricSetFields, _ = schema.Apply(data)
-
clusterUUID := getMonitoringClusterUUID(data)
if clusterUUID == "" {
if isOutputES(data) {
clusterUUID = getClusterUUID(data)
- if clusterUUID != "" {
- event.ModuleFields.Put("elasticsearch.cluster.id", clusterUUID)
- if event.MetricSetFields != nil {
- event.MetricSetFields.Put("cluster.uuid", clusterUUID)
- }
+ if clusterUUID == "" {
+ return nil
}
}
}
+ event.ModuleFields.Put("elasticsearch.cluster.id", clusterUUID)
+
event.MetricSetFields, _ = schema.Apply(data)
if event.MetricSetFields != nil {
What that change does is skip sending the event entirely when the cluster UUID is empty, which is exactly what we observe happening.
The summary here is:
- We no longer publish Beat metrics if the Beat isn't actually ingesting data, because the event is dropped when the cluster UUID is empty.
- The check relies on the state metricset to begin with, which we should stop doing. https://github.com/elastic/elastic-agent/issues/4153

What I suggest you do here is re-enable this test but remove the check for the existence of metrics-elastic_agent.filebeat-default. Otherwise you won't be testing Fleet managed agents at all while you wait for https://github.com/elastic/elastic-agent/issues/4154.
Alternatively, I think if you mount /var/log/containers into the agent pods they'll start collecting pod logs and the metrics index will also get created. Then you can keep the test the way it is.
Great investigation, thank you very much.
> We no longer publish Beat metrics if the Beat isn't actually ingesting data. I don't think this is strictly a bug in that the metrics aren't interesting if we aren't publishing events, but it is not intuitive.
Makes sense, I agree, it doesn't look like a real bug.
> What I suggest you do here is re-enable this test but remove the check for the existence of metrics-elastic_agent.filebeat-default.
Already done in #7497 :)
See #6494 and #6331
TestFleet* seems to be flaky again:
The errors from the Fleet agent that seem relevant:
https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/396#018c79bc-99cf-48d8-af3e-17fb84b4176c