btrieger opened this issue 1 month ago
So I was able to reproduce the configmap issue on my end as well.
I am also gonna follow up about the list Nodes permissions 🙂
Ah, my apologies on the 0640. Does fsGroup not mount it as 0:1000 instead of 0:0, so that I would have access to it? I appear to be able to read it, since it is disabling the leader election after the first restart.
In any regard, I figure the solution is to find a way to add configs that can be passed down to an agent from Fleet, or to merge the configmap with what Fleet provides.
Even if the above wasn't like that, the replaceWith invoked by ReplaceOnSuccessStore here has a value of application.DefaultAgentFleetConfig, thus I don't see how a custom supplied config merges with the fleet one, but maybe I am missing something?!
In https://github.com/elastic/elastic-agent/pull/4166 a change was made so that if the new config contains the content of the default fleet config, we won't do the replacement by rotation. This is not really obvious at all unless you know this PR exists. The default fleet config only contains fleet.enabled: true, so having that should have been enough to get past it. Quoting the PR:
Skipping replacing the current agent configuration with default fleet configuration upon enrollment, in case the current configuration already contains the configuration from the default fleet configuration.
I added a volume mount to set the /etc/elastic-agent folder to be owned by 0:1000 so that I would be able to write to it, and I can confirm that "device or resource busy" is the result. I also updated the mount for the configmap to be 0660 and confirmed elastic-agent is the group, so it has read and write.
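For reference, a minimal sketch of the mount setup described above, assuming the pod runs with group 1000; the volume name and mode are illustrative, not taken from the actual manifest:
spec:
  securityContext:
    fsGroup: 1000              # volume contents get group 1000 (0:1000) instead of 0:0
  containers:
    - name: elastic-agent
      volumeMounts:
        - name: agent-config
          mountPath: /etc/elastic-agent
  volumes:
    - name: agent-config
      configMap:
        name: agent-node-datastreams
        defaultMode: 0660      # group read/write, matching the 0660 mentioned above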
Oh I see @cmacknz, it is the other way around from what I understood the diff to be! I validated that: by adding fleet.enabled: true
in my static config I get no rotation and thus no error. However, all my testing is with elastic-agent:8.16.0-SNAPSHOT, which does "extra things" in the agent-state path. @btrieger I can see that you already have fleet.enabled: true
in your config, so could you send me the exact error you are seeing, and maybe in parallel give 8.16.0-SNAPSHOT a go? 🙂
Yeah, I can share the error. I read through the code and then updated my config to be:
fleet:
  enabled: true
instead of
fleet.enabled: true
and that made it skip the replace. Both are valid YAML, but the dotted form causes the diff to not match.
Here is the error on 8.15.0:
Policy selected for enrollment: 8c813cf6-a816-4722-be51-7341a192ba2e
{"log.level":"info","@timestamp":"2024-09-19T17:20:28.581Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.(*enrollCmd).enrollWithBackoff","file.name":"cmd/enroll_cmd.go","file.line":518},"message":"Starting enrollment to URL: https://076433a0edff4346b220c692d2e9c56a.fleet.us-central1.gcp.cloud.es.io:443/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-09-19T17:20:29.591Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.(*enrollCmd).enrollWithBackoff","file.name":"cmd/enroll_cmd.go","file.line":524},"message":"1st enrollment attempt failed, retrying enrolling to URL: https://076433a0edff4346b220c692d2e9c56a.fleet.us-central1.gcp.cloud.es.io:443/ with exponential backoff (init 1s, max 10s)","ecs.version":"1.6.0"}
Error: fail to enroll: failed to store agent config: could not save enrollment information: could not backup /etc/elastic-agent/agent.yml: rename /etc/elastic-agent/agent.yml /etc/elastic-agent/agent.yml.2024-09-19T17-20-29.5911.bak: device or resource busy
and here is the configmap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-node-datastreams
  namespace: elastic
  labels:
    k8s-app: elastic-agent-k8s-test
data:
  agent.yml: |-
    fleet.enabled: true
    providers.kubernetes_leaderelection.enabled: false
    providers.kubernetes.resources.node.enabled: false
    providers.kubernetes.resources.pod.enabled: false
---
When I did:
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-node-datastreams
  namespace: elastic
  labels:
    k8s-app: elastic-agent-k8s-test
data:
  agent.yml: |-
    fleet:
      enabled: true
    providers.kubernetes_leaderelection.enabled: false
    providers.kubernetes.resources.node.enabled: false
    providers.kubernetes.resources.pod.enabled: false
---
It did not throw the error.
I can't currently test 8.16.0, as my Elastic Cloud cluster is on 8.15.1, which is the latest available.
Hmmm, yep, I think that by default gopkg.in/yaml.v3 (the package used to calculate the diff) does not split keys by dots when unmarshalling YAML into a map[string]interface{}. So it makes sense.
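A quick sketch of what the two spellings look like to that parser once unmarshalled (the annotations are mine, not agent code):
# dotted form: kept as one literal key, i.e. map["fleet.enabled"] = true
fleet.enabled: true
---
# nested form: parsed into nested maps, i.e. map["fleet"]["enabled"] = true,
# which is evidently what the containment check from PR 4166 matches against
fleet:
  enabled: true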
> I can't currently test 8.16.0, as my Elastic Cloud cluster is on 8.15.1, which is the latest available.
No need to, since the solution to this issue is the one you already figured out 🙂
Unfortunately I still haven't been able to get the node watcher turned off either. I tried a bunch of different settings to disable it.
@btrieger after looking at the code I found a really puzzling (to me) piece of validation code here, which seems to always cause the creation of watchers for pods and nodes by default. Now I understand that this might indeed be the desired default behaviour, but there should be a top-level enabled key in the kubernetes provider such that when the user sets it to false, nothing gets enabled. @cmacknz thoughts?
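Something like the following hypothetical shape; to be clear, this is a proposal and not a key the provider honours today:
providers:
  kubernetes:
    enabled: false   # proposed top-level switch; currently not supported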
Now @btrieger, since you mentioned only the nodes permissions and not the pod ones 😄, the following config will make the nodes list error go away, but we can't get rid of both that and the pods watcher at the same time:
providers:
  kubernetes_leaderelection:
    enabled: false
  kubernetes:
    resources:
      pod:
        enabled: true
      node:
        enabled: false
Ah, so there is a bug where I can't disable both, only one? It appeared it also needed get, watch, and list access for namespaces.
> Ah, so there is a bug where I can't disable both, only one? It appeared it also needed get, watch, and list access for namespaces.
Then let's try to go even more aggressive on the disabling 🙂
providers:
  kubernetes_leaderelection:
    enabled: false
  kubernetes:
    add_resource_metadata:
      node:
        enabled: false
      namespace:
        enabled: false
      deployment: false
      cronjob: false
    resources:
      pod:
        enabled: true
      node:
        enabled: false
I will try it again when I am back at my computer. And what if I want to disable pods, namespaces, and nodes? Is that doable, or is there a bug? Essentially I want to just disable the kubernetes provider.
> I will try it again when I am back at my computer. And what if I want to disable pods, namespaces, and nodes? Is that doable, or is there a bug? Essentially I want to just disable the kubernetes provider.
From what I am seeing, currently no, it's not doable. But hey, I missed the fleet.enabled: true above, so 🤞 I missed something here as well!?!
However, I am thinking that even if we manage to disable them at the agent level, the add_kubernetes_metadata processor is enabled by default in filebeat and metricbeat when they are invoked by elastic-agent, and I am afraid that the same permission errors will surface from there. But I am getting ahead of myself 🙂 There is definitely something to discuss with the team here; ty for your understanding and patience.
I got a chance to test it. Looks like I am still getting the errors, but it could be the add_kubernetes_metadata processor, not sure:
{"log.level":"error","@timestamp":"2024-09-20T13:12:37.123Z","message":"Error fetching data for metricset beat.stats: error making http request: Get \"http://unix/stats\": dial unix /usr/share/elastic-agent/state/data/tmp/xTEtpJ7117ppc6OYvJCaYHbDW8mLjXGe.sock: connect: connection refused","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log":{"source":"beat/metrics-monitoring"},"log.origin":{"file.line":256,"file.name":"module/wrapper.go","function":"github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).fetch"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-20T13:12:37.123Z","message":"Error fetching data for metricset beat.stats: error making http request: Get \"http://unix/stats\": dial unix /usr/share/elastic-agent/state/data/tmp/akSPbdqgaHaTY0_J01-dsfYK6JpMz2zn.sock: connect: connection refused","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log":{"source":"beat/metrics-monitoring"},"log.origin":{"file.line":256,"file.name":"module/wrapper.go","function":"github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).fetch"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-20T13:12:38.014Z","message":"W0920 13:12:38.011880 53 reflector.go:539] k8s.io/client-go@v0.29.5/tools/cache/reflector.go:229: failed to list *v1.Node: nodes \"gk3-brieger-autopilot-nap-1s7i913f-4ca63d27-pphd\" is forbidden: User \"system:serviceaccount:elastic:elastic-agent\" cannot list resource \"nodes\" in API group \"\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-20T13:12:38.014Z","message":"E0920 13:12:38.011927 53 reflector.go:147] k8s.io/client-go@v0.29.5/tools/cache/reflector.go:229: Failed to watch *v1.Node: failed to list *v1.Node: nodes \"gk3-brieger-autopilot-nap-1s7i913f-4ca63d27-pphd\" is forbidden: User \"system:serviceaccount:elastic:elastic-agent\" cannot list resource \"nodes\" in API group \"\" at the cluster scope","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-20T13:12:38.021Z","message":"W0920 13:12:38.021120 28 reflector.go:539] k8s.io/client-go@v0.29.5/tools/cache/reflector.go:229: failed to list *v1.Node: nodes \"gk3-brieger-autopilot-nap-1s7i913f-4ca63d27-pphd\" is forbidden: User \"system:serviceaccount:elastic:elastic-agent\" cannot list resource \"nodes\" in API group \"\" at the cluster scope","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"httpjson-default","type":"httpjson"},"log":{"source":"httpjson-default"},"ecs.version":"1.6.0"}
Yep, I think this is coming from metricbeat now: ..."component":{"binary":"metricbeat"...
Yes that is the add_kubernetes_metadata processor which currently can't be turned off. This requires https://github.com/elastic/elastic-agent/issues/4670.
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
A faster way around https://github.com/elastic/elastic-agent/issues/4670, since it is complex, would be to expose the parts we need to configure via an env var. Something similar was done with the add_cloud_metadata processor.
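Purely as an illustration of that idea (the variable name below is hypothetical, not an existing agent setting), it could look something like this on the agent container:
containers:
  - name: elastic-agent
    env:
      # hypothetical knob, sketched after the add_cloud_metadata precedent;
      # it would let users turn the injected add_kubernetes_metadata processor off
      - name: ELASTIC_AGENT_ADD_KUBERNETES_METADATA
        value: "false"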
We have proposed that the customer set automountServiceAccountToken: false in the Kubernetes manifest. This appears to prevent the K8s metadata and providers from starting, and is ideal when the customer does not want to monitor K8s with their agent pods (for example, when they are running an S3/SQS workload).
See the last line in this partial snippet for the location of the addition:
---
# For more information https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-managed-by-fleet.html
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-agent
  namespace: kube-system
  labels:
    app: elastic-agent
spec:
  selector:
    matchLabels:
      app: elastic-agent
  template:
    metadata:
      labels:
        app: elastic-agent
    spec:
      # Tolerations are needed to run Elastic Agent on Kubernetes control-plane nodes.
      # Agents running on control-plane nodes collect metrics from the control plane components (scheduler, controller manager) of Kubernetes
      automountServiceAccountToken: false
The above proposed solution solved the customer's problem, as removing the service account token prevents the provider from starting in the first place.
For confirmed bugs, please report:
I am attempting to deploy Elastic Agent on Kubernetes to run the threat intel integration and other API integrations, and I am receiving errors related to missing permissions in Kubernetes. As I am not running the Kubernetes integration or monitoring Kubernetes itself, I shouldn't need access to the Kubernetes API server to watch nodes, namespaces, and pods. I also should not need to create a lease without the Kubernetes integration.
I am attempting to run Fleet-managed agents on Kubernetes. I have deployed the following YAMLs:
I have also tried with the configmap:
and:
The first time the pod starts it fails with the following error:
After it restarts, the pod runs without lease election but repeatedly throws the errors below:
I would expect the pod not to need to throw an error and restart in order to disable leader election, and I would expect to be able to disable these watchers.