Consider collecting (by-default) all underlying host's processes in K8s

ChrsMark commented 4 months ago

In standalone Agent on K8s the process data stream is enabled by default: https://github.com/elastic/elastic-agent/blob/6aa581cbec8e6f8063571048e52a3b9f0b352c80/deploy/kubernetes/elastic-agent-standalone/elastic-agent-standalone-daemonset-configmap.yaml#L492

However, it does not collect the underlying host's processes.

Would it make sense to collect the underlying system's processes (and possibly metrics) instead of those in the Agent container's scope?

I tried the following:

- id: system/metrics-system.process-52c2cd5b-0cff-4060-b0ad-a2f533124165
  data_stream:
    type: metrics
    dataset: system.process
  metricsets:
    - process
  period: 10s
  hostfs: "/hostfs"
  process.include_top_n.by_cpu: 5
  process.include_top_n.by_memory: 5
  process.cmdline.cache.enabled: true
  process.cgroups.enabled: false
  process.include_cpu_ticks: false
  processes:
    - .*

(note the hostfs: "/hostfs" part) to get the desired result:

[image: k8ssystemprocesses]

After adding the hostfs: "/hostfs" setting I could see the processes of the underlying host, such as kubelet. We can consider whether this should be the default, or at least make the switch easier for users with and/or commented-out sections.
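As a rough sketch of the commented-out option (an illustration only, not what the manifest currently ships), the default input could carry the setting disabled but ready to enable. The comment wording is an assumption; every key is copied from the config above:

```yaml
- id: system/metrics-system.process-52c2cd5b-0cff-4060-b0ad-a2f533124165
  data_stream:
    type: metrics
    dataset: system.process
  metricsets:
    - process
  period: 10s
  # Uncomment to collect the processes of the underlying host instead of
  # only those visible inside the Agent container. Assumes the host's
  # filesystem is mounted under /hostfs in the Agent Pod.
  # hostfs: "/hostfs"
  process.include_top_n.by_cpu: 5
  process.include_top_n.by_memory: 5
  processes:
    - .*
```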

/cc @flash1293 @gizas

ref: https://www.elastic.co/guide/en/beats/metricbeat/current/running-on-docker.html#monitoring-host
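For context, the hostfs setting only helps if the host's filesystem is actually visible inside the Agent container. A minimal sketch of the DaemonSet side, assuming the host's /proc and /sys/fs/cgroup are exposed through hostPath volumes under /hostfs (the container and volume names are illustrative and should be checked against the manifest in use):

```yaml
# Fragment of a DaemonSet Pod spec; names are illustrative assumptions.
spec:
  template:
    spec:
      containers:
        - name: elastic-agent-standalone
          volumeMounts:
            # Host /proc, read by the process metricset via hostfs
            - name: proc
              mountPath: /hostfs/proc
              readOnly: true
            # Host cgroup hierarchy, used for cgroup-based metrics
            - name: cgroup
              mountPath: /hostfs/sys/fs/cgroup
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: cgroup
          hostPath:
            path: /sys/fs/cgroup
```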

flash1293 commented 4 months ago

Thanks for checking, @ChrsMark - this looks helpful. Another bit to consider here - this won't allow you to somehow tie the process back to Kubernetes concepts, right? E.g. telling which container the process is about or something like this.

christos68k commented 4 months ago

> this won't allow you to somehow tie the process back to Kubernetes concepts, right? E.g. telling which container the process is about or something like this.

Maybe it's useful to note that we have the plumbing in place for this in ebpf-k8s-agent except we're not collecting/enriching every process on the target host, but only those associated with network flows.

ChrsMark commented 4 months ago

@flash1293 I don't think this is supported today by Metricbeat, but I could potentially see it handled by https://www.elastic.co/guide/en/beats/metricbeat/current/add-kubernetes-metadata.html. This would require some research to check whether it's doable. The idea here is that we want the process-related metrics to be associated with containers and Pods. Maybe that's possible by leveraging the cgroup information, but I'm only guessing here :).

If what @christos68k suggests (or something similar) can cover the case, then that would also be great.

flash1293 commented 4 months ago

Thanks @christos68k and @ChrsMark - seems like a somewhat high-hanging fruit for now. I think without this capability we can't produce good suggestions, as we can't tell the user which containers to annotate and also (probably even more important) won't be able to tell whether they have been instrumented already.

cmacknz commented 3 months ago

+1 to this being the default. Simply seeing the processes running inside the Metricbeat or Elastic Agent container is not useful at all. Almost everyone turning this metricset on will want to see the set of processes on the node.

Additionally the processes should be correlated to their relevant Kubernetes resource types. There is some additional context on the state of this in an internal issue from our cloud SRE team. That issue shows that this correlation does not work when the cluster uses the containerd runtime, which is increasingly the default. It might work when the runtime is Docker.

flash1293 commented 3 months ago

Thanks for this link @cmacknz - am I understanding right that there are two parts missing here to enable this:

1. Finalizing https://github.com/elastic/elastic-agent/issues/4670 (seems harder than it originally looked, judging from the discussion in the issue)
2. Changing the default template to collect all processes and also enrich them with k8s metadata (using a script processor to also make it work on containerd)

If this is the case, I think we should go for it, as it will be a very nice feature in general and also help the auto-detection part of onboarding a lot as processes are very good signals to tell what kind of workload is running.

FYI @thomheymann @akhileshpok

flash1293 commented 3 months ago

This is also important for the OTel Collector; we should do it for both.

gizas commented 3 months ago

Hello, summarising the issue:

1. We need an update with what @ChrsMark suggested in https://github.com/elastic/elastic-agent/blob/main/deploy/kubernetes/elastic-agent-standalone/elastic-agent-standalone-daemonset-configmap.yaml#L491-L504 to have it by default in the standalone agent manifests. We can make sure that the kustomize templates also include it.
2. We need an update to the managed agent to include the hostfs setting somewhere here?
3. Do we have an issue that tracks any work for the comment here?

cc @thomheymann

flash1293 commented 3 months ago

> Do we have an issue that tracks any work for the comment https://github.com/elastic/elastic-agent/issues/5256#issuecomment-2270370733?

@gizas I don't think so, could you create that one?

gizas commented 3 months ago

@flash1293 https://github.com/elastic/beats/issues/40495 is the issue for the processor enhancement. As already said, https://github.com/elastic/elastic-agent/issues/4670 is a prerequisite.

gizas commented 3 months ago

@flash1293 elastic/beats#40495 is the issue for the processor enhancement. As already said, #4670 is a prerequisite.

@ChrsMark the above issue will track the work on agent side for the integrations.

For OTel we will now need to track the same effort and analysis with the host receiver and enrichment there (with k8s attributes). Do we have something relevant with OTel Elastic Agents? I think we need a new issue in opentelemetry-dev.

gizas commented 3 months ago

@graphaelli FYI we have added this story in the backlog.

https://github.com/elastic/elastic-agent/issues/4670 is a prerequisite for the story to happen. That is why we have not prioritised it in this iteration.

Mainly we will need a) to collect the host processes and b) to enhance them with k8s metadata.

So for a), the collection side, the standalone agent templates will need to include the fixes (we have this story and https://github.com/elastic/elastic-agent/issues/5289 to track it and not miss it), and on the managed agent side the system integration will need to be updated (see comment).

For b), the metadata enhancement, https://github.com/elastic/beats/issues/40495 is the issue to track the work.
