elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Observability Kubernetes Onboarding doesn't ship data #5613

Open flash1293 opened 1 month ago

flash1293 commented 1 month ago

Following the Kubernetes onboarding flow on serverless (Add data > Monitor Infrastructure > Kubernetes) doesn't ship data. This can be reproduced on a serverless observability project and was tested with minikube running on Mac.

The logs show lots of errors like this:

{"log.level":"error","@timestamp":"2024-09-25T08:15:35.683Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":665},"message":"Unit state changed kubernetes/metrics-default-kubernetes-node-metrics-kubernetes-bcdc9c26-d274-4db3-95e0-0bb396fdd402 (STARTING->FAILED): Failed: pid '295' exited with code '-1'","log":{"source":"elastic-agent"},"component":{"id":"kubernetes/metrics-default","state":"FAILED"},"unit":{"id":"kubernetes/metrics-default-kubernetes-node-metrics-kubernetes-bcdc9c26-d274-4db3-95e0-0bb396fdd402","type":"input","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
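
When the agent log is full of entries like the one above, it can help to pull out only the unit state changes that ended in FAILED. The following is a minimal sketch, assuming the log is available as one JSON object per line; the helper name `failed_units` is hypothetical, and the sample line is rebuilt from the fields of the entry above.

```python
import json

# Unit id taken from the log entry quoted in this issue.
UNIT_ID = (
    "kubernetes/metrics-default-kubernetes-node-metrics-kubernetes-"
    "bcdc9c26-d274-4db3-95e0-0bb396fdd402"
)


def failed_units(lines):
    """Yield (unit_id, old_state, message) for each log line whose unit is FAILED."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if not isinstance(entry, dict):
            continue
        unit = entry.get("unit")
        if isinstance(unit, dict) and unit.get("state") == "FAILED":
            yield unit.get("id"), unit.get("old_state"), entry.get("message")


# Trimmed copy of the log entry above (only the fields the helper reads).
sample = json.dumps({
    "log.level": "error",
    "message": "Unit state changed " + UNIT_ID
               + " (STARTING->FAILED): Failed: pid '295' exited with code '-1'",
    "unit": {"id": UNIT_ID, "type": "input",
             "state": "FAILED", "old_state": "STARTING"},
})

for unit_id, old_state, message in failed_units([sample]):
    print(unit_id, old_state, "->", "FAILED")
```

This only surfaces which units failed and from which state; as with the raw log line, it does not reveal why the process exited.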

It's possible this is a problem on the Kibana side of the flow as well. I'm starting here for troubleshooting, and we can move the issue if it's unrelated.

A suspicion is that this is related to resourcing and the agent now needs more memory, but this needs to be confirmed.

elasticmachine commented 1 month ago

Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)

MichaelKatsoulis commented 1 month ago

This log error does not say anything about the reason it crashed. We would need to reproduce the environment and check the diagnostics and the agent pod's resource consumption.

flash1293 commented 1 month ago

@MichaelKatsoulis

I started a local minikube cluster, then followed the onboarding flow from a fresh Observability serverless project on prod.

MichaelKatsoulis commented 1 month ago

I replicated the scenario:

  1. Kind cluster with 38 pods and 3 nodes
  2. Fresh serverless project

I followed the instructions for monitoring Kubernetes as if I were a first-time user.

I noticed the following:

  1. The kustomize command attempts to override the Elasticsearch host by setting

    -e "s/%ES_HOST%/https:\/\/katsoulis-serverless-f68892.es.us-east-1.aws.elastic.cloud/g"

    In the absence of a port, Elastic-Agent appends the default port, 9200, so ES_HOST ends up as https://katsoulis-serverless-f68892.es.us-east-1.aws.elastic.cloud:9200. This leads to connection refused. To work around this, we need to set ES_HOST with an explicit port:

    -e "s/%ES_HOST%/https:\/\/katsoulis-serverless-f68892.es.us-east-1.aws.elastic.cloud:443/g"
  2. Elastic-Agent starts successfully and data are flowing.

  3. The first thing a user sees is a link to a dashboard which does not exist!

  4. In Discover we can see metrics and logs.

  5. After some minutes we see the first restart of one of the agent's pods.

  6. The reason is OOMKilled.
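
The OOM check in the last two steps can also be done from the pod's JSON (for example the output of `kubectl get pod <name> -o json`) instead of by eye. A minimal sketch, assuming the standard `status.containerStatuses` layout; the helper name and the sample pod below are hypothetical and only illustrate the shape of the data:

```python
def oom_killed_containers(pod):
    """Return names of containers whose current or previous state ended in OOMKilled."""
    names = []
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    for status in statuses:
        # "state" is the current state, "lastState" the one before the last restart.
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(status.get("name"))
                break
    return names


# Hypothetical pod status: one container restarted after an OOM kill, one healthy.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "elastic-agent",
                "restartCount": 1,
                "state": {"running": {}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {
                "name": "sidecar",
                "restartCount": 0,
                "state": {"running": {}},
                "lastState": {},
            },
        ]
    }
}

print(oom_killed_containers(sample_pod))
```

A non-empty result for the agent pod would match what was observed here: the container is running again, but its last termination was an OOM kill.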

Conclusion:

Restarts: As per my analysis and tests in https://github.com/elastic/elastic-agent/issues/4729#issuecomment-2355352224, elastic-agent 8.15.1 with the Kubernetes and System integrations needs more than 700 MB of memory.
So the memory limit is set too low, which causes the restarts.
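
To compare a manifest's memory limit against that figure, the Kubernetes quantity string has to be converted to bytes first. A rough sketch, assuming the "more than 700 MB" figure is read as 700 MiB and supporting only the common suffixes; the function names and the limit values in the example are hypothetical, not taken from the onboarding manifest:

```python
# Multipliers for a subset of Kubernetes memory suffixes (binary and decimal).
_UNITS = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}

# Assumption: the ">700 MB" requirement from the analysis above, read as MiB.
REQUIRED_BYTES = 700 * 1024**2


def parse_memory(quantity):
    """Convert a quantity such as '700Mi' or '1G' to bytes (plain numbers are bytes)."""
    quantity = quantity.strip()
    # Check two-letter suffixes before one-letter ones.
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _UNITS[suffix])
    return int(quantity)


def limit_too_low(limit, required_bytes=REQUIRED_BYTES):
    """True if the limit does not leave room above the required amount."""
    return parse_memory(limit) <= required_bytes


for limit in ("700Mi", "1Gi"):  # hypothetical limit values
    print(limit, "too low" if limit_too_low(limit) else "ok")
```

Because the agent needs more than the stated amount, a limit equal to the threshold is treated as too low.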

Dashboard: Should the Kubernetes integration, which contains the assets, also be installed under the hood?

ES_HOST: We should always set the Elasticsearch port explicitly, because if it is not set, the agent appends 9200.
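
One way to make sure the host always carries a port before it is substituted into the manifest is to normalize the URL first. This is a minimal sketch of that idea, not the Kibana implementation; the helper name `with_explicit_port` is hypothetical, and it assumes the scheme's standard port (443 for https, 80 for http) is the right one, as it was for the serverless endpoint in this test:

```python
from urllib.parse import urlsplit, urlunsplit

# Standard ports per scheme; assumed to be correct for the target deployment.
_DEFAULT_PORTS = {"https": 443, "http": 80}


def with_explicit_port(url):
    """Return url unchanged if it has a port, else add the scheme's standard port."""
    parts = urlsplit(url)
    if parts.port is not None:
        return url
    if parts.scheme not in _DEFAULT_PORTS:
        raise ValueError(f"unsupported scheme in {url!r}")
    # Drop a dangling ':' (empty port) before appending the real port.
    netloc = f"{parts.netloc.rstrip(':')}:{_DEFAULT_PORTS[parts.scheme]}"
    return urlunsplit(parts._replace(netloc=netloc))


host = "https://katsoulis-serverless-f68892.es.us-east-1.aws.elastic.cloud"
print(with_explicit_port(host))
```

With the port made explicit, the agent has nothing to append, so the 9200 default never comes into play.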

flash1293 commented 1 month ago

Thanks for the investigation, @MichaelKatsoulis!

> Restarts: As per my analysis and tests in https://github.com/elastic/elastic-agent/issues/4729#issuecomment-2355352224 in version 8.15.1 elastic-agent with Kubernetes and system integration needs more than 700Mb of memory. So the limit is set low causing restarts.

I guess this is something that needs to be changed on the elastic-agent side, right?

> Should also Kubernetes Integration be installed under the hood which contains the assets?

Good catch. It seems the ID of the dashboard changed in this PR: https://github.com/elastic/integrations/pull/10593. We should fix it short-term, but we need to think about how we can make this whole process more stable.

> We should always set the port of Elasticsearch because if not set, agent appends 9200.

I see, I think in a previous version it would append it, but the config value we pull this from changed. We can fix this on the Kibana side as well.