elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Elastic Agent is waiting for initial configuration forever when all providers are disabled #4648

Open eyalkraft opened 2 months ago

eyalkraft commented 2 months ago

When no providers are enabled, Elastic Agent stalls forever. The bug appears to be here: https://github.com/elastic/elastic-agent/blob/0c7212f2d92021a9e008de4abe362d0c77f78638/internal/pkg/composable/controller.go#L188-L203 where there is no way to break out to DEBOUNCE if no provider ever sends an update.

A possible solution could be based on the length of the providers config or on the wait group, which should return immediately in this case because, with no providers, its counter is 0. https://github.com/elastic/elastic-agent/blob/0c7212f2d92021a9e008de4abe362d0c77f78638/internal/pkg/composable/controller.go#L128

The bug was discovered as part of the work on the agentless controller. We currently use a workaround for this issue.

Bug details:

Setup (agent-bug is the directory name which shows up before every command)

➜  test cd agent-bug
➜  agent-bug docker pull docker.elastic.co/beats/elastic-agent:8.13.2
8.13.2: Pulling from beats/elastic-agent
c93e5d1261d3: Pull complete
9204e5b4f4d9: Pull complete
77db57972e9d: Pull complete
5a09faecb150: Pull complete
ef1d475c705b: Pull complete
9be0bc4d4489: Pull complete
97bac83776bc: Pull complete
594c586edc1b: Pull complete
b45e1922fc73: Pull complete
a9c1a4bc09dd: Pull complete
83a50bca82ec: Pull complete
4ca545ee6d5d: Pull complete
Digest: sha256:1b1346f6228c4cfcc8bd6b05e0eb24f15bcfd616935d0f2fffe0754d7d3fe31b
Status: Downloaded newer image for docker.elastic.co/beats/elastic-agent:8.13.2
docker.elastic.co/beats/elastic-agent:8.13.2

Get the original config file

➜  agent-bug docker run --rm -d --name elastic-agent docker.elastic.co/beats/elastic-agent:8.13.2 container
76af237adeb219eb7591c6b7647c3bfe523e4d73121bf1677388720b79e29d85
➜  agent-bug docker cp elastic-agent:/usr/share/elastic-agent/elastic-agent.yml ./
➜  agent-bug docker stop elastic-agent

Modify it to disable all providers

➜  agent-bug cat <<EOF >> elastic-agent.yml
providers:
  agent:
    enabled: false
  docker:
    enabled: false
  env:
    enabled: false
  host:
    enabled: false
  kubernetes:
    enabled: false
  kubernetes_leaderelection:
    enabled: false
  kubernetes_secrets:
    enabled: false
  local:
    enabled: false
  local_dynamic:
    enabled: false
  path:
    enabled: false
EOF

Start the agent with the modified config

➜  agent-bug docker run --rm -d --name elastic-agent -v ./elastic-agent.yml:/usr/share/elastic-agent/elastic-agent.yml docker.elastic.co/beats/elastic-agent:8.13.2 container
decb5e6e6d443e11d228a9ff32e8a9ad3ed78f466b7c830e85aec0a3818b9aa4

Enter the container

➜  agent-bug docker exec -it elastic-agent /bin/bash

Agent stuck on waiting for initial configuration

elastic-agent@decb5e6e6d44:~$ elastic-agent status
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   └─ status: (STARTING) Waiting for initial configuration and composable variables

... waiting ...

elastic-agent@decb5e6e6d44:~$ elastic-agent status
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   └─ status: (STARTING) Waiting for initial configuration and composable variables

Agent configuration (for some reason inspect stalls so I have to kill it)

elastic-agent@decb5e6e6d44:~$ elastic-agent inspect
agent:
  logging:
    to_stderr: true
inputs:
- data_stream.namespace: default
  id: unique-system-metrics-input
  streams:
  - data_stream.dataset: system.cpu
    metricsets:
    - cpu
  - data_stream.dataset: system.memory
    metricsets:
    - memory
  - data_stream.dataset: system.network
    metricsets:
    - network
  - data_stream.dataset: system.filesystem
    metricsets:
    - filesystem
  type: system/metrics
  use_output: default
outputs:
  default:
    hosts: http://elasticsearch:9200
    password: changeme
    preset: balanced
    type: elasticsearch
    username: elastic
providers:
  agent:
    enabled: false
  docker:
    enabled: false
  env:
    enabled: false
  host:
    enabled: false
  kubernetes:
    enabled: false
  kubernetes_leaderelection:
    enabled: false
  kubernetes_secrets:
    enabled: false
  local:
    enabled: false
  local_dynamic:
    enabled: false
  path:
    enabled: false

^CError: could not load agent info: could not get agent info from store: failed to load from ioStore: failed to ensure key during encrypted disk store Load: could not get agent key: failed to acquire exclusive lock: /usr/share/elastic-agent/state/vault/.lock, err: context canceled
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.13/fleet-troubleshooting.html
elastic-agent@decb5e6e6d44:~$ exit

Cleanup

➜  agent-bug docker stop elastic-agent
elastic-agent

elasticmachine commented 2 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

cmacknz commented 2 months ago

I don't think we can just exit the controller, I think we have to send an empty variables update.

The coordinator is watching for variable updates and that triggers a component model update:

https://github.com/elastic/elastic-agent/blob/5fe8debfade598f0d3bd358bb35f7f69ba5b3664/internal/pkg/agent/application/coordinator/coordinator.go#L405-L408

https://github.com/elastic/elastic-agent/blob/5fe8debfade598f0d3bd358bb35f7f69ba5b3664/internal/pkg/agent/application/coordinator/coordinator.go#L1034-L1037

https://github.com/elastic/elastic-agent/blob/5fe8debfade598f0d3bd358bb35f7f69ba5b3664/internal/pkg/agent/application/coordinator/coordinator.go#L1149-L1158

It seems like we ignore configuration changes until we get at least one variables update which seems to be what is causing this:

https://github.com/elastic/elastic-agent/blob/5fe8debfade598f0d3bd358bb35f7f69ba5b3664/internal/pkg/agent/application/coordinator/coordinator.go#L1076-L1080
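
To make the gating described above concrete, here is a minimal sketch of a coordinator that refuses to act on a configuration until it has seen at least one variables update. Every name in it (`coordinator`, `onConfig`, `onVars`, `haveVars`) is hypothetical and chosen for illustration only; it is not the code behind the links above, just a simplified model of the behavior they describe.

```go
package main

import "fmt"

// coordinator is a hypothetical, heavily simplified stand-in for the real
// coordinator. None of these names come from the elastic-agent source.
type coordinator struct {
	cfg      *string  // latest config, nil until one arrives
	vars     []string // latest composable variables
	haveVars bool     // true once any variables update arrived, even an empty one
	applied  int      // number of times the component model was rebuilt
}

// refresh rebuilds the component model only when both a config and at least
// one variables update have been seen. It reports whether it did so.
func (c *coordinator) refresh() bool {
	if c.cfg == nil || !c.haveVars {
		// Still "waiting for initial configuration and composable variables".
		return false
	}
	c.applied++
	return true
}

// onConfig records a new config and tries to refresh.
func (c *coordinator) onConfig(cfg string) bool {
	c.cfg = &cfg
	return c.refresh()
}

// onVars records a variables update (possibly empty) and tries to refresh.
func (c *coordinator) onVars(vars []string) bool {
	c.vars = vars
	c.haveVars = true
	return c.refresh()
}

func main() {
	c := &coordinator{}
	// A config alone is not enough: no variables update has arrived yet.
	fmt.Println("applied after config:", c.onConfig("inputs: []"))
	// An empty variables update is sufficient to unblock the refresh.
	fmt.Println("applied after empty vars:", c.onVars(nil))
}
```

In this model the first call reports `false` and the second reports `true`, which is why sending an empty variables update, rather than exiting the controller, would be enough to get past the initial wait.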

eyalkraft commented 2 months ago

@cmacknz Yes this sounds reasonable 👍

By suggesting a solution based on the providers config length or the wait group, I didn't mean that we should exit the controller.

What I meant was that wg.Wait() wouldn't block but would instead return immediately. Unfortunately, you can't use it in a select, since it doesn't return a channel:

        select { 
        case <- wg.Wait(): 

But along the lines of what you suggest we could do something like

    if len(c.contextProviders) + len(c.dynamicProviders) == 0 {
        // no providers, fake a state change to trigger the initial update
        stateChangedChan <- true
    }

before the debounce logic.

cmacknz commented 2 months ago

👍 We have this in our queue to fix sometime in the next month since it seems like you aren't urgently blocked on this.

If that doesn't work, or you or your team want to try fixing this yourselves, let us know.