elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.
Other
133 stars 142 forks source link

8.5.1 agents go unhealthy #1790

Closed pjbertels closed 1 year ago

pjbertels commented 1 year ago

Issues encountered during Fleet Scaling testing with drones and a subset of real VMs.

Version: 8.5.1

Operating System: Linux Ubuntu VM (e2-standard-8)

Steps to Reproduce: We used some tooling to bring up 200 VMs and 9800 Horde drones, some VM's report errors in the logs on the way up and take longer to come up. Once we begin testing some VMs go unhealthy(33/200).

pjbertels commented 1 year ago

Here are the logs from one of the VMs:

10:54:00.981
elastic_agent
[elastic_agent][info] APM instrumentation disabled
10:54:00.983
elastic_agent
[elastic_agent][info] Detecting execution mode
10:54:00.983
elastic_agent
[elastic_agent][info] Agent is managed locally
10:54:00.983
elastic_agent
[elastic_agent][info] capabilities file not found in /opt/Elastic/Agent/capabilities.yml
10:54:01.397
elastic_agent
[elastic_agent][info] Docker provider skipped, unable to connect: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
10:54:01.398
elastic_agent
[elastic_agent][info] Starting stats endpoint
10:54:01.398
elastic_agent
[elastic_agent][info] Agent is starting
10:54:01.399
elastic_agent
[elastic_agent][info] Metrics endpoint listening on: /opt/Elastic/Agent/data/tmp/elastic-agent.sock (configured: unix:///opt/Elastic/Agent/data/tmp/elastic-agent.sock)
10:54:01.399
elastic_agent
[elastic_agent][info] Agent is stopped
10:54:01.399
elastic_agent
[elastic_agent][info] Configuration changes detected
10:54:01.404
elastic_agent
[elastic_agent][info] Source URI reset from "https://artifacts.elastic.co/downloads/" to "https://artifacts.elastic.co/downloads/"
10:54:01.404
elastic_agent
[elastic_agent][info] New State ID is ppTx2cPe
10:54:01.404
elastic_agent
[elastic_agent][info] Converging state requires execution of 2 step(s)
10:54:02.095
elastic_agent
[elastic_agent][info] reexec Shutdown channel triggered
10:54:02.097
elastic_agent
[elastic_agent][info] Restarting Elastic Agent
10:54:02.097
elastic_agent
[elastic_agent][info] waiting for installer of pipeline 'default' to finish
10:54:02.104
elastic_agent
[elastic_agent][info] Signaling application to stop because of shutdown: metricbeat--8.5.1
10:54:02.380
elastic_agent
[elastic_agent][info] APM instrumentation disabled
10:54:02.382
elastic_agent
[elastic_agent][info] Detecting execution mode
10:54:02.398
elastic_agent
[elastic_agent][info] Agent is managed by Fleet
10:54:02.398
elastic_agent
[elastic_agent][info] capabilities file not found in /opt/Elastic/Agent/capabilities.yml
10:54:02.791
elastic_agent
[elastic_agent][info] Docker provider skipped, unable to connect: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
10:54:02.791
elastic_agent
[elastic_agent][info] Starting stats endpoint
10:54:02.792
elastic_agent
[elastic_agent][info] Agent is starting
10:54:02.792
elastic_agent
[elastic_agent][info] Metrics endpoint listening on: /opt/Elastic/Agent/data/tmp/elastic-agent.sock (configured: unix:///opt/Elastic/Agent/data/tmp/elastic-agent.sock)
10:54:04.774
elastic_agent
[elastic_agent][info] Source URI changed from "https://artifacts.elastic.co/downloads/" to "https://artifacts.elastic.co/downloads/"
10:54:04.775
elastic_agent
[elastic_agent][info] New State ID is Nk2lL5Z-
10:54:04.775
elastic_agent
[elastic_agent][info] Converging state requires execution of 4 step(s)
10:54:16.972
elastic_agent
[elastic_agent][info] 2022-11-24T15:54:16Z - message: Application: endpoint-security--8.5.1[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
10:54:26.527
elastic_agent
[elastic_agent][info] 2022-11-24T15:54:26Z - message: Application: filebeat--8.5.1[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
10:54:27.727
elastic_agent
[elastic_agent][info] 2022-11-24T15:54:27Z - message: Application: filebeat--8.5.1[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to RUNNING: Running - type: 'STATE' - sub_type: 'RUNNING'
10:54:36.134
elastic_agent
[elastic_agent][info] 2022-11-24T15:54:36Z - message: Application: metricbeat--8.5.1[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
10:54:36.377
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for filebeat.8.5.1
10:54:36.866
elastic_agent
[elastic_agent][info] 2022-11-24T15:54:36Z - message: Application: filebeat--8.5.1--36643631373035623733363936343635[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
10:54:37.178
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for metricbeat.8.5.1
10:54:37.331
elastic_agent
[elastic_agent][info] 2022-11-24T15:54:37Z - message: Application: metricbeat--8.5.1--36643631373035623733363936343635[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
10:54:37.334
elastic_agent
[elastic_agent][info] Updating internal state
10:54:37.399
elastic_agent
[elastic_agent][info] 2022-11-24T15:54:37Z - message: Application: metricbeat--8.5.1[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to RUNNING: Running - type: 'STATE' - sub_type: 'RUNNING'
10:54:38.081
elastic_agent
[elastic_agent][info] 2022-11-24T15:54:38Z - message: Application: filebeat--8.5.1--36643631373035623733363936343635[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to RUNNING: Running - type: 'STATE' - sub_type: 'RUNNING'
10:54:38.597
elastic_agent
[elastic_agent][info] 2022-11-24T15:54:38Z - message: Application: metricbeat--8.5.1--36643631373035623733363936343635[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to RUNNING: Running - type: 'STATE' - sub_type: 'RUNNING'
10:54:46.144
elastic_agent
[elastic_agent][info] 2022-11-24T15:54:46Z - message: Application: endpoint-security--8.5.1[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to RUNNING: Protecting with policy {4d2eac66-89e2-492a-a803-220d65fc0f70} - type: 'STATE' - sub_type: 'RUNNING'
12:00:33.696
elastic_agent
[elastic_agent][info] Source URI changed from "https://artifacts.elastic.co/downloads/" to "https://artifacts.elastic.co/downloads/"
12:00:33.696
elastic_agent
[elastic_agent][info] Source URI in operator changed to "https://artifacts.elastic.co/downloads/"
12:00:33.698
elastic_agent
[elastic_agent][info] New State ID is 1vpjk-gr
12:00:33.698
elastic_agent
[elastic_agent][info] Converging state requires execution of 4 step(s)
12:00:33.850
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for endpoint-security.8.5.1
12:00:33.850
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for endpoint-security.8.5.1
12:00:34.084
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for filebeat.8.5.1
12:00:34.084
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for filebeat.8.5.1
12:00:34.367
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for metricbeat.8.5.1
12:00:34.367
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for metricbeat.8.5.1
12:00:34.606
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for filebeat.8.5.1
12:00:34.606
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for filebeat.8.5.1
12:00:34.889
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for metricbeat.8.5.1
12:00:34.889
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for metricbeat.8.5.1
12:00:34.894
elastic_agent
[elastic_agent][info] Updating internal state
12:50:19.066
elastic_agent
[elastic_agent][info] Source URI changed from "https://artifacts.elastic.co/downloads/" to "https://artifacts.elastic.co/downloads/"
12:50:19.067
elastic_agent
[elastic_agent][info] Source URI in operator changed to "https://artifacts.elastic.co/downloads/"
12:50:19.068
elastic_agent
[elastic_agent][info] New State ID is lRsBU2cL
12:50:19.069
elastic_agent
[elastic_agent][info] Converging state requires execution of 4 step(s)
12:50:19.219
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for endpoint-security.8.5.1
12:50:19.219
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for endpoint-security.8.5.1
12:50:19.471
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for filebeat.8.5.1
12:50:19.471
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for filebeat.8.5.1
12:50:19.756
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for metricbeat.8.5.1
12:50:19.756
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for metricbeat.8.5.1
12:50:19.900
elastic_agent
[elastic_agent][error] Elastic Agent status changed to "error": "app filebeat--8.5.1-187f2c97: 1 error occurred:\n\t* 1 error: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::65765-2049, Finished: false, Fileinfo: &{auth.log 47665 416 {163532742 63804908720 0x560688b86220} {2049 65765 1 33184 104 4 0 0 47665 4096 104 {1669311064 260418006} {1669311920 163532742} {1669311920 163532742} [0 0 0]}}, Source: /var/log/auth.log, Offset: 48765, Timestamp: 2022-11-24 17:47:20.414186027 +0000 UTC m=+6773.844941457, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 65765-2049}\n\n"
12:50:19.901
elastic_agent
[elastic_agent][error] 2022-11-24T17:50:19Z - message: Application: filebeat--8.5.1[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to FAILED: 1 error occurred:
    * 1 error: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::65765-2049, Finished: false, Fileinfo: &{auth.log 47665 416 {163532742 63804908720 0x560688b86220} {2049 65765 1 33184 104 4 0 0 47665 4096 104 {1669311064 260418006} {1669311920 163532742} {1669311920 163532742} [0 0 0]}}, Source: /var/log/auth.log, Offset: 48765, Timestamp: 2022-11-24 17:47:20.414186027 +0000 UTC m=+6773.844941457, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 65765-2049}

 - type: 'ERROR' - sub_type: 'FAILED'
12:50:19.996
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for filebeat.8.5.1
12:50:19.996
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for filebeat.8.5.1
12:50:20.288
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for metricbeat.8.5.1
12:50:20.289
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for metricbeat.8.5.1
12:50:20.291
elastic_agent
[elastic_agent][info] Updating internal state
12:50:20.389
elastic_agent
[elastic_agent][error] lazy acker: failed ack batch, enqueue for retry: []fleetapi.Action{(*fleetapi.ActionPolicyChange)(0xc0006be1e0)}
12:50:22.803
elastic_agent
[elastic_agent][info] Source URI changed from "https://artifacts.elastic.co/downloads/" to "https://artifacts.elastic.co/downloads/"
12:50:22.803
elastic_agent
[elastic_agent][info] Source URI in operator changed to "https://artifacts.elastic.co/downloads/"
12:50:22.805
elastic_agent
[elastic_agent][info] New State ID is lRsBU2cL
12:50:22.805
elastic_agent
[elastic_agent][info] Converging state requires execution of 3 step(s)
12:50:23.037
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for filebeat.8.5.1
12:50:23.037
elastic_agent
[elastic_agent][warn] Elastic Agent status changed to "online": ""
12:50:23.037
elastic_agent
[elastic_agent][info] 2022-11-24T17:50:23Z - message: Application: filebeat--8.5.1[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
12:50:23.038
elastic_agent
[elastic_agent][info] 2022-11-24T17:50:23Z - message: Application: filebeat--8.5.1[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to RESTARTING: Restarting - type: 'STATE' - sub_type: 'STARTING'
12:50:23.223
elastic_agent
[elastic_agent][error] filebeat stderr: "Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid 4474\n"
12:50:23.228
elastic_agent
[elastic_agent][info] 2022-11-24T17:50:23Z - message: Application: filebeat--8.5.1[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to RESTARTING: exited with code: 1 - type: 'STATE' - sub_type: 'STARTING'
12:50:23.228
elastic_agent
[elastic_agent][info] 2022-11-24T17:50:23Z - message: Application: filebeat--8.5.1[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
12:50:23.228
elastic_agent
[elastic_agent][info] 2022-11-24T17:50:23Z - message: Application: filebeat--8.5.1[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to RESTARTING: Restarting - type: 'STATE' - sub_type: 'STARTING'
12:50:23.345
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for metricbeat.8.5.1
12:50:23.345
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for metricbeat.8.5.1
12:50:23.415
elastic_agent
[elastic_agent][error] filebeat stderr: "Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid 4474\n"
12:50:23.591
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for filebeat.8.5.1
12:50:23.591
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for filebeat.8.5.1
12:50:23.882
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for metricbeat.8.5.1
12:50:23.883
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for metricbeat.8.5.1
12:50:23.886
elastic_agent
[elastic_agent][info] Updating internal state
12:51:23.490
elastic_agent
[elastic_agent][warn] Elastic Agent status changed to "degraded": "app filebeat--8.5.1-187f2c97: Missed last check-in"
12:51:23.490
elastic_agent
[elastic_agent][info] 2022-11-24T17:51:23Z - message: Application: filebeat--8.5.1[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]: State changed to DEGRADED: Missed last check-in - type: 'STATE' - sub_type: 'RUNNING'
12:52:23.497
elastic_agent
[elastic_agent][error] Elastic Agent status changed to "error": "app filebeat--8.5.1-187f2c97: Missed two check-ins"
12:52:23.497
elastic_agent
[elastic_agent][error] 2022-11-24T17:52:23Z - message: Application: filebeat--8.5.1[a24d0a4b-42ca-4fd2-ae6e-59383a6781dc]:
pjbertels commented 1 year ago

Here is the status:

ubuntu@ogc-e81cb331-elastic-agent-ubuntu:~$ sudo elastic-agent status
Status: FAILED
Message: app filebeat--8.5.1-187f2c97: Missed two check-ins
Applications:
  * endpoint-security      (HEALTHY)
                           Protecting with policy {4d2eac66-89e2-492a-a803-220d65fc0f70}
  * filebeat               (FAILED)
                           Missed two check-ins
  * metricbeat             (HEALTHY)
                           Running
  * filebeat_monitoring    (HEALTHY)
                           Running
  * metricbeat_monitoring  (HEALTHY)
                           Running
cmacknz commented 1 year ago
[elastic_agent][error] Elastic Agent status changed to "error": "app filebeat--8.5.1-187f2c97: 1 error occurred:\n\t* 1 error: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::65765-2049, Finished: false, Fileinfo: &{auth.log 47665 416 {163532742 63804908720 0x560688b86220} {2049 65765 1 33184 104 4 0 0 47665 4096 104 {1669311064 260418006} {1669311920 163532742} {1669311920 163532742} [0 0 0]}}, Source: /var/log/auth.log, Offset: 48765, Timestamp: 2022-11-24 17:47:20.414186027 +0000 UTC m=+6773.844941457, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 65765-2049}\n\n"
12:50:19.901

@belimawr has a bug for this problem tracked elsewhere, I couldn't find the link looking quickly. Does Filebeat eventually recover from this? That is usually what we observe. If it doesn't we likely need to raise the severity of the bug.

cmacknz commented 1 year ago

Found the relevant bug https://github.com/elastic/beats/issues/33653

ceeeekay commented 1 year ago

Might be unrelated, so I'm just posting to add a data point, but I've been chasing the same "operation skipped" loop messages for the last day or so. The difference in my case is that I can get new agents to start up if they enrol with an Elasticsearch output, but not with a Logstash output - they will sit forever and never reach healthy.

Once they've enrolled with an Elasticsearch output I can switch back to Logstash and everything seems fine, unless I restart an agent at which point I'm back in the same unhealthy / startup loop.

If I then switch back to Elasticsearch output, the unhealthy agents all come back on line.

pjbertels commented 1 year ago

I also observed the issue with 8.5.2 agents. I left this setup running over the weekend and it did not recover, my agents are still marked by Fleet as unhealthy. The problem seems to happen when ever I make a change that results in a lot of network activity. Often it is a when moving the agents between policies.

10:10:42.102
elastic_agent
[elastic_agent][info] APM instrumentation disabled
10:10:42.104
elastic_agent
[elastic_agent][info] Detecting execution mode
10:10:42.105
elastic_agent
[elastic_agent][info] Agent is managed locally
10:10:42.105
elastic_agent
[elastic_agent][info] capabilities file not found in /opt/Elastic/Agent/capabilities.yml
10:10:42.676
elastic_agent
[elastic_agent][info] Docker provider skipped, unable to connect: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
10:10:42.677
elastic_agent
[elastic_agent][info] Starting stats endpoint
10:10:42.677
elastic_agent
[elastic_agent][info] Agent is starting
10:10:42.678
elastic_agent
[elastic_agent][info] Metrics endpoint listening on: /opt/Elastic/Agent/data/tmp/elastic-agent.sock (configured: unix:///opt/Elastic/Agent/data/tmp/elastic-agent.sock)
10:10:42.678
elastic_agent
[elastic_agent][info] Agent is stopped
10:10:42.679
elastic_agent
[elastic_agent][info] Configuration changes detected
10:10:42.685
elastic_agent
[elastic_agent][info] Source URI reset from "https://artifacts.elastic.co/downloads/" to "https://artifacts.elastic.co/downloads/"
10:10:42.685
elastic_agent
[elastic_agent][info] New State ID is NLIys1oS
10:10:42.685
elastic_agent
[elastic_agent][info] Converging state requires execution of 2 step(s)
10:10:43.556
elastic_agent
[elastic_agent][info] reexec Shutdown channel triggered
10:10:43.556
elastic_agent
[elastic_agent][info] Restarting Elastic Agent
10:10:43.556
elastic_agent
[elastic_agent][info] waiting for installer of pipeline 'default' to finish
10:10:43.573
elastic_agent
[elastic_agent][info] Signaling application to stop because of shutdown: metricbeat--8.5.2
10:10:43.575
elastic_agent
[elastic_agent][info] Source URI reset from "https://artifacts.elastic.co/downloads/" to "https://artifacts.elastic.co/downloads/"
10:10:43.862
elastic_agent
[elastic_agent][info] APM instrumentation disabled
10:10:43.864
elastic_agent
[elastic_agent][info] Detecting execution mode
10:10:43.881
elastic_agent
[elastic_agent][info] Agent is managed by Fleet
10:10:43.881
elastic_agent
[elastic_agent][info] capabilities file not found in /opt/Elastic/Agent/capabilities.yml
10:10:44.228
elastic_agent
[elastic_agent][info] Docker provider skipped, unable to connect: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
10:10:44.229
elastic_agent
[elastic_agent][info] Starting stats endpoint
10:10:44.229
elastic_agent
[elastic_agent][info] Agent is starting
10:10:44.229
elastic_agent
[elastic_agent][info] Metrics endpoint listening on: /opt/Elastic/Agent/data/tmp/elastic-agent.sock (configured: unix:///opt/Elastic/Agent/data/tmp/elastic-agent.sock)
10:10:46.170
elastic_agent
[elastic_agent][info] Source URI changed from "https://artifacts.elastic.co/downloads/" to "https://artifacts.elastic.co/downloads/"
10:10:46.171
elastic_agent
[elastic_agent][info] New State ID is DpwmbSTj
10:10:46.171
elastic_agent
[elastic_agent][info] Converging state requires execution of 4 step(s)
10:10:52.401
elastic_agent
[elastic_agent][info] 2022-11-25T15:10:52Z - message: Application: filebeat--8.5.2[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
10:10:53.564
elastic_agent
[elastic_agent][info] 2022-11-25T15:10:53Z - message: Application: filebeat--8.5.2[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to RUNNING: Running - type: 'STATE' - sub_type: 'RUNNING'
10:10:58.833
elastic_agent
[elastic_agent][info] 2022-11-25T15:10:58Z - message: Application: metricbeat--8.5.2[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
10:11:00.085
elastic_agent
[elastic_agent][info] 2022-11-25T15:11:00Z - message: Application: metricbeat--8.5.2[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to RUNNING: Running - type: 'STATE' - sub_type: 'RUNNING'
10:11:03.060
elastic_agent
[elastic_agent][info] 2022-11-25T15:11:03Z - message: Application: endpoint-security--8.5.2[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
10:11:03.637
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for filebeat.8.5.2
10:11:03.741
elastic_agent
[elastic_agent][info] 2022-11-25T15:11:03Z - message: Application: filebeat--8.5.2--36643631373035623733363936343635[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
10:11:03.980
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for metricbeat.8.5.2
10:11:04.211
elastic_agent
[elastic_agent][info] 2022-11-25T15:11:04Z - message: Application: metricbeat--8.5.2--36643631373035623733363936343635[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
10:11:04.217
elastic_agent
[elastic_agent][info] Updating internal state
10:11:04.903
elastic_agent
[elastic_agent][info] 2022-11-25T15:11:04Z - message: Application: filebeat--8.5.2--36643631373035623733363936343635[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to RUNNING: Running - type: 'STATE' - sub_type: 'RUNNING'
10:11:05.421
elastic_agent
[elastic_agent][info] 2022-11-25T15:11:05Z - message: Application: metricbeat--8.5.2--36643631373035623733363936343635[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to RUNNING: Running - type: 'STATE' - sub_type: 'RUNNING'
10:11:33.409
elastic_agent
[elastic_agent][info] 2022-11-25T15:11:33Z - message: Application: endpoint-security--8.5.2[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to RUNNING: Protecting with policy {4d2eac66-89e2-492a-a803-220d65fc0f70} - type: 'STATE' - sub_type: 'RUNNING'
10:12:54.456
elastic_agent
[elastic_agent][error] lazy acker: failed ack batch, enqueue for retry: []fleetapi.Action{(*fleetapi.ActionPolicyReassign)(0xc00064e040)}
10:13:07.321
elastic_agent
[elastic_agent][info] Source URI changed from "https://artifacts.elastic.co/downloads/" to "https://artifacts.elastic.co/downloads/"
10:13:07.322
elastic_agent
[elastic_agent][info] Source URI in operator changed to "https://artifacts.elastic.co/downloads/"
10:13:07.324
elastic_agent
[elastic_agent][info] New State ID is blfDr6k2
10:13:07.324
elastic_agent
[elastic_agent][info] Converging state requires execution of 4 step(s)
10:13:07.516
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for filebeat.8.5.2
10:13:07.516
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for filebeat.8.5.2
10:13:07.744
elastic_agent
[elastic_agent][error] Elastic Agent status changed to "error": "app filebeat--8.5.2-541572af: 1 error occurred:\n\t* 2 errors: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::65761-2049, Finished: false, Fileinfo: &{auth.log 28751 416 {560260638 63804985843 0x56268cb170c0} {2049 65761 1 33184 104 4 0 0 28751 4096 64 {1669388370 181926416} {1669389043 560260638} {1669389043 560260638} [0 0 0]}}, Source: /var/log/auth.log, Offset: 29084, Timestamp: 2022-11-25 15:11:14.149589201 +0000 UTC m=+21.693797461, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 65761-2049}; Error creating runner from config: Can only start an input when all related states are finished: {Id: native::65699-2049, Finished: false, Fileinfo: &{syslog 182333 416 {572261817 63804985843 0x56268cb170c0} {2049 65699 1 33184 104 4 0 0 182333 4096 360 {1669388363 813926088} {1669389043 572261817} {1669389043 572261817} [0 0 0]}}, Source: /var/log/syslog, Offset: 187220, Timestamp: 2022-11-25 15:12:50.225119703 +0000 UTC m=+117.769327963, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 65699-2049}\n\n"
10:13:07.744
elastic_agent
[elastic_agent][error] 2022-11-25T15:13:07Z - message: Application: filebeat--8.5.2[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to FAILED: 1 error occurred:
    * 2 errors: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::65761-2049, Finished: false, Fileinfo: &{auth.log 28751 416 {560260638 63804985843 0x56268cb170c0} {2049 65761 1 33184 104 4 0 0 28751 4096 64 {1669388370 181926416} {1669389043 560260638} {1669389043 560260638} [0 0 0]}}, Source: /var/log/auth.log, Offset: 29084, Timestamp: 2022-11-25 15:11:14.149589201 +0000 UTC m=+21.693797461, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 65761-2049}; Error creating runner from config: Can only start an input when all related states are finished: {Id: native::65699-2049, Finished: false, Fileinfo: &{syslog 182333 416 {572261817 63804985843 0x56268cb170c0} {2049 65699 1 33184 104 4 0 0 182333 4096 360 {1669388363 813926088} {1669389043 572261817} {1669389043 572261817} [0 0 0]}}, Source: /var/log/syslog, Offset: 187220, Timestamp: 2022-11-25 15:12:50.225119703 +0000 UTC m=+117.769327963, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 65699-2049}

 - type: 'ERROR' - sub_type: 'FAILED'
10:13:07.748
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for metricbeat.8.5.2
10:13:07.749
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for metricbeat.8.5.2
10:13:07.929
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for endpoint-security.8.5.2
10:13:07.929
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for endpoint-security.8.5.2
10:13:08.004
elastic_agent
[elastic_agent][info] 2022-11-25T15:13:08Z - message: Application: endpoint-security--8.5.2[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to RUNNING: Protecting with policy {35482b04-d74c-4b58-a8b1-651fc051a0fe} - type: 'STATE' - sub_type: 'RUNNING'
10:13:08.126
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for filebeat.8.5.2
10:13:08.126
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for filebeat.8.5.2
10:13:08.359
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for metricbeat.8.5.2
10:13:08.359
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for metricbeat.8.5.2
10:13:08.361
elastic_agent
[elastic_agent][info] Updating internal state
10:13:08.492
elastic_agent
[elastic_agent][error] lazy acker: failed ack batch, enqueue for retry: []fleetapi.Action{(*fleetapi.ActionPolicyChange)(0xc000650330)}
10:13:12.125
elastic_agent
[elastic_agent][info] Source URI changed from "https://artifacts.elastic.co/downloads/" to "https://artifacts.elastic.co/downloads/"
10:13:12.126
elastic_agent
[elastic_agent][info] Source URI in operator changed to "https://artifacts.elastic.co/downloads/"
10:13:12.128
elastic_agent
[elastic_agent][info] New State ID is blfDr6k2
10:13:12.128
elastic_agent
[elastic_agent][info] Converging state requires execution of 3 step(s)
10:13:12.321
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for filebeat.8.5.2
10:13:12.322
elastic_agent
[elastic_agent][warn] Elastic Agent status changed to "online": ""
10:13:12.322
elastic_agent
[elastic_agent][info] 2022-11-25T15:13:12Z - message: Application: filebeat--8.5.2[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
10:13:12.322
elastic_agent
[elastic_agent][info] 2022-11-25T15:13:12Z - message: Application: filebeat--8.5.2[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to RESTARTING: Restarting - type: 'STATE' - sub_type: 'STARTING'
10:13:12.458
elastic_agent
[elastic_agent][error] filebeat stderr: "Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid 3313\n"
10:13:12.461
elastic_agent
[elastic_agent][info] 2022-11-25T15:13:12Z - message: Application: filebeat--8.5.2[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to RESTARTING: exited with code: 1 - type: 'STATE' - sub_type: 'STARTING'
10:13:12.462
elastic_agent
[elastic_agent][info] 2022-11-25T15:13:12Z - message: Application: filebeat--8.5.2[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to RESTARTING: Restarting - type: 'STATE' - sub_type: 'STARTING'
10:13:12.462
elastic_agent
[elastic_agent][info] 2022-11-25T15:13:12Z - message: Application: filebeat--8.5.2[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
10:13:12.561
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for metricbeat.8.5.2
10:13:12.561
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for metricbeat.8.5.2
10:13:12.597
elastic_agent
[elastic_agent][error] filebeat stderr: "Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid 3313\n"
10:13:12.765
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for filebeat.8.5.2
10:13:12.765
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for filebeat.8.5.2
10:13:13.002
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for metricbeat.8.5.2
10:13:13.002
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for metricbeat.8.5.2
10:13:13.005
elastic_agent
[elastic_agent][info] Updating internal state
10:13:13.052
elastic_agent
[elastic_agent][error] lazy acker: failed ack batch, enqueue for retry: []fleetapi.Action{(*fleetapi.ActionPolicyChange)(0xc0008c3320)}
10:13:18.172
elastic_agent
[elastic_agent][info] Source URI changed from "https://artifacts.elastic.co/downloads/" to "https://artifacts.elastic.co/downloads/"
10:13:18.172
elastic_agent
[elastic_agent][info] Source URI in operator changed to "https://artifacts.elastic.co/downloads/"
10:13:18.174
elastic_agent
[elastic_agent][info] New State ID is blfDr6k2
10:13:18.174
elastic_agent
[elastic_agent][info] Converging state requires execution of 3 step(s)
10:13:18.366
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for filebeat.8.5.2
10:13:18.366
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for filebeat.8.5.2
10:13:18.600
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for metricbeat.8.5.2
10:13:18.600
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for metricbeat.8.5.2
10:13:18.799
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for filebeat.8.5.2
10:13:18.799
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for filebeat.8.5.2
10:13:19.033
elastic_agent
[elastic_agent][info] operation 'operation-install' skipped for metricbeat.8.5.2
10:13:19.033
elastic_agent
[elastic_agent][info] operation 'operation-start' skipped for metricbeat.8.5.2
10:13:19.035
elastic_agent
[elastic_agent][info] Updating internal state
10:14:09.033
elastic_agent
[elastic_agent][warn] Elastic Agent status changed to "degraded": "app filebeat--8.5.2-541572af: Missed last check-in"
10:14:09.033
elastic_agent
[elastic_agent][info] 2022-11-25T15:14:09Z - message: Application: filebeat--8.5.2[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to DEGRADED: Missed last check-in - type: 'STATE' - sub_type: 'RUNNING'
10:15:09.042
elastic_agent
[elastic_agent][error] Elastic Agent status changed to "error": "app filebeat--8.5.2-541572af: Missed two check-ins"
10:15:09.043
elastic_agent
[elastic_agent][error] 2022-11-25T15:15:09Z - message: Application: filebeat--8.5.2[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to FAILED: Missed two check-ins - type: 'ERROR' - sub_type: 'FAILED'
10:15:09.043
elastic_agent
[elastic_agent][error] 2022-11-25T15:15:09Z - message: Application: filebeat--8.5.2[2a9455e0-ccf1-47d7-8701-7664c4802907]: State changed to FAILED: Missed two check-ins - type: 'ERROR' - sub_type: 'FAILED'
Showing entries until Nov 25, 10:15:09
Agent logging level

info
cmacknz commented 1 year ago

@ceeeekay There is at least one known bug where the policy revision can get out of sync between the agent and Fleet using the Logstash output: https://github.com/elastic/fleet-server/issues/2105. Possibly that is related to the problems you are observing.

cmacknz commented 1 year ago

@pjbertels looking at the error [elastic_agent][error] filebeat stderr: "Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid 3313\n" 10:13:12.765

This should have been fixed, the original issue was reopened here: https://github.com/elastic/beats/issues/31670

ceeeekay commented 1 year ago

@ceeeekay There is at least one known bug where the policy revision can get out of sync between the agent and Fleet using the Logstash output: elastic/fleet-server#2105. Possibly that is related to the problems you are observing.

@cmacknz That looks like the one - thanks.

fearful-symmetry commented 1 year ago

@pjbertels what environment is fleet running in here? K8s? Docker? Native? I've recently ran into some issues with how this behaves inside certain k8s environments where the namespace's PID counter will reset but the filesystem remains the same, meaning the PID in the lockfile can be mapped to a new process.

Also, have we confirmed that the beats aren't trying to run multiple instances out of the same data directory?

pjbertels commented 1 year ago

This is an agent installed in an Ubuntu VM in GCP. The VM is created and the sw is installed in the usual way by a tool we have(OGC). Fleet is running in ESS. There are a few issues that have been identified and we will retest when we can get those fixes.

AndersonQ commented 1 year ago

The problem seems to happen when ever I make a change that results in a lot of network activity. Often it is a when moving the agents between policies.

I can only imagine that the problem is indeed a change in the policy, which makes the agent restart and for some reason it does not restart correctly

pjbertels commented 1 year ago

We think the issue is this one ... https://github.com/elastic/beats/issues/33653.

0xdeadbeer commented 1 year ago

More info from my side:

agent version: 8.4.3 integrations: system & docker metrics and logs

no matter how or in which order i enroll it, if the docker metrics and logs integration is present filebeat will fail.. (and result in the agent appearing as unhealthy)

Status: FAILED
Message: app filebeat--8.4.3-38e1f0dc: 1 error occurred:
        * 5 errors: Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields

Applications:
  * filebeat  (FAILED)
              1 error occurred:
              * 5 errors: Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields; Error creating runner from config: unexpected to option in processors.0.add_fields

  * metricbeat             (HEALTHY)
                           Running
  * filebeat_monitoring    (HEALTHY)
                           Running
  * metricbeat_monitoring  (HEALTHY)
                           Running
0xdeadbeer commented 1 year ago

After checking the logs, this is the only type of log i see that stands out from the rest..

image

cmacknz commented 1 year ago

@osamu-kj unexpected to option in processors.0.add_field is the key error in your case, likely the syntax in one of your processor definitions is wrong. If you have not added custom processors manually then this is a bug in the integration and you can open a new issue in https://github.com/elastic/integrations/.

0xdeadbeer commented 1 year ago

@cmacknz in the meantime i've upgraded elk and all of the agents to 8.5.2 and now its working fine, i'll let you know if something goes wrong on my side as well or similar.. thanks for mentioning it :)

jlind23 commented 1 year ago

@pjbertels is this still occurring or can we close that issue?

jlind23 commented 1 year ago

Not relevant anymore, closing.