elastic / fleet-server

The Fleet server allows managing a fleet of Elastic Agents.

[Self-Managed]: Metricbeat fails once on initial installation of Fleet Server. #1519

Closed: amolnater-qasource closed this issue 4 months ago

amolnater-qasource commented 2 years ago

Kibana version: 8.3 BC1 self-managed environment

Host OS and Browser version: Windows, All

Build details:

VERSION: 8.3 BC-1 self-managed environment
BUILD: 53216
COMMIT: ad5323315a6d826fa469248b8055a8bdd1f3bb51
Artifact Link: https://staging.elastic.co/8.3.0-96e38f70/summary-8.3.0.html

Preconditions:

  1. 8.3 BC1 self-managed environment should be available.
  2. A Fleet Server should be installed using the Fleet Server policy, with the System and Fleet Server integrations.

Steps to reproduce:

  1. Navigate to Fleet>Agents tab.
  2. Navigate to Fleet server logs and select log level error from dropdown filter.
  3. Observe in the logs that Metricbeat fails during Fleet Server installation and then gets restarted.
    01:10:14.403
    elastic_agent
    [elastic_agent][error] fleet-server stderr: "{\"level\":\"info\",\"time\":\"2022-06-02T01:10:14-04:00\",\"message\":\"No applicable limit for 0 agents, using default.\"}\n"
    01:11:48.299
    elastic_agent
    [elastic_agent][error] Elastic Agent status changed to: 'error'
    01:11:48.299
    elastic_agent
    [elastic_agent][error] failed to stop after 30s: application stopping timed out
    01:11:48.353
    elastic_agent
    [elastic_agent][error] 2022-06-02T01:11:48-04:00 - message: Application: metricbeat--8.3.0--36643631373035623733363936343635[8213ca76-f96c-497b-bb53-ce400f627008]: State changed to FAILED: failed to stop after 30s: application stopping timed out - type: 'ERROR' - sub_type: 'FAILED'

    Logs: elastic-agent-diagnostics-2022-06-02T05-14-58Z-00.zip

Expected Result: Metricbeat should not fail on initial installation of Fleet Server, and no error logs should appear.

amolnater-qasource commented 2 years ago

@manishgupta-qasource Please review.

manishgupta-qasource commented 2 years ago

Secondary review for this ticket is Done

jlind23 commented 2 years ago

@pierrehilbert @ph This is not a blocker but could you please assign someone to investigate this behaviour?

narph commented 2 years ago

@amolnater-qasource, from what I see, the issue is that Metricbeat took a very long time to stop and exceeded the timeout. I tried to reproduce the scenario on Windows 11, and Metricbeat managed to stop in a timely manner. I am not seeing anything in the Metricbeat logs that would cause this. Were you able to reproduce this each time with a fresh installation?

Were there any specific steps you took when setting up the stack and Fleet?

amolnater-qasource commented 2 years ago

Hi @narph, we have revalidated this on the latest 8.3 snapshot self-managed environment and found it reproducible there too.

OS: Windows 10

Steps followed:

  1. Install fleet-server and navigate to Fleet>Agents tab.
  2. Navigate to Fleet server logs and select log level error from dropdown filter.
  3. Observe in the logs that Metricbeat fails during Fleet Server installation and then gets restarted.

Logs: elastic-agent-diagnostics-2022-06-13T09-57-43Z-00.zip

Screenshot:

Please let us know if anything else is required from our end. Thanks

gideonw commented 2 years ago

I am experiencing this error as well, except in my case I am running in a resource-starved Raspberry Pi cluster, and it is causing the fleet-server to crash-loop.

package_policies:
- name: fleet_server-1
  id: fleet_server-1
  package:
    name: fleet_server
{
  "log.level": "error",
  "@timestamp": "2022-06-30T08:23:02.911Z",
  "log.origin": { "file.name": "process/stdlogger.go", "file.line": 54 },
  "message": "fleet-server stderr: \"{\\\"level\\\":\\\"info\\\",\\\"time\\\":\\\"2022-06-30T08:23:02Z\\\",\\\"message\\\":\\\"No applicable limit for 0 agents, using default.\\\"}\\n{\\\"level\\\":\\\"info\\\",\\\"time\\\":\\\"2022-06-30T08:23:02Z\\\",\\\"message\\\":\\\"No applicable limit for 0 agents, using default.\\\"}\\n\"",
  "agent.console.name": "fleet-server",
  "agent.console.type": "stderr",
  "ecs.version": "1.6.0"
}

I have been unsuccessful in resolving this issue; this same configuration has worked on this cluster when more resources were available. Having the option to increase the timeout might be nice, if that turns out to be the underlying issue.

amolnater-qasource commented 1 year ago

Bug Conversion

Thanks!

viszsec commented 1 year ago

I got the same error as well.

14:12:26.018 elastic_agent [elastic_agent][error] metricbeat stderr: ")\n\t/go/src/github.com/elastic/beats/metricbeat/beater/metricbeat.go:276 +"
14:12:26.018 elastic_agent [elastic_agent][error] metricbeat stderr: "0x28\ngithub.com/elastic/beats/v7/libbeat/cmd/instance.(Beat).launch.func5("
14:12:26.018 elastic_agent [elastic_agent][error] metricbeat stderr: ")\n\t/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go:461 +"
14:12:26.018 elastic_agent [elastic_agent][error] metricbeat stderr: "0x68\nsync.(Once).doSlow"
14:12:26.018 elastic_agent [elastic_agent][error] metricbeat stderr: "(0xc0005cf070, "
14:12:26.018 elastic_agent [elastic_agent][error] metricbeat stderr: "0xc000a30ce0)\n\t/usr/local/go/src/sync/once.go"
14:12:26.018 elastic_agent [elastic_agent][error] metricbeat stderr: ":68 +"
14:12:26.018 elastic_agent [elastic_agent][error] metricbeat stderr: "0x178\nsync.(*Once).Do("
14:12:26.018 elastic_agent [elastic_agent][error] metricbeat stderr: "0xc0005cf070, 0xc000a30ce0"
14:12:26.018 elastic_agent [elastic_agent][error] metricbeat stderr: ")\n\t/usr/local/go/src/sync/once.go:59"
14:12:26.018 elastic_agent [elastic_agent][error] metricbeat stderr: " +0x45\ngithub.com/elastic/elastic-agent-libs/service.HandleSignals.func1("
14:12:26.018 elastic_agent [elastic_agent][error] metricbeat stderr: ")\n\t/go/pkg/mod/github.com/elastic/elastic-agent-libs@v0.2.9/service/service.go:"
14:12:26.018 elastic_agent [elastic_agent][error] metricbeat stderr: "60 +0x20c\n"
14:12:26.018 elastic_agent [elastic_agent][error] metricbeat stderr: "created by github.com/elastic/elastic-agent-libs/service.HandleSignals\n\t"
14:12:26.018 elastic_agent [elastic_agent][error] metricbeat stderr: "/go/pkg/mod/github.com/elastic/elastic-agent-libs@v0.2.9/service/service.go:49"
14:12:26.018 elastic_agent [elastic_agent][error] metricbeat stderr: " +0x268\n"
14:12:26.514 elastic_agent [elastic_agent][error] failed to stop fleet-server: os: process already finished
14:12:31.921 elastic_agent [elastic_agent][error] fleet-server stderr: "{\"level\":\"info\",\"time\":\"2022-09-05T14:12:31+08:00\",\"message\":\"No applicable limit for 0 agents, using default.\"}\n{\"level\":\"info\",\"time\":\"2022-09-05T14:12:31+08:00\",\"message\":\"No applicable limit for 0 agents, using default.\"}\n"

jlind23 commented 4 months ago

[Clean up] Closing this as outdated, @amolnater-qasource feel free to reopen it if needed.