Issues with automatic provisioning of scenarios with Fleet Server and Agent

jsoriano commented 1 year ago

While running elastic-package stack up, we have identified a couple of issues that, when combined, can lead to unreliable provisioning of automated scenarios involving Fleet Server and Agent. These issues are:

Fleet Server seems to be restarted during bootstrap while it is reporting itself as healthy.
Elastic Agent immediately fails if it starts and cannot reach Fleet Server.

This may not be new, but we have identified it as much more frequent in 8.6.

Where is this an issue?

Any automation that waits for Fleet Server to be healthy before starting to enroll Agents can find that the first enrollments fail because Fleet Server is not available. This can be reproduced with elastic-package stack up (up to elastic-package 0.72.0), where the following happens, orchestrated by docker-compose:

1. Start other stack services (package-registry, Elasticsearch, Kibana).
2. Wait for the healthchecks to pass.
3. Start Fleet Server.
4. Wait for the healthchecks to pass.
5. Start and enroll Elastic Agent.
6. Wait for elastic-agent status to be healthy.

The issue is that between steps 4 and 5, after Fleet Server has reported itself as healthy, it goes back to a state where it cannot accept connections, so step 5 fails and the process is aborted.
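To illustrate why a single passing healthcheck is not enough here, this is a minimal sketch of a stricter wait that only returns once a probe has succeeded several times in a row. It is not what elastic-package does. `tcp_probe`, `wait_until_stable` and their parameters are hypothetical names chosen for the example.

```python
import socket
import time
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_stable(
    probe: Callable[[], bool],
    required_successes: int = 3,
    interval: float = 1.0,
    max_checks: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `probe` succeeds `required_successes` times in a row.

    Any failure resets the streak, so a service that flaps between healthy
    and unavailable is not considered ready. Returns False if the streak is
    not reached within `max_checks` probes.
    """
    streak = 0
    for _ in range(max_checks):
        if probe():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0
        sleep(interval)
    return False
```

With the address from the enrollment URL in the logs, a caller could use `wait_until_stable(lambda: tcp_probe("fleet-server", 8220))` before starting the agent. This only narrows the window and does not remove the race, so retries on the agent side would still be needed.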

This can be seen on this build for example: https://fleet-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Felastic-package/detail/PR-1118/6/pipeline

Fleet Server becomes healthy, and then it seems to be restarted or reconfigured (multiple times?):

{"log.level":"info","@timestamp":"2023-01-27T17:09:40.929Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":831},"message":"Component state changed fleet-server-default (STARTING->HEALTHY): Healthy: communicating with pid '70'","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
...
{"log.level":"info","@timestamp":"2023-01-27T17:09:40.934Z","message":"starting server on configuration change","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","ecs.version":"1.6.0"}
...
{"log.level":"warn","@timestamp":"2023-01-27T17:09:41.559Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":829},"message":"Unit state changed fleet-server-default (STARTING->DEGRADED): Running on default policy with Fleet Server integration; missing config fleet.agent.id (expected during bootstrap process)","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"DEGRADED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-01-27T17:09:41.559Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":829},"message":"Unit state changed fleet-server-default-fleet-server (STARTING->DEGRADED): Running on default policy with Fleet Server integration; missing config fleet.agent.id (expected during bootstrap process)","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default-fleet-server","type":"input","state":"DEGRADED","old_state":"STARTING"},"ecs.version":"1.6.0"}
...
{"log.level":"info","@timestamp":"2023-01-27T17:09:45.473Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":831},"message":"Component state changed fleet-server-default (STARTING->HEALTHY): Healthy: communicating with pid '96'","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
...
{"log.level":"info","@timestamp":"2023-01-27T17:09:49.266Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":831},"message":"Unit state changed fleet-server-default (STARTING->HEALTHY): Running on default policy with Fleet Server integration","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"HEALTHY"},"unit":{"id":"fleet-server-default","type":"output","state":"HEALTHY","old_state":"STARTING"},"ecs.version":"1.6.0"}
...
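
When going through logs like the ones above, it can help to reduce them to the sequence of state transitions. The following is a rough sketch that assumes one JSON object per line and a message containing `(OLD->NEW)`, as in the excerpt. `Transition`, `parse_transition` and `transitions` are names made up for this example.

```python
import json
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Matches the "(OLD->NEW)" part of messages such as
# "Component state changed fleet-server-default (STARTING->HEALTHY): ..."
TRANSITION = re.compile(r"\((\w+)->(\w+)\)")


class Transition(NamedTuple):
    timestamp: str
    old_state: str
    new_state: str


def parse_transition(line: str) -> Optional[Transition]:
    """Extract a state transition from one JSON log line, if it has one."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None  # e.g. "..." separators or plain-text lines
    if not isinstance(entry, dict):
        return None
    match = TRANSITION.search(str(entry.get("message", "")))
    if match is None:
        return None
    return Transition(str(entry.get("@timestamp", "")), match.group(1), match.group(2))


def transitions(lines: Iterable[str]) -> Iterator[Transition]:
    """Yield the transitions found in an iterable of log lines, in order."""
    for line in lines:
        transition = parse_transition(line)
        if transition is not None:
            yield transition
```

Seeing more than one transition into HEALTHY for the same component during bootstrap is the pattern described in this issue.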

Elastic Agent immediately fails, and exits:

Policy selected for enrollment:  elastic-agent-managed-ep
{"log.level":"info","@timestamp":"2023-01-27T17:09:48.714Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":475},"message":"Starting enrollment to URL: https://fleet-server:8220/","ecs.version":"1.6.0"}
Error: fail to enroll: fail to execute request to fleet-server: dial tcp 172.18.0.6:8220: connect: connection refused
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.6/fleet-troubleshooting.html
Error: enrollment failed: exit status 1
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.6/fleet-troubleshooting.html

Full logs here:

Proposed changes

Fleet Server shouldn't report itself as healthy until it is ready to accept agent connections.
Elastic Agent should retry connections to Fleet Server, to handle Fleet Server restarts. It already retries when trying to connect to other services.

Workaround

Restart elastic-agent if it fails during enrollment. This change has been applied to elastic-package starting with 0.73.0 (https://github.com/elastic/elastic-package/pull/1118).
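
For setups that cannot rely on the orchestrator to restart the container, the same idea can be sketched as a small wrapper that reruns a command while it exits with an error. This is only an illustration of the restart-on-failure approach and not the change made in the pull request above. `run_with_restarts` and its parameters are hypothetical, and the command to run is whatever your container uses to start and enroll the agent.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_restarts(
    command: Sequence[str],
    max_attempts: int = 5,
    delay: float = 5.0,
    run: Callable[[Sequence[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `command`, restarting it while it exits with a non-zero status.

    Returns the last exit code: 0 on success, or the failing code once
    `max_attempts` runs have been used up.
    """
    code = 1
    for attempt in range(1, max_attempts + 1):
        code = run(command)
        if code == 0:
            return 0
        if attempt < max_attempts:
            # Give Fleet Server time to come back before trying again.
            sleep(delay)
    return code
```

A fixed delay keeps the sketch short. A growing delay between attempts would be gentler if Fleet Server takes longer to settle.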

joeperuzzi commented 1 year ago

@jsoriano we're not sure if this is needed in the default instance, but we were having some issues with the docker setup yesterday and found that the package-registry instance needed this added to its docker config in order to come online:

security_opt:
  - seccomp:unconfined

Otherwise we received an immediate error:

runtime/cgo: pthread_create failed: Operation not permitted

jsoriano commented 1 year ago

I actually wanted to create this issue in elastic-agent :facepalm: Moving it.

jsoriano commented 1 year ago

@jsoriano we're not sure if this is needed in the default instance, but we were having some issues with the docker setup yesterday and found that the package-registry instance needed this added to its docker config in order to come online:
security_opt:
  - seccomp:unconfined
Otherwise we received an immediate error:

runtime/cgo: pthread_create failed: Operation not permitted

I don't think this is related to this issue. I haven't seen this kind of problem with package-registry before. @joeperuzzi could you please create an issue in the https://github.com/elastic/elastic-package repository with information about your environment and the version of elastic-package that you are using?

chrispangg commented 1 year ago

Both elastic-agent and elastic-agent-complete images are still having this issue. What is the workaround for this if I am using docker-compose to provision the fleet-server (elastic-agent)?
