elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Elastic Agent enroll fails to restart daemon on docker #3628

Open AndersonQ opened 11 months ago

AndersonQ commented 11 months ago

The Elastic Agent fails to restart its daemon during enrollment when running from the Docker image. Subsequent starts of the same Docker container succeed.

The culprit is https://github.com/elastic/elastic-agent/commit/f7e558f736d5c17b5488a66e9051df814b95c050

docker run \
  --env FLEET_ENROLL=1 \
  --env FLEET_URL=https://fleet-url:8220/ \
  --env FLEET_ENROLLMENT_TOKEN=SOME_TOKEN \
  --env FLEET_INSECURE=true \
  docker.elastic.co/beats/elastic-agent:8.12.0-SNAPSHOT

Some of our tests are failing:

Logs:

root@elastic-agent:~# docker run \
  --env FLEET_ENROLL=1 \
  --env FLEET_URL=https://some.fleet.url:port \
  --env FLEET_ENROLLMENT_TOKEN=SOME_TOKEN \
  --env FLEET_INSECURE=true \
  docker.elastic.co/beats/elastic-agent:8.12.0-SNAPSHOT

{"log.level":"info","@timestamp":"2023-10-18T16:09:56.069Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":497},"message":"Starting enrollment to URL: https://fc2e07ab4001499380ce57a763e698fd.fleet.us-east-1.aws.staging.elastic.cloud:443/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-18T16:10:19.256Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":468},"message":"Retrying to restart...","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-18T16:10:59.261Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":468},"message":"Retrying to restart...","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-18T16:11:59.262Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":468},"message":"Retrying to restart...","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-18T16:12:59.265Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":468},"message":"Retrying to restart...","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-18T16:13:59.266Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":468},"message":"Retrying to restart...","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-10-18T16:13:59.266Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":280},"message":"Elastic Agent might not be running; unable to trigger restart: could not reload agent's daemon, all retries failed. Last error: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /usr/share/elastic-agent/state/data/tmp/elastic-agent-control.sock: connect: no such file or directory\"","ecs.version":"1.6.0"}
Something went wrong while enrolling the Elastic Agent: could not reload agent's daemon, all retries failed. Last error: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /usr/share/elastic-agent/state/data/tmp/elastic-agent-control.sock: connect: no such file or directory"
Error: could not reload agent daemon, unable to trigger restart: could not reload agent's daemon, all retries failed. Last error: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /usr/share/elastic-agent/state/data/tmp/elastic-agent-control.sock: connect: no such file or directory"
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
Error: enrollment failed: exit status 1
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html

root@elastic-agent:~# docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

root@elastic-agent:~# docker ps --all
CONTAINER ID   IMAGE                                                   COMMAND                  CREATED         STATUS                      PORTS     NAMES
e07749ec9372   docker.elastic.co/beats/elastic-agent:8.12.0-SNAPSHOT   "/usr/bin/tini -- /u…"   4 minutes ago   Exited (1) 38 seconds ago             happy_visvesvaraya

root@elastic-agent:~# docker start e07749ec9372
e07749ec9372

root@elastic-agent:~# docker logs -f e07749ec9372
{"log.level":"info","@timestamp":"2023-10-18T16:14:49.613Z","log.origin":{"file.name":"cmd/run.go","file.line":155},"message":"Elastic Agent started","log":{"source":"elastic-agent"},"process.pid":7,"agent.version":"8.12.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-18T16:14:49.823Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":113},"message":"agent is not upgradable, not starting watcher","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-18T16:14:49.823Z","log.origin":{"file.name":"cmd/run.go","file.line":242},"message":"APM instrumentation disabled","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

cmacknz commented 11 months ago

Let's revert the commit that caused this to fix the 8.11 and 8.12 branches quickly while we figure out how to fix it properly.

elasticmachine commented 11 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz commented 11 months ago

The original commits causing this have now been reverted.

pierrehilbert commented 10 months ago

I created https://github.com/elastic/elastic-agent/issues/3732 to make it easier to test this change.

elasticmachine commented 3 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)