elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

[Fleet]: On enrolling RPM and Deb agents, `Restarting agent failed` error is displayed in CLI. #4084

Open harshitgupta-qasource opened 8 months ago

harshitgupta-qasource commented 8 months ago

Kibana Build details:

VERSION: 8.12.0 BC6
BUILD: 70088
COMMIT: e9092c0a17923f4ed984456b8a5db619b0a794b3
Artifact Link: https://staging.elastic.co/8.12.0-3eba7f46/summary-8.12.0.html#elastic-agent

Host OS and Browser version: All, All

Preconditions:

  1. 8.12.0 Kibana Cloud environment should be available.
  2. Policy should be created.
  3. Deb/RPM agent should be extracted.

Steps to reproduce:

  1. Run agent enroll command for RPM/DEB
  2. Observe that on enrolling RPM and Deb agents, Restarting agent failed error is displayed in CLI.

What's working fine:

Expected: On enrolling RPM and DEB agents, the `Restarting agent failed` error should not be displayed in the CLI.

Screenshot: image (1)

elasticmachine commented 8 months ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

amolnater-qasource commented 8 months ago

Secondary Review for this ticket is Done.

cmacknz commented 8 months ago

I can reproduce this. I suspect it is another unintended consequence of https://github.com/elastic/elastic-agent/pull/3815, where we now always treat a failure to restart via the control socket as a fatal error.

The agent service isn't automatically started after running dpkg -i, so the enroll command's attempt to restart it cannot succeed. The fix will likely be similar to https://github.com/elastic/elastic-agent/pull/4042: we need to skip the attempt to restart the agent in this case, because starting the service is supposed to be a manual step.

ubuntu@valuable-gudgeon:~$ sudo dpkg -i ./elastic-agent-8.12.0-arm64.deb
Selecting previously unselected package elastic-agent.
(Reading database ... 66270 files and directories currently installed.)
Preparing to unpack .../elastic-agent-8.12.0-arm64.deb ...
Unpacking elastic-agent (8.12.0) ...
Setting up elastic-agent (8.12.0) ...
found symlink /usr/share/elastic-agent/bin/elastic-agent, unlink
create symlink /usr/share/elastic-agent/bin/elastic-agent to /var/lib/elastic-agent/data/elastic-agent-5cbf2e/elastic-agent
ubuntu@valuable-gudgeon:~$ sudo systemctl status elastic-agent
○ elastic-agent.service - Agent manages other beats based on configuration provided.
     Loaded: loaded (/lib/systemd/system/elastic-agent.service; disabled; vendor preset: enabled)
     Active: inactive (dead)
       Docs: https://www.elastic.co/beats/elastic-agent
ubuntu@valuable-gudgeon:~$ sudo elastic-agent enroll --url=https://2d8b862d544f4fbca4ff375dfae3b19f.fleet.eastus2.staging.azure.foundit.no:443 --enrollment-token=Qmtvei1vd0JvRFNMYWwxdC04bTU6R3lldEtHc01SYW1iQy1pYU9qOFRsZw==
This will replace your current settings. Do you want to continue? [Y/n]:y
{"log.level":"info","@timestamp":"2024-01-15T11:35:48.449-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":496},"message":"Starting enrollment to URL: https://XXXXX.fleet.eastus2.staging.azure.foundit.no:443/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-01-15T11:35:49.770-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":461},"message":"Restarting agent daemon, attempt 0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-15T11:35:49.771-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":475},"message":"Restart attempt 0 failed: 'rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/lib/elastic-agent/data/tmp/elastic-agent-control.sock: connect: no such file or directory\"'. Waiting for 2s","ecs.version":"1.6.0"}

The instructions for enrolling a DEB in Fleet already include manually starting the service for this reason:

curl -L -O https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.12.0-arm64.deb
sudo dpkg -i elastic-agent-8.12.0-arm64.deb
sudo elastic-agent enroll --url=https://XXXXX.fleet.eastus2.staging.azure.foundit.no:443 --enrollment-token=XXXXX
sudo systemctl enable elastic-agent
sudo systemctl start elastic-agent

cmacknz commented 8 months ago

I should note that the error here doesn't mean the enrollment failed. Enrollment actually succeeded, and if you ignore the error and continue with the following commands, the agent successfully connects to Fleet.

sudo systemctl enable elastic-agent
sudo systemctl start elastic-agent

cmacknz commented 8 months ago

We should just need to pass the --skip-daemon-reload flag to the enroll command run by the DEB and RPM packages:

https://github.com/elastic/elastic-agent/blob/c35c3692b1642381beadbc4e5d9b45532abe681e/internal/pkg/agent/cmd/enroll.go#L78
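As a rough illustration of what passing that flag looks like from the command line, here is a minimal sketch. The helper name `enroll_skip_reload`, the `RUN` dry-run variable, and the placeholder URL and token arguments are assumptions made for the example, not part of the agent or its packages.

```shell
# Hypothetical helper: enroll a DEB/RPM-installed agent without attempting
# to restart the (not yet running) daemon.
# RUN is an assumed override: set RUN=echo to print the command instead of
# executing it with sudo.
enroll_skip_reload() {
    fleet_url=$1
    token=$2
    ${RUN:-sudo} elastic-agent enroll \
        --url="$fleet_url" \
        --enrollment-token="$token" \
        --skip-daemon-reload
}
```

Calling `RUN=echo enroll_skip_reload "https://XXXXX:443" "XXXXX"` prints the command for inspection. The service still has to be enabled and started afterwards with systemctl.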

cmacknz commented 7 months ago

You can also avoid the error by starting the agent service before enrolling.

sudo systemctl enable elastic-agent
sudo systemctl start elastic-agent

cmacknz commented 7 months ago

An alternative to fixing this in the agent is to change the instructions in Fleet to start the service before enrolling.

This is what we have today:

curl -L -O https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.12.2-amd64.deb
sudo dpkg -i elastic-agent-8.12.2-amd64.deb
sudo elastic-agent enroll --url=https://XXXXX.fleet.eastus2.staging.azure.foundit.no:443 --enrollment-token=XXXXX
sudo systemctl enable elastic-agent 
sudo systemctl start elastic-agent
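
A reordered variant that starts the service before enrolling might look roughly like the sketch below. It uses the same commands as the current instructions in a different order. The function name `install_start_enroll`, its arguments, and the `RUN` dry-run variable are assumptions for illustration only.

```shell
# Sketch of the reordered steps: start the service first, so that the
# enroll command finds a running daemon to restart via the control socket.
# RUN is an assumed override: set RUN=echo to print the commands instead
# of executing them with sudo.
install_start_enroll() {
    deb_file=$1
    fleet_url=$2
    token=$3
    ${RUN:-sudo} dpkg -i "$deb_file" &&
    ${RUN:-sudo} systemctl enable elastic-agent &&
    ${RUN:-sudo} systemctl start elastic-agent &&
    ${RUN:-sudo} elastic-agent enroll --url="$fleet_url" --enrollment-token="$token"
}
```

Each step runs only if the previous one succeeded, so a failure in any step is visible in the function's exit status.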

We are also investigating automatically starting the service as part of the deb/rpm installer.

leandrojmp commented 4 months ago

Hello @cmacknz

> I should note that the error here doesn't mean the enrollment failed. Enrollment actually succeeded, and if you ignore the error and continue with the following commands, the agent successfully connects to Fleet.

While this is true, this has some impact when using automation tools.

For example, Ansible relies on the exit code of each command to decide whether to continue to the next task in the playbook or exit with an error. Currently, the enroll command as described in the Fleet UI instructions always fails with an exit code of 1, which halts the Ansible playbook.

I was helping one of the infra teams in my company write an Ansible playbook to deploy the agents, and I spent some time troubleshooting why it always failed at the enrollment step.

I was only able to fix the playbook because I found this issue and the undocumented flag --skip-daemon-reload. I think this flag should be mentioned in the documentation.
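
To make the exit-code dependency concrete, here is a small sketch that mimics this behavior in plain shell. It is not how Ansible is implemented; `run_steps` is a hypothetical stand-in that stops at the first step returning a non-zero status.

```shell
# Hypothetical stand-in for an automation tool: run each step in order and
# stop at the first one that exits with a non-zero status.
run_steps() {
    for step in "$@"; do
        sh -c "$step" || {
            status=$?
            echo "step failed with exit code $status: $step" >&2
            return "$status"
        }
    done
}
```

With a failing middle step, for example `run_steps "echo install" "exit 1" "echo start-service"`, the last step never runs. The same thing happens to the service start when the enroll command returns 1.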

After that, I tested on another server and using --delay-enroll also works.

Since the next steps consist of enabling and starting the systemd service, we chose to use --delay-enroll, as it is a little faster in the Ansible playbook.
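
For anyone scripting this outside Ansible, the sequence we ended up with looks roughly like the sketch below. The function name `enroll_delayed_then_start`, its arguments, and the `RUN` dry-run variable are made up for the example. Only the commands and the --delay-enroll flag come from this thread.

```shell
# Sketch: enroll with --delay-enroll, then enable and start the service.
# Each step runs only if the previous one exited with status 0.
# RUN is an assumed override: set RUN=echo to print the commands instead
# of executing them with sudo.
enroll_delayed_then_start() {
    fleet_url=$1
    token=$2
    ${RUN:-sudo} elastic-agent enroll --url="$fleet_url" --enrollment-token="$token" --delay-enroll &&
    ${RUN:-sudo} systemctl enable elastic-agent &&
    ${RUN:-sudo} systemctl start elastic-agent
}
```

The function's exit status is that of the first failing step, or 0 if all three succeed, which is what an automation tool checks before moving on.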