Open harshitgupta-qasource opened 8 months ago
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Secondary Review for this ticket is Done.
I can reproduce this. I suspect it is another unintended consequence of https://github.com/elastic/elastic-agent/pull/3815, where we now always treat a failure to restart via the control socket as a fatal error.
The agent service isn't automatically started after running dpkg -i, so the enroll command's attempt to restart it cannot succeed. Likely the fix will be similar to https://github.com/elastic/elastic-agent/pull/4042: we need to skip the attempt to restart the agent in this case, because starting it is supposed to be manual.
ubuntu@valuable-gudgeon:~$ sudo dpkg -i ./elastic-agent-8.12.0-arm64.deb
Selecting previously unselected package elastic-agent.
(Reading database ... 66270 files and directories currently installed.)
Preparing to unpack .../elastic-agent-8.12.0-arm64.deb ...
Unpacking elastic-agent (8.12.0) ...
Setting up elastic-agent (8.12.0) ...
found symlink /usr/share/elastic-agent/bin/elastic-agent, unlink
create symlink /usr/share/elastic-agent/bin/elastic-agent to /var/lib/elastic-agent/data/elastic-agent-5cbf2e/elastic-agent
ubuntu@valuable-gudgeon:~$ sudo systemctl status elastic-agent
○ elastic-agent.service - Agent manages other beats based on configuration provided.
Loaded: loaded (/lib/systemd/system/elastic-agent.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: https://www.elastic.co/beats/elastic-agent
ubuntu@valuable-gudgeon:~$ sudo elastic-agent enroll --url=https://2d8b862d544f4fbca4ff375dfae3b19f.fleet.eastus2.staging.azure.foundit.no:443 --enrollment-token=Qmtvei1vd0JvRFNMYWwxdC04bTU6R3lldEtHc01SYW1iQy1pYU9qOFRsZw==
This will replace your current settings. Do you want to continue? [Y/n]:y
{"log.level":"info","@timestamp":"2024-01-15T11:35:48.449-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":496},"message":"Starting enrollment to URL: https://XXXXX.fleet.eastus2.staging.azure.foundit.no:443/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-01-15T11:35:49.770-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":461},"message":"Restarting agent daemon, attempt 0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-01-15T11:35:49.771-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":475},"message":"Restart attempt 0 failed: 'rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/lib/elastic-agent/data/tmp/elastic-agent-control.sock: connect: no such file or directory\"'. Waiting for 2s","ecs.version":"1.6.0"}
The instructions for enrolling a DEB in Fleet already include manually starting the service for this reason:
curl -L -O https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.12.0-arm64.deb
sudo dpkg -i elastic-agent-8.12.0-arm64.deb
sudo elastic-agent enroll --url=https://XXXXX.fleet.eastus2.staging.azure.foundit.no:443 --enrollment-token=XXXXX
sudo systemctl enable elastic-agent
sudo systemctl start elastic-agent
I should note that the error here doesn't mean the enrollment failed. Enrollment actually succeeded, and if you ignore the error and continue with the following commands, the agent successfully connects to Fleet.
sudo systemctl enable elastic-agent
sudo systemctl start elastic-agent
We should just need to pass the --skip-daemon-reload flag to the enroll command run by the DEB and RPM packages.
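For reference, a minimal sketch of what passing the flag looks like when the command is run by hand, which is the workaround reported later in this thread. The function name and its two positional arguments are placeholders, not part of the agent or its packaging scripts.

```shell
# Hypothetical helper; only the elastic-agent command line comes from this thread.
# $1 = Fleet URL, $2 = enrollment token
enroll_skip_restart() {
  # --skip-daemon-reload tells enroll not to attempt the daemon restart,
  # which cannot succeed while the service has never been started.
  sudo elastic-agent enroll --url="$1" --enrollment-token="$2" --skip-daemon-reload
}
```

The service still has to be enabled and started afterwards with systemctl, as in the existing instructions.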
You can also avoid the error by starting the agent service before enrolling.
sudo systemctl enable elastic-agent
sudo systemctl start elastic-agent
An alternative to fixing this in the agent is to change the instructions in Fleet to start the service before enrolling:
This is what we have today:
curl -L -O https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.12.2-amd64.deb
sudo dpkg -i elastic-agent-8.12.2-amd64.deb
sudo elastic-agent enroll --url=https://XXXXX.fleet.eastus2.staging.azure.foundit.no:443 --enrollment-token=XXXXX
sudo systemctl enable elastic-agent
sudo systemctl start elastic-agent
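The reordered version would presumably look like the sketch below, with the same commands but the service enabled and started before enroll runs, so that the control socket exists when enroll asks the daemon to restart. The function wrapper and its argument names are illustrative only, and this is not the final wording of the Fleet instructions.

```shell
# Hypothetical helper showing the proposed order of the existing commands.
# $1 = path to the downloaded .deb, $2 = Fleet URL, $3 = enrollment token
install_start_then_enroll() {
  sudo dpkg -i "$1" || return
  # Start the service first so the enroll restart request has a daemon to reach.
  sudo systemctl enable elastic-agent || return
  sudo systemctl start elastic-agent || return
  sudo elastic-agent enroll --url="$2" --enrollment-token="$3"
}
```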
We are also investigating automatically starting the service as part of the deb/rpm installer.
Hello @cmacknz
I should note that the error here doesn't mean the enrollment failed, enrollment actually succeeded and if you ignore the error and continue with the following the agent successfully connects to Fleet.
While this is true, this has some impact when using automation tools.
For example, Ansible relies on the exit code of the previous command to decide whether to continue to the next task in the playbook or exit with an error. Currently the enroll command as described in the Fleet UI instructions always fails, returning an exit code of 1, which then halts the Ansible playbook.
I was helping one of the infra teams in my company write an Ansible playbook to deploy the agents, and spent some time troubleshooting why it was not working and always failing at the enrollment step.
I was only able to fix the playbook because I found this issue and the undocumented flag --skip-daemon-reload. I think this should be present in the documentation page.
After that, I tested on another server, and using --delay-enroll also works.
Since the next steps consist of enabling the systemd service and starting it, we chose to use --delay-enroll, as this is a little faster in the Ansible playbook.
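A rough shell equivalent of that sequence, for anyone scripting this outside Ansible. It is only a sketch: the function name, arguments, and error message are made up for illustration, and the playbook itself is not reproduced here. Each step propagates its exit code, so a calling tool only sees a failure when a step really failed.

```shell
# Hypothetical wrapper for automation; only the elastic-agent and systemctl
# command lines are taken from this thread.
# $1 = Fleet URL, $2 = enrollment token
enroll_delayed() {
  # --delay-enroll is the flag reported above as avoiding the restart error.
  if ! sudo elastic-agent enroll --url="$1" --enrollment-token="$2" --delay-enroll; then
    echo "enroll failed" >&2
    return 1
  fi
  # Stop at the first failing step and hand its exit code to the caller.
  sudo systemctl enable elastic-agent || return
  sudo systemctl start elastic-agent
}
```

The confirmation prompt shown in the transcript earlier in this issue is not handled by this sketch.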
Kibana Build details:
Host OS and Browser version: All, All
Preconditions:
Steps to reproduce:
"Restarting agent failed" error is displayed in CLI.
What's working fine:
Expected: When enrolling RPM and DEB agents, the "Restarting agent failed" error should not be displayed in the CLI.