elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Windows upgrade rollback can fail in the case the service is off #4443

Open blakerouse opened 6 months ago

blakerouse commented 6 months ago

When an upgrade to a new version of Elastic Agent fails to start properly, the Windows service for the Elastic Agent is left stopped. During rollback, the watcher only calls Restart and never Start. Restart fails on a stopped service with "The service has not been started.", so the rollback cannot bring the previous version back up. The watcher should start the service when it is stopped and restart it when it is running.
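The proposed behavior could look roughly like the sketch below. It is only an illustration: `serviceController`, `startOrRestart`, and `fakeService` are hypothetical names invented for this example and are not the types or functions used in the elastic-agent code base. The fake mimics the failure seen in the logs, where restarting a stopped service returns an error.

```go
package main

import (
	"errors"
	"fmt"
)

// serviceController is a hypothetical abstraction over the OS service
// manager. It is an assumption for this sketch, not an elastic-agent API.
type serviceController interface {
	Running() (bool, error)
	Start() error
	Restart() error
}

// startOrRestart starts the service when it is stopped and restarts it
// when it is already running, as the issue proposes.
func startOrRestart(sc serviceController) error {
	running, err := sc.Running()
	if err != nil {
		return fmt.Errorf("failed to query service state: %w", err)
	}
	if !running {
		if err := sc.Start(); err != nil {
			return fmt.Errorf("failed to start service: %w", err)
		}
		return nil
	}
	if err := sc.Restart(); err != nil {
		return fmt.Errorf("failed to restart service: %w", err)
	}
	return nil
}

// errNotStarted mimics the Windows error reported in the watcher logs.
var errNotStarted = errors.New("the service has not been started")

// fakeService is an in-memory stand-in used only to exercise the sketch.
type fakeService struct {
	running bool
	calls   []string // records which operations were invoked, in order
}

func (f *fakeService) Running() (bool, error) { return f.running, nil }

func (f *fakeService) Start() error {
	f.calls = append(f.calls, "start")
	f.running = true
	return nil
}

// Restart fails on a stopped service, like the behavior in this issue.
func (f *fakeService) Restart() error {
	f.calls = append(f.calls, "restart")
	if !f.running {
		return errNotStarted
	}
	return nil
}
```

With a stopped `fakeService`, calling `Restart` directly returns `errNotStarted`, while `startOrRestart` calls `Start` instead and succeeds. A real implementation would have to query the Windows service manager for the state, which this sketch leaves out.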

See logs for the behavior of the watcher:

{"log.level":"info","@timestamp":"2024-03-19T23:57:51.412Z","log.origin":{"file.name":"cmd/watch.go","file.line":68},"message":"Upgrade Watcher started","process.pid":740,"agent.version":"8.14.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-19T23:57:51.415Z","log.origin":{"file.name":"cmd/watch.go","file.line":80},"message":"Loaded update marker &{Version:8.14.0-SNAPSHOT Hash:1d84d2 VersionedHome:data\\elastic-agent-8.14.0-SNAPSHOT-1d84d2 UpdatedOn:2024-03-19 23:57:51.099069 +0000 UTC PrevVersion:8.14.0-SNAPSHOT PrevHash:9312f5 PrevVersionedHome:data\\elastic-agent-8.14.0-SNAPSHOT-9312f5 Acked:false Action:action_id: 82198157-8bf1-4409-9c5c-337f13b5941a, type: UPGRADE Details:0xc00039b5e0}","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-19T23:57:51.425Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":66},"message":"Agent watcher started","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-19T23:58:21.430Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":133},"message":"Trying to connect to agent","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-03-19T23:58:21.431Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":141},"message":"Failed connecting to running daemon: connection error: desc = \"transport: error while dialing: open \\\\\\\\.\\\\pipe\\\\elastic-agent-system: The system cannot find the file specified.\"","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-19T23:58:51.443Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":133},"message":"Trying to connect to agent","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-03-19T23:58:51.443Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":141},"message":"Failed connecting to running daemon: connection error: desc = \"transport: error while dialing: open \\\\\\\\.\\\\pipe\\\\elastic-agent-system: The system cannot find the file specified.\"","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-19T23:59:21.456Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":133},"message":"Trying to connect to agent","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-03-19T23:59:21.456Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":141},"message":"Failed connecting to running daemon: connection error: desc = \"transport: error while dialing: open \\\\\\\\.\\\\pipe\\\\elastic-agent-system: The system cannot find the file specified.\"","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-19T23:59:51.470Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":133},"message":"Trying to connect to agent","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-03-19T23:59:51.470Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":141},"message":"Failed connecting to running daemon: connection error: desc = \"transport: error while dialing: open \\\\\\\\.\\\\pipe\\\\elastic-agent-system: The system cannot find the file specified.\"","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-20T00:00:21.471Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":133},"message":"Trying to connect to agent","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-03-20T00:00:21.471Z","log.origin":{"file.name":"upgrade/watcher.go","file.line":141},"message":"Failed connecting to running daemon: connection error: desc = \"transport: error while dialing: open \\\\\\\\.\\\\pipe\\\\elastic-agent-system: The system cannot find the file specified.\"","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-03-20T00:00:21.471Z","log.origin":{"file.name":"cmd/watch.go","file.line":183},"message":"Agent Error detected: failed to connect to agent daemon '5' times in a row","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-03-20T00:00:21.471Z","log.origin":{"file.name":"cmd/watch.go","file.line":117},"message":"Error detected, proceeding to rollback: %vfailed to connect to agent daemon '5' times in a row","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-20T00:00:21.480Z","log.origin":{"file.name":"upgrade/step_relink.go","file.line":31},"message":"Changing symlink","symlink_path":"C:\\Program Files\\Elastic\\Agent\\elastic-agent.exe","new_path":"C:\\Program Files\\Elastic\\Agent\\data\\elastic-agent-8.14.0-SNAPSHOT-9312f5\\elastic-agent.exe","prev_path":"C:\\Program Files\\Elastic\\Agent\\elastic-agent.exe.prev","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-20T00:00:21.482Z","log.origin":{"file.name":"upgrade/step_mark.go","file.line":164},"message":"Updating active commit","file.path":"C:\\Program Files\\Elastic\\Agent\\.elastic-agent.active.commit","hash":"9312f5","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-20T00:00:21.483Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":60},"message":"Restarting the agent after rollback","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-20T00:00:31.497Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":202},"message":"Restarting Agent via control protocol; attempt 1 of 5","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-03-20T00:00:34.498Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":209},"message":"Failed to restart agent via control protocol: failed communicating to running daemon: context deadline exceeded","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-20T00:00:34.498Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":213},"message":"Restarting Agent via service; attempt 1 of 5","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-03-20T00:00:34.499Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":218},"message":"Failed to restart agent via service: failed to restart agent via service: failed to restart service (Elastic Agent): The service has not been started.","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-03-20T00:00:34.499Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":225},"message":"Failed to restart agent; will try again in 20s","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-20T00:00:54.509Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":202},"message":"Restarting Agent via control protocol; attempt 2 of 5","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-03-20T00:00:57.510Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":209},"message":"Failed to restart agent via control protocol: failed communicating to running daemon: context deadline exceeded","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-20T00:00:57.510Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":213},"message":"Restarting Agent via service; attempt 2 of 5","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-03-20T00:00:57.510Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":218},"message":"Failed to restart agent via service: failed to restart agent via service: failed to restart service (Elastic Agent): The service has not been started.","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-03-20T00:00:57.510Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":225},"message":"Failed to restart agent; will try again in 40s","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-20T00:01:37.516Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":202},"message":"Restarting Agent via control protocol; attempt 3 of 5","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-03-20T00:01:40.521Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":209},"message":"Failed to restart agent via control protocol: failed communicating to running daemon: context deadline exceeded","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-20T00:01:40.521Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":213},"message":"Restarting Agent via service; attempt 3 of 5","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-03-20T00:01:40.521Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":218},"message":"Failed to restart agent via service: failed to restart agent via service: failed to restart service (Elastic Agent): The service has not been started.","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-03-20T00:01:40.522Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":225},"message":"Failed to restart agent; will try again in 1m20s","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-20T00:03:00.526Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":202},"message":"Restarting Agent via control protocol; attempt 4 of 5","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-03-20T00:03:03.532Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":209},"message":"Failed to restart agent via control protocol: failed communicating to running daemon: context deadline exceeded","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-20T00:03:03.532Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":213},"message":"Restarting Agent via service; attempt 4 of 5","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-03-20T00:03:03.532Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":218},"message":"Failed to restart agent via service: failed to restart agent via service: failed to restart service (Elastic Agent): The service has not been started.","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-03-20T00:03:03.532Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":225},"message":"Failed to restart agent; will try again in 1m30s","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-20T00:04:33.547Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":202},"message":"Restarting Agent via control protocol; attempt 5 of 5","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-03-20T00:04:36.551Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":209},"message":"Failed to restart agent via control protocol: failed communicating to running daemon: context deadline exceeded","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-20T00:04:36.551Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":213},"message":"Restarting Agent via service; attempt 5 of 5","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-03-20T00:04:36.551Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":218},"message":"Failed to restart agent via service: failed to restart agent via service: failed to restart service (Elastic Agent): The service has not been started.","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-03-20T00:04:36.551Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":222},"message":"Failed to restart agent after final attempt","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-03-20T00:04:36.551Z","log.origin":{"file.name":"cmd/watch.go","file.line":122},"message":"rollback failedfailed to restart agent via service: failed to restart service (Elastic Agent): The service has not been started.","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-03-20T00:04:36.557Z","log.origin":{"file.name":"cmd/watch.go","file.line":57},"message":"Watch command failed","error":{"message":"failed to restart agent via service: failed to restart service (Elastic Agent): The service has not been started."},"ecs.version":"1.6.0"}

elasticmachine commented 6 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

blakerouse commented 6 months ago

This happened because of a bug in my code for unprivileged mode, but I still believe the proposed change is the correct thing to do here. It's possible for the service to end up stopped, and when that happens the upgrade is effectively broken.

elasticmachine commented 4 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)