elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.
Other
121 stars 131 forks source link

Error uninstalling agent: [...] missing matching binary for [...] #4167

Open strawgate opened 7 months ago

strawgate commented 7 months ago

For confirmed bugs, please report:

CI is occasionally failing.

Agent installed on Windows via MSI with --delay-enroll --url=https://placeholder:443 --enrollment-token=token and then uninstalled with -f

Fails to uninstall and prints   | **Error: error uninstalling agent: error uninstalling components: failed to detect inputs and outputs: missing matching binary for C:\Program Files\Elastic\Agent\data\elastic-agent-9c40d6\components\apm-server.spec.yml**

More log context:

WARNING: Running command: c:\\Program Files\\Elastic\\Agent\\elastic-agent.exe uninstall -f
--
  | {debug 2024-01-31 13:24:21.567682 +0000 GMT m=+0.075581901 processes Error fetching PID info for 0, skipping: GetInfoForPid: could not get all information for PID 0: error fetching name: OpenProcess failed for pid=0: The parameter is incorrect.
  | error fetching status: OpenProcess failed for pid=0: The parameter is incorrect. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:173 }
  | {debug 2024-01-31 13:24:21.567682 +0000 GMT m=+0.075581901 processes Error fetching PID info for 4, skipping: GetInfoForPid: could not get all information for PID 4: error fetching name: GetProcessImageFileName failed for pid=4: GetProcessImageFileName failed: invalid argument github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:173 }
  | {debug 2024-01-31 13:24:21.567682 +0000 GMT m=+0.075581901 processes Non fatal error fetching PID some info for 116, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:176 }
  | {debug 2024-01-31 13:24:21.567682 +0000 GMT m=+0.075581901 processes Non fatal error fetching PID some info for 364, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:176 }
  | {debug 2024-01-31 13:24:21.567682 +0000 GMT m=+0.075581901 processes Non fatal error fetching PID some info for 496, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:176 }
  | {debug 2024-01-31 13:24:21.567682 +0000 GMT m=+0.075581901 processes Non fatal error fetching PID some info for 572, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:176 }
  | {debug 2024-01-31 13:24:21.567682 +0000 GMT m=+0.075581901 processes Non fatal error fetching PID some info for 580, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:176 }
  | {debug 2024-01-31 13:24:21.571989 +0000 GMT m=+0.079888801 processes Non fatal error fetching PID some info for 712, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:176 }
  | {debug 2024-01-31 13:24:21.5901418 +0000 GMT m=+0.098041301 processes Non fatal error fetching PID some info for 3136, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:176 }
  | {debug 2024-01-31 13:24:21.6018655 +0000 GMT m=+0.109764701 processes Non fatal error fetching PID some info for 2008, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied. github.com/elastic/elastic-agent-system-metrics@v0.9.1/metric/system/process/process.go:176 }
  | Error: error uninstalling agent: error uninstalling components: failed to detect inputs and outputs: missing matching binary for C:\Program Files\Elastic\Agent\data\elastic-agent-9c40d6\components\apm-server.spec.yml
  | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.13/fleet-troubleshooting.html
  |  
  | Agent uninstall return code:1

Example CI Failure: https://buildkite.com/elastic/elastic-stack-installers/builds/3671#018d5f93-affb-4019-a534-b53fcf8fe1e9

It would be nice to figure out why this is failing but second best would be making this not fatally prevent uninstallation

elasticmachine commented 7 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

pierrehilbert commented 7 months ago

Have you been able to reproduce this on Windows by using the zip installer and not the MSI one? It would help to understand where the problem is coming from

strawgate commented 7 months ago

I have not, the CI is just using the MSI installer, I have not used the Zip recently

cmacknz commented 7 months ago

We've never seen this failure in CI without the MSI.

The error is coming from the uninstall logic in https://github.com/elastic/elastic-agent/blob/4b404b18debf0f83d19fa72132a9907e5fcf8641/internal/pkg/agent/install/uninstall.go#L71-L84

Specifically it is coming from https://github.com/elastic/elastic-agent/blob/4b404b18debf0f83d19fa72132a9907e5fcf8641/internal/pkg/agent/install/uninstall.go#L198-L201

This leads us to https://github.com/elastic/elastic-agent/blob/4b404b18debf0f83d19fa72132a9907e5fcf8641/pkg/component/load.go#L109-L119

This error means that apm-server.spec.yml exists but apm-server.exe does not.

The exact same function call that returns this error is made right when the agent starts up, so we must have had apm-server.exe at some point and now we don't.

https://github.com/elastic/elastic-agent/blob/4b404b18debf0f83d19fa72132a9907e5fcf8641/internal/pkg/agent/application/application.go#L66-L70

cmacknz commented 7 months ago

The exact same function call that returns this error is made right when the agent starts up, so we must have had apm-server.exe at some point and now we don't.

Alternatively the agent was never installed or is not running correctly and uninstall is failing with the same error.

cmacknz commented 7 months ago

A little earlier in the buidlkite logs I see an msiexec error for an install attempt but I can't tell exactly what caused it (garbage enrollment token and URL?):

VERBOSE: Invoking msiexec.exe /i C:\Users\buildkite\esi\bin\out\elastic\elastic-agent-8.13.0-SNAPSHOT-windows-x86_64.msi /qn /norestart INSTALLARGS="--delay-enroll --url=https://placeholder:443 --enrollment-token=token" /l*v C:\buildkite-agent\builds\bk-agent-prod-gcp-1706705535567506444\elastic\elastic-stack-installers\src\agent-qa\logs\20240131-012004-i.log

...
--
  | WARNING: Elastic Agent uninstall returned:
  | WARNING: Failed to parse msi log for errors
  | WARNING: Dumping full MSI Log
  | === Verbose logging started: 1/31/2024  13:20:04  Build type: SHIP UNICODE 5.00.10011.00  Calling process: C:\Windows\system32\msiexec.exe ===
  | MSI (c) (5C:5C) [13:20:04:649]: Resetting cached policy values
  | MSI (c) (5C:5C) [13:20:04:649]: Machine policy value 'Debug' is 0

Action start 13:20:05: UpgradeAction.
--
  | SFXCA: Extracting custom action to temporary directory: C:\Windows\Installer\MSI3432.tmp-\
  | SFXCA: Binding to CLR version v4.0.30319
  | Calling custom action BeatPackageCompiler!Elastic.PackageCompiler.Beats.AgentCustomAction.UpgradeAction
  | Detected an agent upgrade via MSI, which is not supported. Aborting.
  | CustomAction UpgradeAction returned actual error code 1603 (note this may not be 100% accurate if translation happened inside sandbox)
  | Action ended 13:20:05: UpgradeAction. Return value 3.
  | Action ended 13:20:05: INSTALL. Return value 3.
=== Logging stopped: 1/31/2024  13:20:05 ===
--
  | MSI (s) (38:E8) [13:20:05:260]: User policy value 'DisableRollback' is 0
  | MSI (s) (38:E8) [13:20:05:260]: Machine policy value 'DisableRollback' is 0
  | MSI (s) (38:E8) [13:20:05:260]: Incrementing counter to disable shutdown. Counter after increment: 0
  | MSI (s) (38:E8) [13:20:05:260]: Note: 1: 1402 2: HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Installer\Rollback\Scripts 3: 2
  | MSI (s) (38:E8) [13:20:05:260]: Note: 1: 1402 2: HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Installer\Rollback\Scripts 3: 2
  | MSI (s) (38:E8) [13:20:05:260]: Decrementing counter to disable shutdown. If counter >= 0, shutdown will be denied.  Counter after decrement: -1
  | MSI (s) (38:E8) [13:20:05:260]: Destroying RemoteAPI object.
  | MSI (s) (38:B4) [13:20:05:260]: Custom Action Manager thread ending.
  | MSI (c) (5C:5C) [13:20:05:260]: Decrementing counter to disable shutdown. If counter >= 0, shutdown will be denied.  Counter after decrement: -1
  | MSI (c) (5C:5C) [13:20:05:260]: MainEngineThread is returning 1603
  | === Verbose logging stopped: 1/31/2024  13:20:05 ===
  |  
  |  
  | VERBOSE: msiexec reports error 1603 = fatal error occurred during installation.
cmacknz commented 7 months ago

Do our tests check that elastic-agent status works and returns a healthy agent? I wonder if this is a problem with the install that uninstall detects, instead of an uninstall related bug.

strawgate commented 7 months ago

Do our tests check that elastic-agent status works and returns a healthy agent? I wonder if this is a problem with the install that uninstall detects, instead of an uninstall related bug.

The tests use a garbage fleet url and enrollment token and then start the agent service and verify that the agent creates a log and that fleet enrollment fails with failed to perform delayed enrollment: fail to enroll: fail to execute request to fleet-server: lookup placeholder: no such host

Does fleet enrollment happen after verifying components?

There is a bug at the moment where the name of the test does not print before the test starts (will fix this in a PR today) but there are several tests which purposefully fail the installation to verify it fails when we expect it to fail.

This has occurred once in CI, I'll see if I can reproduce outside of CI.

cmacknz commented 7 months ago

Does fleet enrollment happen after verifying components?

~Yes~ No, for both elastic-agent enroll and elastic-agent install --delay-enroll. In the delay enroll case the delayed enrollment is done quite early:

https://github.com/elastic/elastic-agent/blob/4b404b18debf0f83d19fa72132a9907e5fcf8641/internal/pkg/agent/cmd/run.go#L196-L201

Loading inputs happens when application.New() is called later on in the same file.

https://github.com/elastic/elastic-agent/blob/4b404b18debf0f83d19fa72132a9907e5fcf8641/internal/pkg/agent/cmd/run.go#L286-L290

jlind23 commented 6 months ago

Do our tests check that elastic-agent status works and returns a healthy agent? I wonder if this is a problem with the install that uninstall detects, instead of an uninstall related bug.

You mean that a not properly installed agent can not be properly uninstalled? Shall we really fail the uninstall if an .exe that is supposed to be removed is not present anymore?

pierrehilbert commented 6 months ago

I don't think we should, this is not a great flow for our users. I would rather be more tolerant when we are uninstalling the Agent but about it but if I remember well, we are using the same part of the code in both the installation and the uninstallation.

cmacknz commented 6 months ago

Yes the uninstall ideally would tolerate this, but also there should not be a code path anywhere in the agent or MSI that leads to the apm-server binary file being removed from disk unexpectedly.

Let's focus on figuring out what happened there first. This is the first time I have ever seen this error, and it correlates exactly with the introduction of our MSI, which is certainly suspicious.

FlorianHeigl commented 6 months ago

Do our tests check that elastic-agent status works and returns a healthy agent? I wonder if this is a problem with the install that uninstall detects, instead of an uninstall related bug.

You mean that a not properly installed agent can not be properly uninstalled? Shall we really fail the uninstall if an .exe that is supposed to be removed is not present anymore?

thou shalt not fail (IOW an uninstaller should stick to the FCC rule (do not cause interference, do not get disturbed by interference) (IOW you shouldn't have bugs, but you mustn't have a non-working uninstall. how it happened is less important for the SW lifecycle than not being able to handle it)

leandrojmp commented 5 months ago

Hello,

Just curious, could this happen on 8.12.1 as well? I'm having some issues to uninstall an Agent on Windows that for some reason ended up on a broken install and now we cannot uninstall it nor install again.

I have a support ticket open, but I'm still waiting for some feedback from support.

cmacknz commented 5 months ago

Just curious, could this happen on 8.12.1 as well? I'm having some issues to uninstall an Agent on Windows that for some reason ended up on a broken install and now we cannot uninstall it nor install again.

I suppose it is possible, because we never actually found the root cause for this problem and therefore there was no fix for it. Likely instead of fixing it we should just consider this non-fatal as suggested in https://github.com/elastic/elastic-agent/issues/4167#issuecomment-1927628753 but we haven't done that yet.

The error here requires that the contents of the components sub-directory of the agent be removed before elastic-agent uninstall is executed. Likely something other than the elastic-agent uninstall command would have to remove parts of the agent installation directory to hit this.