elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.
Other
127 stars 136 forks source link

[Windows] TestProxyURL fails with `access denied` error on `fleet.enc` #4913

Open pchila opened 3 months ago

pchila commented 3 months ago

While validating PR #4770 TestProxyURL test fails with

{"log.level":"error","@timestamp":"2024-06-11T17:55:59.378Z","log.origin":{"file.name":"dispatcher/dispatcher.go","file.line":161},"message":"Failed to dispatch action id \"actionID-TestValidProxyInThePolicy\" of type \"POLICY_CHANGE\", error: saving config: fail to persist new Fleet Server API client hosts: could not replace target file C:\\Program Files\\Elastic\\Agent\\fleet.enc: rename C:\\Program Files\\Elastic\\Agent\\fleet.enc.tmp C:\\Program Files\\Elastic\\Agent\\fleet.enc: Access is denied.","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

CI run -> https://buildkite.com/elastic/elastic-agent-extended-testing/builds/538#01900846-16ad-428c-8a60-5a48324d55f5

As a workaround the test now install elastic-agent as privileged on Windows (with this the test runs correctly) but it seems that there are some issues with fleet.enc permissions in unprivileged mode not directly related to the developed enhancement.

This issue is there to track this issue and collect results of further investigation,

elasticmachine commented 3 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

elasticmachine commented 3 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

pchila commented 3 months ago

Did some additional runs on a Windows VM... It seems that the issue is intermittent which suggests some kind of race condition that make the test fail sometimes on windows unprivileged. In the specific instance I managed to reproduce only once over 3 runs of the specific test on a Win10 VM

cmacknz commented 3 months ago

That this error has only ever been seen in https://github.com/elastic/elastic-agent/pull/4770, and that PR also changes the policy change logic, makes me suspect it could be related to something in https://github.com/elastic/elastic-agent/pull/4770. It is also possible you've added test coverage to something that was always flaky, and now we can observe it.

Access is denied is what you get if you try to remove a file while something else has it open. I don't think it necessarily has to do with the privilege level of the agent.

blakerouse commented 3 months ago

Yeah I agree that its possible this is coming from this PR. I just looked at main to try and determine how I can reproduce this and I see that this test is skipped https://github.com/elastic/elastic-agent/blob/main/testing/integration/proxy_url_test.go#L35.

Happy to help work with you on this error and why you think its specifically related to unprivileged mode and permissions on Windows.

pchila commented 3 months ago

After further development done in #4770 the proxy url tests are enabled again (those were disabled due to #4861)

In #4770 I changed a log level in the agent to see what error may cause an action dispatch failure https://github.com/elastic/elastic-agent/pull/4770/files#diff-b1931bdf74ee7a85f4291c623d9a21868c438c9ef9829e8e939f2b43906b3535R159 (before it was a simple debug log which was not printed most of the time), so that may have revealed an error that we already have before the changes...

The code changes for mTLS did not add any extra reading/writing of fleet.enc so if there is a race condition, it's likely that we already have it in main...

I will keep investigating trying to figure out what triggers the access denied error

pchila commented 3 months ago

Another occurrence https://buildkite.com/elastic/elastic-agent-extended-testing/builds/680#019024ab-6827-4cec-8866-b5cd1284b580

pchila commented 3 months ago

I am going to disable the ProxyURL tests on windows until the access denied is resolved

pchila commented 3 months ago

After a few tries we got a CI run that contains useful information about the issue: https://buildkite.com/elastic/elastic-agent-extended-testing/builds/810#01903606-b9b9-4977-bc0d-b2327f19b1f1

Grepping fleet.enc from the logs contained in the diagnostics we can see

➜  TestProxyURL-EnrollWithProxy-PolicyProxyTakesPrecedence-diagnostics-2024-06-20T14-59-38Z grep -ir fleet.enc ./* | sort
<omitted...>
{"file.name":"storage/encrypted_disk_store.go","file.line":129},"message":"Save of C:\\Program Files\\Elastic\\Agent\\fleet.enc started at 2024-06-20 14:54:17.257021 +0000 GMT m=+1.117881501\n","ecs.version":"1.6.0"}
./logs/elastic-agent-8.15.0-SNAPSHOT-fbeed0/elastic-agent-20240620-2.ndjson:{"log.level":"info","@timestamp":"2024-06-20T14:54:17.282Z","log.logger":"encrypted-disk-store-debug","log.origin":{"file.name":"storage/encrypted_disk_storage_debug_windows.go","file.line":41},"message":"C:\\Program Files\\Elastic\\Agent\\fleet.enc security descriptor: O:S-1-5-21-310561555-502049046-1354876720-1009G:S-1-5-21-310561555-502049046-1354876720-513","ecs.version":"1.6.0"}
./logs/elastic-agent-8.15.0-SNAPSHOT-fbeed0/elastic-agent-20240620-2.ndjson:{"log.level":"info","@timestamp":"2024-06-20T14:54:17.282Z","log.logger":"encrypted-disk-store-debug","log.origin":{"file.name":"storage/encrypted_disk_storage_debug_windows.go","file.line":61},"message":"C:\\Program Files\\Elastic\\Agent\\fleet.enc stat:\n{\n  \"FileAttributes\": 32,\n  \"CreationTime\": {\n    \"LowDateTime\": 3101609211,\n    \"HighDateTime\": 31114017\n  },\n  \"LastAccessTime\": {\n    \"LowDateTime\": 3101784772,\n    \"HighDateTime\": 31114017\n  },\n  \"LastWriteTime\": {\n    \"LowDateTime\": 3101784772,\n    \"HighDateTime\": 31114017\n  },\n  \"FileSizeHigh\": 0,\n  \"FileSizeLow\": 274,\n  \"ReparseTag\": 0\n}\n","ecs.version":"1.6.0"}
./logs/elastic-agent-8.15.0-SNAPSHOT-fbeed0/elastic-agent-20240620-2.ndjson:{"log.level":"info","@timestamp":"2024-06-20T14:54:17.288Z","log.logger":"encrypted-disk-store-debug","log.origin":{"file.name":"storage/encrypted_disk_storage_debug_windows.go","file.line":53},"message":"owner for \"C:\\\\Program Files\\\\Elastic\\\\Agent\\\\fleet.enc\": OGC-WINDOWS-AMD\\elastic-agent-user account type 1","ecs.version":"1.6.0"}
./logs/elastic-agent-8.15.0-SNAPSHOT-fbeed0/elastic-agent-20240620-2.ndjson:{"log.level":"info","@timestamp":"2024-06-20T14:54:17.288Z","log.logger":"encrypted-disk-store-debug","log.origin":{"file.name":"storage/encrypted_disk_store.go","file.line":131},"message":"Save of C:\\Program Files\\Elastic\\Agent\\fleet.enc finished at 2024-06-20 14:54:17.2883767 +0000 GMT m=+1.149237201\n","ecs.version":"1.6.0"}
./logs/elastic-agent-8.15.0-SNAPSHOT-fbeed0/elastic-agent-20240620-3.ndjson:{"log.level":"error","@timestamp":"2024-06-20T14:54:35.407Z","log.origin":{"file.name":"dispatcher/dispatcher.go","file.line":161},"message":"Failed to dispatch action id \"actionID-TestValidProxyInThePolicy\" of type \"POLICY_CHANGE\", error: saving config: fail to persist new Fleet Server API client hosts: could not replace target file C:\\Program Files\\Elastic\\Agent\\fleet.enc: rename C:\\Program Files\\Elastic\\Agent\\fleet.enc.tmp C:\\Program Files\\Elastic\\Agent\\fleet.enc: Access is denied.","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

<omitted...>

Along with the logs from the test report

[ ===] Waiting For Enroll...  [4s] {"log.level":"warn","@timestamp":"2024-06-20T14:54:16.230Z","log.logger":"tls","log.origin":{"file.name":"tlscommon/tls_config.go","file.line":107},"message":"SSL/TLS verifications disabled.","ecs.version":"1.6.0"}
        {"log.level":"info","@timestamp":"2024-06-20T14:54:16.783Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":518},"message":"Starting enrollment to URL: http://fleet.elastic.co/","ecs.version":"1.6.0"}
        Load of C:\Program Files\Elastic\Agent\fleet.enc started at 2024-06-20 14:54:16.9998057 +0000 GMT m=+0.860666201Load of C:\Program Files\Elastic\Agent\fleet.enc finished at 2024-06-20 14:54:17.0038126 +0000 GMT m=+0.864673101{"log.level":"info","@timestamp":"2024-06-20T14:54:17.674Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":481},"message":"Restarting agent daemon, attempt 0","ecs.version":"1.6.0"}
        {"log.level":"info","@timestamp":"2024-06-20T14:54:17.683Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":299},"message":"Successfully triggered restart on running Elastic Agent.","ecs.version":"1.6.0"}
        Successfully enrolled the Elastic Agent.

If we look at the timestamp when the enroll command is restarting agent after writing fleet.enc 2024-06-20T14:54:17.674Z and the logs just before logging the access denied claiming that saving of fleet.enc completed at 2024-06-20 14:54:17.2883767 +0000 GMT we can assume that there's some interaction between writing from enroll and saving the very first policy data.

This appears in the tests for 2 reasons:

On windows having such a short time where 2 processes try to modify the same file seems to cause the access denied error, maybe this can be mitigated by making the mock fleet server slow down a few milliseconds before sending out the action or having it resend on the next checkin if it's not been acknowledged.

leehinman commented 3 months ago

oh joy.

In both cases I think we are using the file.SafeFileRotate to create a temp file and then rename it to the destination. On Windows the Rename ends up being a call to https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-movefileexw with MOVEFILE_REPLACE_EXISTING option. We probably need to make sure that file.SafeFileRotate doesn't have a race condition when called in rapid succession.

blakerouse commented 3 months ago

We discuss this issue in the weekly meeting and I wanted to document it here. I believe the following change will result in this no longer having a race condition and overall will provide a better experience for installation.

  1. Write the installation directory (but do not install the service)
  2. Perform enrollment inside of the installation directory (but do not tell it to restart the running daemon)
  3. Install the service
  4. Start the service