Open pchila opened 3 months ago
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Did some additional runs on a Windows VM... It seems that the issue is intermittent which suggests some kind of race condition that make the test fail sometimes on windows unprivileged. In the specific instance I managed to reproduce only once over 3 runs of the specific test on a Win10 VM
That this error has only ever been seen in https://github.com/elastic/elastic-agent/pull/4770, and that PR also changes the policy change logic, makes me suspect it could be related to something in https://github.com/elastic/elastic-agent/pull/4770. It is also possible you've added test coverage to something that was always flaky, and now we can observe it.
Access is denied
is what you get if you try to remove a file while something else has it open. I don't think it necessarily has to do with the privilege level of the agent.
Yeah I agree that its possible this is coming from this PR. I just looked at main to try and determine how I can reproduce this and I see that this test is skipped https://github.com/elastic/elastic-agent/blob/main/testing/integration/proxy_url_test.go#L35.
Happy to help work with you on this error and why you think its specifically related to unprivileged mode and permissions on Windows.
After further development done in #4770 the proxy url tests are enabled again (those were disabled due to #4861)
In #4770 I changed a log level in the agent to see what error may cause an action dispatch failure https://github.com/elastic/elastic-agent/pull/4770/files#diff-b1931bdf74ee7a85f4291c623d9a21868c438c9ef9829e8e939f2b43906b3535R159 (before it was a simple debug log which was not printed most of the time), so that may have revealed an error that we already have before the changes...
The code changes for mTLS did not add any extra reading/writing of fleet.enc
so if there is a race condition, it's likely that we already have it in main...
I will keep investigating trying to figure out what triggers the access denied
error
I am going to disable the ProxyURL tests on windows until the access denied is resolved
After a few tries we got a CI run that contains useful information about the issue: https://buildkite.com/elastic/elastic-agent-extended-testing/builds/810#01903606-b9b9-4977-bc0d-b2327f19b1f1
Grepping fleet.enc
from the logs contained in the diagnostics we can see
➜ TestProxyURL-EnrollWithProxy-PolicyProxyTakesPrecedence-diagnostics-2024-06-20T14-59-38Z grep -ir fleet.enc ./* | sort
<omitted...>
{"file.name":"storage/encrypted_disk_store.go","file.line":129},"message":"Save of C:\\Program Files\\Elastic\\Agent\\fleet.enc started at 2024-06-20 14:54:17.257021 +0000 GMT m=+1.117881501\n","ecs.version":"1.6.0"}
./logs/elastic-agent-8.15.0-SNAPSHOT-fbeed0/elastic-agent-20240620-2.ndjson:{"log.level":"info","@timestamp":"2024-06-20T14:54:17.282Z","log.logger":"encrypted-disk-store-debug","log.origin":{"file.name":"storage/encrypted_disk_storage_debug_windows.go","file.line":41},"message":"C:\\Program Files\\Elastic\\Agent\\fleet.enc security descriptor: O:S-1-5-21-310561555-502049046-1354876720-1009G:S-1-5-21-310561555-502049046-1354876720-513","ecs.version":"1.6.0"}
./logs/elastic-agent-8.15.0-SNAPSHOT-fbeed0/elastic-agent-20240620-2.ndjson:{"log.level":"info","@timestamp":"2024-06-20T14:54:17.282Z","log.logger":"encrypted-disk-store-debug","log.origin":{"file.name":"storage/encrypted_disk_storage_debug_windows.go","file.line":61},"message":"C:\\Program Files\\Elastic\\Agent\\fleet.enc stat:\n{\n \"FileAttributes\": 32,\n \"CreationTime\": {\n \"LowDateTime\": 3101609211,\n \"HighDateTime\": 31114017\n },\n \"LastAccessTime\": {\n \"LowDateTime\": 3101784772,\n \"HighDateTime\": 31114017\n },\n \"LastWriteTime\": {\n \"LowDateTime\": 3101784772,\n \"HighDateTime\": 31114017\n },\n \"FileSizeHigh\": 0,\n \"FileSizeLow\": 274,\n \"ReparseTag\": 0\n}\n","ecs.version":"1.6.0"}
./logs/elastic-agent-8.15.0-SNAPSHOT-fbeed0/elastic-agent-20240620-2.ndjson:{"log.level":"info","@timestamp":"2024-06-20T14:54:17.288Z","log.logger":"encrypted-disk-store-debug","log.origin":{"file.name":"storage/encrypted_disk_storage_debug_windows.go","file.line":53},"message":"owner for \"C:\\\\Program Files\\\\Elastic\\\\Agent\\\\fleet.enc\": OGC-WINDOWS-AMD\\elastic-agent-user account type 1","ecs.version":"1.6.0"}
./logs/elastic-agent-8.15.0-SNAPSHOT-fbeed0/elastic-agent-20240620-2.ndjson:{"log.level":"info","@timestamp":"2024-06-20T14:54:17.288Z","log.logger":"encrypted-disk-store-debug","log.origin":{"file.name":"storage/encrypted_disk_store.go","file.line":131},"message":"Save of C:\\Program Files\\Elastic\\Agent\\fleet.enc finished at 2024-06-20 14:54:17.2883767 +0000 GMT m=+1.149237201\n","ecs.version":"1.6.0"}
./logs/elastic-agent-8.15.0-SNAPSHOT-fbeed0/elastic-agent-20240620-3.ndjson:{"log.level":"error","@timestamp":"2024-06-20T14:54:35.407Z","log.origin":{"file.name":"dispatcher/dispatcher.go","file.line":161},"message":"Failed to dispatch action id \"actionID-TestValidProxyInThePolicy\" of type \"POLICY_CHANGE\", error: saving config: fail to persist new Fleet Server API client hosts: could not replace target file C:\\Program Files\\Elastic\\Agent\\fleet.enc: rename C:\\Program Files\\Elastic\\Agent\\fleet.enc.tmp C:\\Program Files\\Elastic\\Agent\\fleet.enc: Access is denied.","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
<omitted...>
Along with the logs from the test report
[ ===] Waiting For Enroll... [4s] {"log.level":"warn","@timestamp":"2024-06-20T14:54:16.230Z","log.logger":"tls","log.origin":{"file.name":"tlscommon/tls_config.go","file.line":107},"message":"SSL/TLS verifications disabled.","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-06-20T14:54:16.783Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":518},"message":"Starting enrollment to URL: http://fleet.elastic.co/","ecs.version":"1.6.0"}
Load of C:\Program Files\Elastic\Agent\fleet.enc started at 2024-06-20 14:54:16.9998057 +0000 GMT m=+0.860666201Load of C:\Program Files\Elastic\Agent\fleet.enc finished at 2024-06-20 14:54:17.0038126 +0000 GMT m=+0.864673101{"log.level":"info","@timestamp":"2024-06-20T14:54:17.674Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":481},"message":"Restarting agent daemon, attempt 0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-06-20T14:54:17.683Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":299},"message":"Successfully triggered restart on running Elastic Agent.","ecs.version":"1.6.0"}
Successfully enrolled the Elastic Agent.
If we look at the timestamp when the enroll command is restarting agent after writing fleet.enc 2024-06-20T14:54:17.674Z
and the logs just before logging the access denied
claiming that saving of fleet.enc completed at 2024-06-20 14:54:17.2883767 +0000 GMT
we can assume that there's some interaction between writing from enroll and saving the very first policy data.
This appears in the tests for 2 reasons:
On windows having such a short time where 2 processes try to modify the same file seems to cause the access denied error, maybe this can be mitigated by making the mock fleet server slow down a few milliseconds before sending out the action or having it resend on the next checkin if it's not been acknowledged.
oh joy.
In both cases I think we are using the file.SafeFileRotate
to create a temp file and then rename it to the destination. On Windows the Rename ends up being a call to https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-movefileexw with MOVEFILE_REPLACE_EXISTING option. We probably need to make sure that file.SafeFileRotate
doesn't have a race condition when called in rapid succession.
We discuss this issue in the weekly meeting and I wanted to document it here. I believe the following change will result in this no longer having a race condition and overall will provide a better experience for installation.
While validating PR #4770
TestProxyURL
test fails withCI run -> https://buildkite.com/elastic/elastic-agent-extended-testing/builds/538#01900846-16ad-428c-8a60-5a48324d55f5
As a workaround the test now install elastic-agent as privileged on Windows (with this the test runs correctly) but it seems that there are some issues with
fleet.enc
permissions in unprivileged mode not directly related to the developed enhancement.This issue is there to track this issue and collect results of further investigation,