Open rdner opened 6 months ago
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
My initial suspicion is that this is another case where the commit hash of the build the tests download from staging or snapshot does not match what the agent downloaded.
upgrader.go:491: waiting for healthy agent and proper version: commits don't match: got 8c6e7e59fdde1e4e9ab6ef9394e6c6a5b13c9628, want 5e17bc222dfa1475bbdf8cade30d67a44dceb436
fixture.go:632: >> running binary with: [C:\Program Files\Elastic\Agent\elastic-agent.exe status --output json]
upgrader.go:491: waiting for healthy agent and proper version: commits don't match: got 8c6e7e59fdde1e4e9ab6ef9394e6c6a5b13c9628, want 5e17bc222dfa1475bbdf8cade30d67a44dceb436
fixture.go:632: >> running binary with: [C:\Program Files\Elastic\Agent\elastic-agent.exe status --output json]
@cmacknz could be, I'll keep my eyes on it.
A similar failure in this build https://buildkite.com/elastic/elastic-agent/builds/7735#018e351e-0af4-43f0-ac0c-8b3c04ac941b
Another failure in https://buildkite.com/elastic/elastic-agent/builds/7772#018e3986-f404-4d25-864f-283da07ce772
upgrade_fleet_test.go:298: Waiting for upgrade watcher to start...
upgrade_fleet_test.go:300:
Error Trace: /home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:300
/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:103
/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:63
Error: Received unexpected error:
context deadline exceeded
Test: TestFleetManagedUpgradePrivileged
Messages: upgrade watcher did not start
fixture_install.go:233: [test TestFleetManagedUpgradePrivileged] Inside fixture cleanup function
I suspect this problem is caused by CDN issues where the download gets stuck. Implementing this improvement should help: https://github.com/elastic/elastic-agent/issues/4409
Another perhaps related failure in https://buildkite.com/elastic/elastic-agent/builds/7735#018e351e-0af4-43f0-ac0c-8b3c04ac941b
I find it suspicious that these issues are only spotted in upgrades involving Fleet.
The tests work a bit differently: the way the standalone tests are done is less likely to hit problems with the CDN download, because they only download once, outside of the agent.
Our standalone upgrades download the upgrade artifact to disk once, and then use elastic-agent upgrade --source-uri file://... to have the agent "download" that artifact from disk. We also set --skip-verify so we don't need a valid GPG signature.
For a Fleet upgrade we have to upgrade to an official release or a snapshot with a valid GPG signature (Fleet doesn't and shouldn't support --skip-verify), so the agent itself is always downloading the upgrade artifact from the CDN.
Another failure in 8.13 https://buildkite.com/elastic/elastic-agent/builds/8250#018ec323-31b1-47e3-8e4f-9f6a848f57ce
In these diagnostics the update state shows as UPG_EXTRACTING, which is odd.
upgrade_details:
  action_id: 71691d31-3bc3-4196-b112-16eea2b69996
  metadata:
    download_percent: 1
    retry_until: null
  state: UPG_EXTRACTING
  target_version: 8.13.2-SNAPSHOT
Here are the full logs of the upgrade, which span from 2024-04-09T14:20:09.500Z to 2024-04-09T14:23:52.070Z, about 3m43s.
We never made it to the point where we start the watcher, so the test is right to time out waiting for it:
Error Trace: /home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:300
/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:103
/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:63
Error: Received unexpected error:
context deadline exceeded
Looking at the timestamps in the logs below, we spend the full 5 minutes waiting for the watcher, from 2024-04-09T14:20:12.523582871Z to 2024-04-09T14:25:12.550663015Z. From above, the download started extracting at 2024-04-09T14:23:52.070Z, so of these 5 minutes we effectively waited only 1m20s for the extraction to finish and the watcher to start, and in that time we didn't finish extracting the upgrade artifact.
{"Time":"2024-04-09T14:20:12.523582871Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration(linux-arm64-ubuntu-2204-fleet-privileged)(sudo)","Test":"TestFleetManagedUpgradePrivileged","Output":" upgrade_fleet_test.go:298: Waiting for upgrade watcher to start...\n"}
{"Time":"2024-04-09T14:25:12.550663015Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration(linux-arm64-ubuntu-2204-fleet-privileged)(sudo)","Test":"TestFleetManagedUpgradePrivileged","Output":" upgrade_fleet_test.go:300: \n"}
{"Time":"2024-04-09T14:25:12.550732135Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration(linux-arm64-ubuntu-2204-fleet-privileged)(sudo)","Test":"TestFleetManagedUpgradePrivileged","Output":" \tError Trace:\t/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:300\n"}
{"Time":"2024-04-09T14:25:12.550739855Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration(linux-arm64-ubuntu-2204-fleet-privileged)(sudo)","Test":"TestFleetManagedUpgradePrivileged","Output":" \t \t\t\t\t/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:103\n"}
{"Time":"2024-04-09T14:25:12.550744015Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration(linux-arm64-ubuntu-2204-fleet-privileged)(sudo)","Test":"TestFleetManagedUpgradePrivileged","Output":" \t \t\t\t\t/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:63\n"}
{"Time":"2024-04-09T14:25:12.550746695Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration(linux-arm64-ubuntu-2204-fleet-privileged)(sudo)","Test":"TestFleetManagedUpgradePrivileged","Output":" \tError: \tReceived unexpected error:\n"}
{"Time":"2024-04-09T14:25:12.550752535Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration(linux-arm64-ubuntu-2204-fleet-privileged)(sudo)","Test":"TestFleetManagedUpgradePrivileged","Output":" \t \tcontext deadline exceeded\n"}
Looking at this, the two possibilities are:
Failing test case
TestFleetManagedUpgradeUnprivileged, TestFleetManagedUpgradePrivileged
Error message
context deadline exceeded
Build
https://buildkite.com/elastic/elastic-agent/builds/7484#018de61d-8da1-4244-a652-cfa842a3c7ed
OS
Linux, Windows
Stacktrace and notes