elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.
Other
124 stars 133 forks source link

[Flaky Test]: TestFleetManagedUpgradeUnprivileged, TestFleetManagedUpgradePrivileged – context deadline exceeded #4339

Open rdner opened 6 months ago

rdner commented 6 months ago

Failing test case

TestFleetManagedUpgradeUnprivileged, TestFleetManagedUpgradePrivileged

Error message

context deadline exceeded

Build

https://buildkite.com/elastic/elastic-agent/builds/7484#018de61d-8da1-4244-a652-cfa842a3c7ed

OS

Linux, Windows

Stacktrace and notes

upgrade_fleet_test.go:279: Upgrading from version "8.14.0-SNAPSHOT-232235d5c12471660fe46b5a1fd5366357472e90" to version "8.14.0-SNAPSHOT-5e17bc222dfa1475bbdf8cade30d67a44dceb436"...
    upgrade_fleet_test.go:285: Waiting from upgrade details to show up in Fleet
    upgrade_fleet_test.go:298: Waiting for upgrade watcher to start...
    upgrade_fleet_test.go:300: 
            Error Trace:    /home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:300
                                        /home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:103
                                        /home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:49
            Error:          Received unexpected error:
                            context deadline exceeded
            Test:           TestFleetManagedUpgradeUnprivileged
            Messages:       upgrade watcher did not start
    fixture_install.go:211: [test TestFleetManagedUpgradeUnprivileged] Inside fixture cleanup function
    fixture_install.go:227: collecting diagnostics; test failed
    fixture.go:632: >> running binary with: [/opt/Elastic/Agent/elastic-agent diagnostics -f /home/ubuntu/agent/build/diagnostics/TestFleetManagedUpgradeUnprivileged-diagnostics-2024-02-26T16:24:07Z.zip]
    fixture.go:632: >> running binary with: [/opt/Elastic/Agent/elastic-agent uninstall --force]
    fixture_install.go:186: Dumping running processes in /home/ubuntu/agent/build/diagnostics/TEST-TestFleetManagedUpgradeUnprivileged-linux-arm64-ProcessDump.json
--- FAIL: TestFleetManagedUpgradeUnprivileged (404.79s)

=======

  fixture.go:632: >> running binary with: [C:\Program Files\Elastic\Agent\elastic-agent.exe status --output json]
    upgrader.go:491: waiting for healthy agent and proper version: could not unmarshal agent status output: error: context deadline exceeded, output: 
        unexpected end of JSON input
    upgrade_fleet_test.go:305: 
            Error Trace:    C:/Users/windows/agent/testing/integration/upgrade_fleet_test.go:305
                                        C:/Users/windows/agent/testing/integration/upgrade_fleet_test.go:103
                                        C:/Users/windows/agent/testing/integration/upgrade_fleet_test.go:63
            Error:          Received unexpected error:
                            failed waiting for healthy agent and version (context deadline exceeded): could not unmarshal agent status output: error: context deadline exceeded, output: 
                            unexpected end of JSON input
            Test:           TestFleetManagedUpgradePrivileged
    fixture_install.go:211: [test TestFleetManagedUpgradePrivileged] Inside fixture cleanup function
    fixture_install.go:227: collecting diagnostics; test failed
    fixture.go:632: >> running binary with: [C:\Program Files\Elastic\Agent\elastic-agent.exe diagnostics -f C:\Users\windows\agent\build\diagnostics\TestFleetManagedUpgradePrivileged-diagnostics-2024-02-26T16-24-12Z.zip]
    fixture.go:632: >> running binary with: [C:\Program Files\Elastic\Agent\elastic-agent.exe uninstall --force]
    fixture_install.go:186: Dumping running processes in C:\Users\windows\agent\build\diagnostics\TEST-TestFleetManagedUpgradePrivileged-windows-amd64-ProcessDump.json
--- FAIL: TestFleetManagedUpgradePrivileged (328.62s)
elasticmachine commented 6 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz commented 6 months ago

My initial suspicion is that this is another example of the hash the tests downloading from staging or snapshot not matching what the agent downloaded.

upgrader.go:491: waiting for healthy agent and proper version: commits don't match: got 8c6e7e59fdde1e4e9ab6ef9394e6c6a5b13c9628, want 5e17bc222dfa1475bbdf8cade30d67a44dceb436
    fixture.go:632: >> running binary with: [C:\Program Files\Elastic\Agent\elastic-agent.exe status --output json]
    upgrader.go:491: waiting for healthy agent and proper version: commits don't match: got 8c6e7e59fdde1e4e9ab6ef9394e6c6a5b13c9628, want 5e17bc222dfa1475bbdf8cade30d67a44dceb436
    fixture.go:632: >> running binary with: [C:\Program Files\Elastic\Agent\elastic-agent.exe status --output json]
rdner commented 6 months ago

@cmacknz could be, I'll keep my eyes on it.

rdner commented 6 months ago

A similar failure in this build https://buildkite.com/elastic/elastic-agent/builds/7735#018e351e-0af4-43f0-ac0c-8b3c04ac941b

rdner commented 6 months ago

Another failure in https://buildkite.com/elastic/elastic-agent/builds/7772#018e3986-f404-4d25-864f-283da07ce772

upgrade_fleet_test.go:298: Waiting for upgrade watcher to start...
    upgrade_fleet_test.go:300: 
            Error Trace:    /home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:300
                                        /home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:103
                                        /home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:63
            Error:          Received unexpected error:
                            context deadline exceeded
            Test:           TestFleetManagedUpgradePrivileged
            Messages:       upgrade watcher did not start
    fixture_install.go:233: [test TestFleetManagedUpgradePrivileged] Inside fixture cleanup function
rdner commented 6 months ago

I suspect this problem can be caused by the CDN issues when the download gets stuck. Implementing this improvement should help https://github.com/elastic/elastic-agent/issues/4409

rdner commented 6 months ago

Another perhaps related failure in https://buildkite.com/elastic/elastic-agent/builds/7735#018e351e-0af4-43f0-ac0c-8b3c04ac941b

I find it suspicious that these issues are only spotted in upgrades involving Fleet.

cmacknz commented 6 months ago

I find it suspicious that these issues are only spotted in upgrades involving Fleet.

The tests work a bit differently, the way standalone tests are done are less likely to hit problems with the CDN download because they only do it once outside of the agent.

Our standalone upgrades download the upgrade artifact to disk once, and then use elastic-agent update --source-uri file://... to have the agent "download" that artifact from disk. We also set --skip-verify so we don't need a valid GPG signature.

https://github.com/elastic/elastic-agent/blob/6135d60613964712ae7b574f04b4ad112acecba7/testing/upgradetest/upgrader.go#L588

For a Fleet upgrade we have to upgrade to an official release or a snapshot with a valid GPG signature (Fleet doesn't and shouldn't support --skip-verify), so the agent itself is always downloading the upgrade artifact from the CDN.

https://github.com/elastic/elastic-agent/blob/6135d60613964712ae7b574f04b4ad112acecba7/testing/integration/upgrade_fleet_test.go#L279-L283

rdner commented 5 months ago

Another failure in 8.13 https://buildkite.com/elastic/elastic-agent/builds/8250#018ec323-31b1-47e3-8e4f-9f6a848f57ce

cmacknz commented 5 months ago

Another failure in 8.13 https://buildkite.com/elastic/elastic-agent/builds/8250#018ec323-31b1-47e3-8e4f-9f6a848f57ce

In these diagnostics the update state shows as UPG_EXTRACTING which is odd.

upgrade_details:
    action_id: 71691d31-3bc3-4196-b112-16eea2b69996
    metadata:
        download_percent: 1
        retry_until: null
    state: UPG_EXTRACTING
    target_version: 8.13.2-SNAPSHOT

Here is the full logs of the upgrade which span from 2024-04-09T14:20:09.500Z to 2024-04-09T14:23:52.070Z over 3m43s.

```json {"log.level":"info","@timestamp":"2024-04-09T14:20:09.500Z","log.origin":{"file.name":"handlers/handler_action_upgrade.go","file.line":77},"message":"starting upgrade to version 8.13.2-SNAPSHOT in background","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:20:09.500Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":522},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.13.2-SNAPSHOT","state":"UPG_REQUESTED","action_id":"71691d31-3bc3-4196-b112-16eea2b69996","metadata":{}},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:20:09.500Z","log.origin":{"file.name":"upgrade/upgrade.go","file.line":162},"message":"Upgrading agent","log":{"source":"elastic-agent"},"version":"8.13.2-SNAPSHOT","source_uri":"","ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:20:09.500Z","log.origin":{"file.name":"upgrade/cleanup.go","file.line":22},"message":"Cleaning up non-matching downloaded versions","log":{"source":"elastic-agent"},"version":"8.13.2","downloads.path":"/opt/Elastic/Agent/data/elastic-agent-8.13.2-SNAPSHOT-bc25f4/downloads","ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:20:09.500Z","log.origin":{"file.name":"upgrade/step_download.go","file.line":91},"message":"Downloading upgrade artifact","log":{"source":"elastic-agent"},"version":"8.13.2-SNAPSHOT","source_uri":"https://artifacts.elastic.co/downloads/","drop_path":"","target_path":"/opt/Elastic/Agent/data/elastic-agent-8.13.2-SNAPSHOT-bc25f4/downloads","install_path":"/opt/Elastic/Agent/data/elastic-agent-8.13.2-SNAPSHOT-bc25f4/install","ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:20:09.500Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":522},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.13.2-SNAPSHOT","state":"UPG_DOWNLOADING","action_id":"71691d31-3bc3-4196-b112-16eea2b69996","metadata":{}},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:20:09.500Z","log.origin":{"file.name":"upgrade/step_download.go","file.line":242},"message":"download attempt 1","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:20:09.501Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":522},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.13.2-SNAPSHOT","state":"UPG_DOWNLOADING","action_id":"71691d31-3bc3-4196-b112-16eea2b69996","metadata":{"retry_until":"2024-04-09T16:20:09.500963627Z"}},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:20:09.795Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":522},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.13.2-SNAPSHOT","state":"UPG_DOWNLOADING","action_id":"71691d31-3bc3-4196-b112-16eea2b69996","metadata":{"retry_until":"2024-04-09T16:20:09.500963627Z"}},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:23:31.363Z","log.origin":{"file.name":"http/progress_observer.go","file.line":71},"message":"download from https://snapshots.elastic.co/8.13.2-948b05e8/downloads/beats/elastic-agent/elastic-agent-8.13.2-SNAPSHOT-linux-arm64.tar.gz completed in 3 minutes @ 3.355MBps","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:23:31.363Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":522},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.13.2-SNAPSHOT","state":"UPG_DOWNLOADING","action_id":"71691d31-3bc3-4196-b112-16eea2b69996","metadata":{"download_percent":1,"download_rate":"3.35518191MBps","retry_until":"2024-04-09T16:20:09.500963627Z"}},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:23:31.405Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":522},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.13.2-SNAPSHOT","state":"UPG_DOWNLOADING","action_id":"71691d31-3bc3-4196-b112-16eea2b69996","metadata":{"download_percent":1,"download_rate":"3.35518191MBps","retry_until":"2024-04-09T16:20:09.500963627Z"}},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:23:31.405Z","log.origin":{"file.name":"http/progress_observer.go","file.line":71},"message":"download from https://snapshots.elastic.co/8.13.2-948b05e8/downloads/beats/elastic-agent/elastic-agent-8.13.2-SNAPSHOT-linux-arm64.tar.gz.sha512 completed in Less than a second @ +InfYBps","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:23:31.405Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":522},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.13.2-SNAPSHOT","state":"UPG_DOWNLOADING","action_id":"71691d31-3bc3-4196-b112-16eea2b69996","metadata":{"download_percent":1,"retry_until":"2024-04-09T16:20:09.500963627Z"}},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:23:31.405Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":522},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.13.2-SNAPSHOT","state":"UPG_DOWNLOADING","action_id":"71691d31-3bc3-4196-b112-16eea2b69996","metadata":{"download_percent":1,"retry_until":"2024-04-09T16:20:09.500963627Z"}},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:23:31.405Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":522},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.13.2-SNAPSHOT","state":"UPG_DOWNLOADING","action_id":"71691d31-3bc3-4196-b112-16eea2b69996","metadata":{"download_percent":1}},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:23:38.240Z","log.origin":{"file.name":"download/verifier.go","file.line":248},"message":"Default PGP appended","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"} {"log.level":"warn","@timestamp":"2024-04-09T14:23:38.481Z","log.origin":{"file.name":"download/verifier.go","file.line":287},"message":"Failed to fetch remote PGP key from \"https://3caeb4dcf5a44f78ac0b97f11a7fe8ed.fleet.us-west2.gcp.elastic-cloud.com:443/api/agents/upgrades/8.13.2/pgp-public-key\": call to 'https://3caeb4dcf5a44f78ac0b97f11a7fe8ed.fleet.us-west2.gcp.elastic-cloud.com:443/api/agents/upgrades/8.13.2/pgp-public-key' returned unsuccessful status code: 401","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:23:38.481Z","log.origin":{"file.name":"download/verifier.go","file.line":273},"message":"Using 2 PGP keys","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"} {"log.level":"warn","@timestamp":"2024-04-09T14:23:38.481Z","log.origin":{"file.name":"composed/verifier.go","file.line":54},"message":"Verifier failed!","log":{"source":"elastic-agent"},"verifier":"fs.verifier","error":{"message":"could not get .asc file: fetching asc file from '/opt/Elastic/Agent/data/elastic-agent-8.13.2-SNAPSHOT-bc25f4/downloads/elastic-agent-8.13.2-SNAPSHOT-linux-arm64.tar.gz.asc': open /opt/Elastic/Agent/data/elastic-agent-8.13.2-SNAPSHOT-bc25f4/downloads/elastic-agent-8.13.2-SNAPSHOT-linux-arm64.tar.gz.asc: no such file or directory"},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:23:45.149Z","log.origin":{"file.name":"download/verifier.go","file.line":248},"message":"Default PGP appended","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"} {"log.level":"warn","@timestamp":"2024-04-09T14:23:45.338Z","log.origin":{"file.name":"download/verifier.go","file.line":287},"message":"Failed to fetch remote PGP key from \"https://3caeb4dcf5a44f78ac0b97f11a7fe8ed.fleet.us-west2.gcp.elastic-cloud.com:443/api/agents/upgrades/8.13.2/pgp-public-key\": call to 'https://3caeb4dcf5a44f78ac0b97f11a7fe8ed.fleet.us-west2.gcp.elastic-cloud.com:443/api/agents/upgrades/8.13.2/pgp-public-key' returned unsuccessful status code: 401","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:23:45.338Z","log.origin":{"file.name":"download/verifier.go","file.line":273},"message":"Using 2 PGP keys","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:23:52.070Z","log.origin":{"file.name":"download/verifier.go","file.line":210},"message":"Verification with PGP[0] successful","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"} {"log.level":"info","@timestamp":"2024-04-09T14:23:52.070Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":522},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.13.2-SNAPSHOT","state":"UPG_EXTRACTING","action_id":"71691d31-3bc3-4196-b112-16eea2b69996","metadata":{"download_percent":1}},"ecs.version":"1.6.0"} ```

We never made it to the point where we start the watcher, so the test is right to time out waiting for it:

            Error Trace:    /home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:300
                                        /home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:103
                                        /home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:63
            Error:          Received unexpected error:
                            context deadline exceeded

https://github.com/elastic/elastic-agent/blob/bc25f47b9fb0b20f57bfd023768fae12031489d6/testing/integration/upgrade_fleet_test.go#L297-L301

Looking at the timestamps in the logs below, we spend the full 5 minutes waiting for the watcher from 2024-04-09T14:20:12.523582871Z to 2024-04-09T14:25:12.550663015Z. From above the download started extracting at 2024-04-09T14:23:52.070Z so of this 5 minutes we only waited for the extraction to finish and then to start the watcher for effectively 1m20s during which we didn't finish extracting the upgrade artifact.

{"Time":"2024-04-09T14:20:12.523582871Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration(linux-arm64-ubuntu-2204-fleet-privileged)(sudo)","Test":"TestFleetManagedUpgradePrivileged","Output":"    upgrade_fleet_test.go:298: Waiting for upgrade watcher to start...\n"}
{"Time":"2024-04-09T14:25:12.550663015Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration(linux-arm64-ubuntu-2204-fleet-privileged)(sudo)","Test":"TestFleetManagedUpgradePrivileged","Output":"    upgrade_fleet_test.go:300: \n"}
{"Time":"2024-04-09T14:25:12.550732135Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration(linux-arm64-ubuntu-2204-fleet-privileged)(sudo)","Test":"TestFleetManagedUpgradePrivileged","Output":"        \tError Trace:\t/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:300\n"}
{"Time":"2024-04-09T14:25:12.550739855Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration(linux-arm64-ubuntu-2204-fleet-privileged)(sudo)","Test":"TestFleetManagedUpgradePrivileged","Output":"        \t            \t\t\t\t/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:103\n"}
{"Time":"2024-04-09T14:25:12.550744015Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration(linux-arm64-ubuntu-2204-fleet-privileged)(sudo)","Test":"TestFleetManagedUpgradePrivileged","Output":"        \t            \t\t\t\t/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:63\n"}
{"Time":"2024-04-09T14:25:12.550746695Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration(linux-arm64-ubuntu-2204-fleet-privileged)(sudo)","Test":"TestFleetManagedUpgradePrivileged","Output":"        \tError:      \tReceived unexpected error:\n"}
{"Time":"2024-04-09T14:25:12.550752535Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration(linux-arm64-ubuntu-2204-fleet-privileged)(sudo)","Test":"TestFleetManagedUpgradePrivileged","Output":"        \t            \tcontext deadline exceeded\n"}

Looking at this the two possibilities are:

  1. We got stuck during the extraction step. We probably need more default progress reporting here.
  2. More likely, the machine is just slow, and the timeout for the watcher shouldn't start until we've gotten through the download and extraction steps. We spent 3m downloading the upgrade and 1m+ extracting it here.
rdner commented 4 months ago

Another failure https://buildkite.com/elastic/elastic-agent/builds/8497#018f0c6e-f2b8-4051-9e05-c902ef25c8be