elastic / elastic-stack-installers

Windows MSI packages for Elastic stack
Apache License 2.0
2 stars 16 forks source link

Selectively skip and fix tests after PR#212 #235

Closed dliappis closed 9 months ago

dliappis commented 9 months ago

PR #212 made some changes on main that require the presence of MANIFEST_URL. However, the pipeline is still configured to run the same script when a PR (from an upstream branch) is raised and it currently fails.

This commit skips tests when triggered from another pipeline (unified release), and allows running tests when triggered by PR creation by picking up a suitable MANIFEST_URL.

dliappis commented 9 months ago

This is ready for review. Because of the multiple ways that this pipeline may run -- via unified release trigger, via branch commits or pr commits -- we need to support a different way of running depending on the occasion:

1) If we open the PR it (e.g. the latest commit here: https://buildkite.com/elastic/elastic-stack-installers/builds/3720) it looks like:

![image](https://github.com/elastic/elastic-stack-installers/assets/1754575/ec897541-f881-42d5-9ef6-93e91246fab7)

2) If the pipeline gets triggered by the unified release it should run as before (and allow to go through the DRA publish).

3) Special handling is done for staging builds:

- Unified release may trigger the pipeline but only for non main branches and specifically if they target `[0-9]+.[0-9]+` matching branches. Tests are skipped.
- PRs are only allowed to run the steps (including tests) if they target non-main branch (the reason is that there is no https://staging.elastic.co/latest/main.json, so we can't deduce a valid `MANIFEST_URL`)

4) We've removed runs based on push events -- e.g. cutting a new branch from main -- (which used to be allowed before); this is because it complicates unnecessarily the conditionals. This use case (and more) will be covered in a follow up PR, through a scheduled job that runs at regular intervals.

Example run: https://buildkite.com/elastic/elastic-stack-installers/builds/3720 using this PR

dliappis commented 9 months ago

@strawgate the tests seem to be failing recently with unknown flag: --delayenroll, see https://buildkite.com/elastic/elastic-stack-installers/builds/3720#018d785e-7668-4794-8a0f-2f39dbe52b5e/3827-4851

any thoughts on this?

strawgate commented 9 months ago

@strawgate the tests seem to be failing recently with unknown flag: --delayenroll, see https://buildkite.com/elastic/elastic-stack-installers/builds/3720#018d785e-7668-4794-8a0f-2f39dbe52b5e/3827-4851

any thoughts on this?

That's just the debug output for a negative test case. The failing test case is:


[-] Elastic Agent MSI Installer.Basic Tests for Standalone mode.Blocks downgrades and upgrades in Standalone mode 44.52s (44.52s\|4ms)
--
  | [0] Expected $true, but got $false.
  | at Has-AgentStandaloneLog \| Should -BeTrue, C:\buildkite-agent\builds\bk-agent-prod-gcp-1707121483772742297\elastic\elastic-stack-installers\src\agent-qa\msi.tests.ps1:28
  | at <ScriptBlock>, C:\buildkite-agent\builds\bk-agent-prod-gcp-1707121483772742297\elastic\elastic-stack-installers\src\agent-qa\msi.tests.ps1:28
  | at Assert-AgentHealthy, C:\buildkite-agent\builds\bk-agent-prod-gcp-1707121483772742297\elastic\elastic-stack-installers\src\agent-qa\msi.tests.ps1:119
  | at <ScriptBlock>, C:\buildkite-agent\builds\bk-agent-prod-gcp-1707121483772742297\elastic\elastic-stack-installers\src\agent-qa\msi.tests.ps1:177
  | [1] Expected $false, because The agent should have been cleaned up already, but got $true.
  | at Is-AgentBinaryPresent \| Should -BeFalse -Because "The agent should have been cleaned up already", C:\buildkite-agent\builds\bk-agent-prod-gcp-1707121483772742297\elastic\elastic-stack-installers\src\agent-qa\msi.tests.ps1:72
  | at Check-AgentRemnants, C:\buildkite-agent\builds\bk-agent-prod-gcp-1707121483772742297\elastic\elastic-stack-installers\src\agent-qa\msi.tests.ps1:72
  | at <ScriptBlock>, C:\buildkite-agent\builds\bk-agent-prod-gcp-1707121483772742297\elastic\elastic-stack-installers\src\agent-qa\msi.tests.ps1:109

Looking into this

strawgate commented 9 months ago

Pushed a commit, i think it was just a race condition between the agent starting up and us checking its log file

strawgate commented 9 months ago

Build failed with:


Starting a Gradle Daemon, 1 incompatible Daemon could not be reused, use --status for details
--
  |  
  | FAILURE: Build failed with an exception.
  |  
  | * What went wrong:
  | The specified project directory '/release/project-configs/fix-pr-tests' does not exist.
  |  
  | * Try:
  | > Run with --stacktrace option to get the stack trace.
  | > Run with --info or --debug option to get more log output.
  | > Run with --scan to get full insights.
  |  
  | * Get more help at https://help.gradle.org
  |  
  | BUILD FAILED in 2s
  | 🚨 Error: The command exited with status 1

re-running build now

dliappis commented 9 months ago

Build failed with:


Starting a Gradle Daemon, 1 incompatible Daemon could not be reused, use --status for details
--
  |  
  | FAILURE: Build failed with an exception.
  |  
  | * What went wrong:
  | The specified project directory '/release/project-configs/fix-pr-tests' does not exist.
  |  
  | * Try:
  | > Run with --stacktrace option to get the stack trace.
  | > Run with --info or --debug option to get more log output.
  | > Run with --scan to get full insights.
  |  
  | * Get more help at https://help.gradle.org
  |  
  | BUILD FAILED in 2s
  | 🚨 Error: The command exited with status 1

re-running build now

This is an unrelated failure in the DRA publish step: we allow running the publish step in dry run mode from within PRs, but it seems to me that we shouldn't: the unified release expected certain hard coded directories (e.g. 8.13, main etc.) so maybe the best approach would be to skip the DRA step altogether when running from a PR. @DaveSys911 what do you think?

dliappis commented 9 months ago

@strawgate FYI we are still having failures related to tests after your latest commits: https://buildkite.com/elastic/elastic-stack-installers/builds/3747

strawgate commented 9 months ago

@strawgate FYI we are still having failures related to tests after your latest commits: https://buildkite.com/elastic/elastic-stack-installers/builds/3747

Yep will look today, my last commit was to add some additional debug info

strawgate commented 9 months ago

It looks like the last test run passed, will rerun a couple of times to make sure it's not flaky

dliappis commented 9 months ago

@strawgate your latest commit seems to have fixed the tests: https://buildkite.com/elastic/elastic-stack-installers/builds/3750#018d7eb8-cb49-4989-ad1b-f5314ec4f0d6 The DRA publish step is failing because we are running from a PR (and the branch in the PR is not main or 7.x or 8.x). I pushed a change to skip this test, if running from within the context of a PR. Hopefully this will lead to 100% successful PR checks.

It looks like the last test run passed, will rerun a couple of times to make sure it's not flaky

Sounds like a great idea!

dliappis commented 9 months ago

It looks like the last test run passed, will rerun a couple of times to make sure it's not flaky

We got another fresh failure in https://buildkite.com/elastic/elastic-stack-installers/builds/3751#018d7ee0-3fea-4c2a-abfd-0d6f6f6969e3

strawgate commented 9 months ago

fingers crossed on this one -- if this last change doesn't fix it i'll disable the test case that triggers the race condition and fix it in a separate PR

strawgate commented 9 months ago

Tests passed with latest change, running another retest to check for flaky tests

dliappis commented 9 months ago

@strawgate I think we are good here. I just triggered one more build to check for flakiness: https://buildkite.com/elastic/elastic-stack-installers/builds/3794

could you provide a final review so that we can merge?

Also @amitkanfer could please take a look again?

dliappis commented 9 months ago

buildkite test this

strawgate commented 9 months ago

could you provide a final review so that we can merge?

done!