elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[ci] Provide some mechanism to follow the status of a specific FTR config #131879

Open spalger opened 2 years ago

spalger commented 2 years ago

When CI was broken up manually into CI Groups, you could watch the specific CI Group that you knew included the tests you were working on, and when it passed you knew your work was done. We lost that ability when we moved to dynamically allocated FTR Config Groups: configs move around between builds and land in anonymous `FTR Configs #X/Y` groups, so the only option now is to wait for CI to finish completely.


I have a couple of ideas for how we might address this, but I'm open to suggestions:

  1. Automatically detect FTR configs which failed (not flaky) in the previous build of a PR and run them in a separate group
  2. Allow authors to list specific configs which they want to highlight in the description of a PR and run those "configs of interest" in a separate group (see the sketch after this list)
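
For illustration, here is a minimal sketch of how option 2 could be parsed out of a PR description, assuming a hypothetical `ftr-configs-of-interest` comment block; both the marker name and the format are made up for this sketch, not an existing Kibana convention:

```typescript
// Hypothetical parser for a "configs of interest" list in a PR description.
// Assumes authors add a block like:
//
//   <!-- ftr-configs-of-interest
//   x-pack/test/functional/apps/discover/config.ts
//   test/api_integration/config.js
//   -->
//
// The marker name and block format are illustrative assumptions.
export function parseConfigsOfInterest(prDescription: string): string[] {
  const match = prDescription.match(/<!--\s*ftr-configs-of-interest\s*\n([\s\S]*?)-->/);
  if (!match) {
    return [];
  }

  return match[1]
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}
```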

These "separate groups" would really be group types, which are planned and automatically split up based on the expected execution time of those tests. They would often only include a single config but importantly they would report a unique status item to github and run in a separate job in Buildkite so the status of those interesting configs could be watched by PR authors.

We should be able to do just about all of this logic in the ci-stats API, but we will need to update the kibana-buildkite-library so it uploads the right pipeline based on the results.
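
To make the Buildkite side more concrete, a rough sketch of what "uploading the right pipeline" could look like once the interesting configs are known is below. The step shape is plain Buildkite pipeline JSON; the commands, queue name, and helper function are placeholders rather than the actual kibana-buildkite-library API:

```typescript
// Illustrative only: emits a pipeline where each "config of interest" gets its
// own step (and therefore its own status), while everything else runs in the
// usual anonymous FTR config groups.
interface BuildkiteStep {
  label: string;
  command: string;
  agents: { queue: string };
  env?: Record<string, string>;
}

export function buildFtrPipeline(
  configsOfInterest: string[],
  remainingGroupCount: number
): { steps: BuildkiteStep[] } {
  const interestSteps: BuildkiteStep[] = configsOfInterest.map((config) => ({
    // One dedicated step per interesting config, labeled by config path.
    label: `FTR: ${config}`,
    command: `node scripts/functional_tests --config ${config}`, // placeholder command
    agents: { queue: 'ftr-workers' }, // placeholder queue name
    env: { FTR_CONFIG: config },
  }));

  const groupSteps: BuildkiteStep[] = Array.from({ length: remainingGroupCount }, (_, i) => ({
    // The remaining configs keep running in anonymous groups.
    label: `FTR Configs #${i + 1}/${remainingGroupCount}`,
    command: 'node scripts/run_ftr_config_group', // placeholder command
    agents: { queue: 'ftr-workers' },
    env: { FTR_CONFIG_GROUP: String(i + 1) },
  }));

  return { steps: [...interestSteps, ...groupSteps] };
}

// The resulting JSON could then be piped to `buildkite-agent pipeline upload`.
```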

Thoughts?

elasticmachine commented 2 years ago

Pinging @elastic/kibana-operations (Team:Operations)

mattkime commented 2 years ago

Broadly speaking, working on APIs differs from working on Kibana app functionality: in one case you know which suites you're interested in, and in the other you really have no idea what might break.

Placing previously failed test runs in a separate group could certainly give faster feedback.

Suggestion: use a GitHub check for each FTR config. This would be more granular than the suites we had before, but also more meaningful.
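
As a concrete illustration of this idea, publishing one commit status per FTR config could look roughly like the sketch below using Octokit; the `ftr-config/...` context naming and the reporting entry point are assumptions, not an existing convention:

```typescript
import { Octokit } from '@octokit/rest';

// Sketch: report one GitHub commit status per FTR config so PR authors can
// watch the specific config they care about. Context and description formats
// here are illustrative.
export async function reportFtrConfigStatus(
  octokit: Octokit,
  params: { sha: string; config: string; passed: boolean; buildUrl: string }
) {
  await octokit.rest.repos.createCommitStatus({
    owner: 'elastic',
    repo: 'kibana',
    sha: params.sha,
    state: params.passed ? 'success' : 'failure',
    context: `ftr-config/${params.config}`,
    description: params.passed ? 'FTR config passed' : 'FTR config failed',
    target_url: params.buildUrl,
  });
}
```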

spalger commented 2 years ago

I discussed this with @mattkime and @brianseeders today: we're going to try running any FTR config that is expected to take more than 2-3 minutes in its own worker, and all the remaining configs in small FTR config groups (mostly FTR configs where all tests are skipped). The hope is to reach a compromise where logs are as accessible as possible, CI can continue to scale while reducing costs, and users get a better experience because statuses will mostly be assigned to specific FTR configs and links will take you directly to the log output of that config.
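
A rough sketch of that planning logic, assuming ci-stats can supply an expected runtime per config. The 3-minute threshold comes from the discussion above; the per-group time budget and the simple packing strategy are illustrative assumptions:

```typescript
interface FtrConfig {
  path: string;
  expectedMinutes: number; // e.g. derived from ci-stats historical data
}

interface FtrPlan {
  dedicatedConfigs: string[]; // one worker per config
  smallGroups: string[][]; // remaining configs packed into shared workers
}

// Configs expected to run longer than the threshold get their own worker (and
// their own status); the rest, often configs where every test is skipped, are
// packed into small shared groups.
export function planFtrWorkers(
  configs: FtrConfig[],
  thresholdMinutes = 3,
  groupBudgetMinutes = 10 // assumed budget per shared group
): FtrPlan {
  const dedicated = configs.filter((c) => c.expectedMinutes > thresholdMinutes);
  const small = configs.filter((c) => c.expectedMinutes <= thresholdMinutes);

  const smallGroups: string[][] = [];
  let current: string[] = [];
  let currentMinutes = 0;

  for (const config of small) {
    if (current.length > 0 && currentMinutes + config.expectedMinutes > groupBudgetMinutes) {
      smallGroups.push(current);
      current = [];
      currentMinutes = 0;
    }
    current.push(config.path);
    currentMinutes += config.expectedMinutes;
  }
  if (current.length > 0) {
    smallGroups.push(current);
  }

  return { dedicatedConfigs: dedicated.map((c) => c.path), smallGroups };
}
```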

pheyos commented 2 years ago

I think that besides watching previously failed tests, another aspect is the addition of new tests as part of a PR, where the author has a particular interest in seeing them execute successfully and maybe also in their execution time. I like the idea of running many of the configs in separate workers, which makes it possible to follow the test groups more closely.