
Add tool for analyzing and reporting random CDash test failures #600

Open achauphan opened 6 months ago

achauphan commented 6 months ago

Related issues

Description

Random test failures can bring down an entire CI iteration on a regular basis and waste resources whenever a retest is requested just to pass a pull request's checks.

Spotting a randomly failing test requires a lot of manual CDash querying and analysis by the developer. However, in most cases, a developer may not have the time to trace, identify, and report the randomly failing test, and instead will opt to ignore it in favor of requesting a retest, leading to the wasted resources noted above. This lack of reporting also leads to a bigger issue: it allows the randomly failing test to linger in the code base and continue to affect developers in the future.

Proposed Solution

This issue proposes a new tool (which for now would live inside of TriBITS under tribits/ci_support) that can run automatically to query, scrape, analyze, and report tests that are deemed to be "randomly failing" to an operations team via email or an automated issue creation in the repository.

A randomly failing test is defined as a test that intermittently reports passing or failing without any changes to the topic or target branch being tested (the topic and target tip SHA1s are the same) between CI testing iterations.
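As a rough illustration of that definition, a minimal Python sketch might group per-test results by build configuration and version and flag any test that reported both outcomes for the same version (the dict keys used here are hypothetical, not an existing CDash or TriBITS data structure):

```python
from collections import defaultdict

def find_randomly_failing_tests(test_results):
    """Flag tests that both passed and failed for the same build/version pair.

    'test_results' is a list of dicts with hypothetical keys:
      'buildName': core build name (same CI build configuration),
      'version'  : project version string for that CI iteration,
      'testName' : name of the test,
      'status'   : 'Passed' or 'Failed'.
    """
    outcomes = defaultdict(set)
    for result in test_results:
        key = (result['buildName'], result['version'], result['testName'])
        outcomes[key].add(result['status'])
    # A test is flagged if the same build configuration and version saw both outcomes
    return sorted(key for key, seen in outcomes.items()
                  if {'Passed', 'Failed'} <= seen)
```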

Fortunately, there is a lot of existing work inside of tribits/ci_support that can be leveraged to build this tool in Python. Notably, the module CreateIssueTrackerFromCDashQuery.py, which is used in the template example example_test_failure_github_issue.py, along with the module CDashQueryAnalyzeReport.py, which contains most of the heavy CDash querying functionality. Thus, after reusing those modules, the core work will be to implement the algorithm that determines a random failure and to make it customizable on a per-project basis.

The goal is for this tool to be able to look for randomly failing tests for any project that posts its test results to CDash. The specifics of how the tool gathers the version information of the builds on CDash will be unique to each project and will require per-project implementation.

Ideally, this tool can be extended to analyze and report randomly failing configures, builds, and tests; however, starting with randomly failing tests should produce a framework that can be reused for those other cases.

Requirements

bartlettroscoe commented 5 months ago

CC: @sebrowne

@achauphan, one thing that occurred to me is that this tool will need to allow the use of a build-name modifier that takes in the build name from CDash and provides a name used to determine sequential builds for the Trilinos PR and nightly testing system. For example, all of the Trilinos build names have the prefix PR-<prID>-test- and the suffix -<jenkinsJobID> that must be removed from the build name to get the core build name. For example, the builds:

are a sequence of the same build, but CDash does not recognize that because the build names are different. To identify a related sequence of builds, you need to at least remove the suffix -<jenkinsJobID> to give:

Then, if the target and topic branches are the same and a test goes from passing to failing, you can classify this as a random test failure.
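For instance, a small hypothetical helper along these lines could strip the Trilinos-style prefix and suffix (assuming both the PR ID and the Jenkins job ID are purely numeric) before grouping builds:

```python
import re

def normalize_trilinos_build_name(build_name):
    # Strip the 'PR-<prID>-test-' prefix and the trailing '-<jenkinsJobID>'
    # (assumed here to be numeric) so that sequential builds of the same
    # configuration compare equal.
    build_name = re.sub(r'^PR-\d+-test-', '', build_name)
    build_name = re.sub(r'-\d+$', '', build_name)
    return build_name
```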

You can provide the means for adjusting the build names using the Strategy Design Pattern.

So the two areas of variability for such a tool that will be project-specific (and therefore need to be abstracted out and pulled in as Strategy objects) are:

  1. How to extract the version of the project for the purposes of comparing builds. (In the case of Trilinos with merge commits, you can do that by concatenating the target and topic branch SHA1s scraped from the configure output and putting them into a string like <sha1-target>-<sha1-topic>.) Then the Python code just needs to compare the string values for this "version" to determine if the versions are the same (and it does not matter how that "version" was constructed or even what it represents).

  2. How to edit the build names so we can determine sequences of the same build configurations. (In the case of Trilinos, at least remove the suffix -<jenkinsJobID>.)

Those can be two separate strategy objects given to the Python class(es) that are doing the data processing and analysis.
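A minimal sketch of what those two Strategy objects could look like (the class names, method signatures, and buildData dict keys here are hypothetical, not existing TriBITS code):

```python
import re
from abc import ABC, abstractmethod

class ProjectVersionExtractor(ABC):
    """Strategy 1: produce an opaque 'version' string for a CDash build."""
    @abstractmethod
    def getVersion(self, buildData): ...

class BuildNameNormalizer(ABC):
    """Strategy 2: reduce a raw CDash build name to its core build name."""
    @abstractmethod
    def normalize(self, buildName): ...

class TrilinosVersionExtractor(ProjectVersionExtractor):
    def getVersion(self, buildData):
        # Concatenate the target and topic SHA1s scraped from the configure output
        return buildData['targetSha1'] + '-' + buildData['topicSha1']

class TrilinosBuildNameNormalizer(BuildNameNormalizer):
    def normalize(self, buildName):
        buildName = re.sub(r'^PR-\d+-test-', '', buildName)
        return re.sub(r'-\d+$', '', buildName)

class RandomFailureAnalyzer:
    """Analysis driver that is handed the two project-specific strategies."""
    def __init__(self, versionExtractor, buildNameNormalizer):
        self.versionExtractor = versionExtractor
        self.buildNameNormalizer = buildNameNormalizer

    def buildKey(self, buildData):
        # Builds with the same key are treated as a sequence of the same build
        return (self.buildNameNormalizer.normalize(buildData['buildName']),
                self.versionExtractor.getVersion(buildData))
```

The analysis code then depends only on the two abstract interfaces, and each project plugs in its own implementations.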

bartlettroscoe commented 5 months ago

@achauphan and @sebrowne, the Trilinos PR that brings in TriBITS PR #603 is:

We can work on further refactorings and feature enhancements later.

I can see where this may be useful for some metrics for other projects that submit to CDash, so I will do those refactorings as needed.