adoptium / aqa-tests

Home of test infrastructure for Adoptium builds
https://adoptium.net/aqavit
Apache License 2.0

EPIC: Provide AQA Test metrics per release #5121

Open · jiekang opened 8 months ago

jiekang commented 8 months ago

This issue tracks the efforts to provide more metrics on the AQAvit test runs on a per release basis.

The project as a whole currently tracks some useful release metrics via scorecards by Shelley:

https://github.com/adoptium/adoptium/wiki/Adoptium-Release-Scorecards
https://github.com/smlambert/scorecard

The release scorecard data is useful for understanding how well we are doing at meeting release targets and how that is trending across releases.

It would be nice to similarly provide data for test runs to help track the "health" of our test suite execution across releases (the health of the tests and their execution, which can relate to the underlying infrastructure). As a connected note, this is also a piece of the larger goal of reducing the burden on triage engineers, e.g. by highlighting machine-specific failures across releases in a different manner.

I imagine this to involve enhancing the existing Release Summary Report (RSR) which already contains most of the data (whether in the report itself or in links), and presenting it in a manner that connects the state across releases.

This proposal is open to all feedback. To start the discussion, I propose tracking AQAvit test execution data per release, formatted so that platform state across releases is easy to understand. This would contain:

smlambert commented 8 months ago

Thanks @jiekang ! Adding some comments and linking some relevant issues related to this:

> help track the "health" of our test suite

Noting an important distinction that some of the metrics under discussion will track the health of the underlying infrastructure and availability of machine resources rather than specifically the "health" of the test suites being run. We will aim to differentiate such information, in order to know where best to 'course-correct' and where to apply improvement efforts.

Related to this is the enhancement issue for RSR: https://github.com/adoptium/aqa-test-tools/issues/649

smlambert commented 8 months ago

Additional metrics worth tracking (not necessarily to be considered under this issue, just jotting some down so as not to lose them):

jiekang commented 8 months ago

> Thanks @jiekang ! Adding some comments and linking some relevant issues related to this:
>
> > help track the "health" of our test suite
>
> Noting an important distinction that some of the metrics under discussion will track the health of the underlying infrastructure and availability of machine resources rather than specifically the "health" of the test suites being run. We will aim to differentiate such information, in order to know where best to 'course-correct' and where to apply improvement efforts.

Yes, excellent point. There is some cross-boundary overlap as test execution success is sometimes closely linked to stable & consistent infrastructure configuration. I've updated the original comment to note this distinction as I think we should understand the health of the overall system.

Related to this is the enhancement issue for RSR: adoptium/aqa-test-tools#649

jiekang commented 8 months ago

> Additional metrics worth tracking (not necessarily to be considered under this issue, just jotting some down so as not to lose them):
>
>   • top-level test target execution times (including queue times waiting for machines), which help inform whether extra resources are needed (related: Assess test target execution time & define test schedule #2037)
>   • number of online, available test machines for a particular platform (measured twice, once during the dry run and once on the trigger of the release pipeline)
>     • for macOS, where Orka machines are spun up, it would be good to measure the maximum burst number of dynamic agents spun up and running during the release period (is that visible in the Orka dashboard, and is there a limit on how many we can spin up at once?)
>   • if a related issue is identified as the cause of a failure, report on the age of that issue
>   • we already gather the number of test targets excluded, but we can also gather the number of ProblemListed test cases as we enter a release period
>   • number of commits between the last aqa-tests release and the current aqa-tests release, measuring contributions and activity in the aqa-tests repository as a proxy for overall AQAvit project health (we could also measure the other 6 AQAvit repos, but aqa-tests is the central one, so it is a good measure of activity); the Eclipse Foundation also tracks the number of different companies contributing to a project, and while we do not want to duplicate their statistics, we should consider reporting those organization contribution statistics in the program plan for each of the sub-projects, Temurin, AQAvit, etc. (see https://projects.eclipse.org/projects/adoptium.aqavit/who)
>   • EPIC: Improve the contents and organization of release summary report aqa-test-tools#649 (comment)

Thanks for noting all these; I can see value in all of them!
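
As an illustrative aside on the commit-count metric quoted above: a minimal sketch, assuming git tags mark the two aqa-tests releases (the tag names below are hypothetical placeholders, not actual release refs), of how that number could be gathered. This is not part of the existing scorecard or aqastats tooling.

```python
# Minimal sketch only -- not part of the existing scorecard/aqastats tooling.
# Counts commits in a local aqa-tests clone between two release tags.
# The tag names used in the example are hypothetical placeholders.
import subprocess

def commits_between(repo_path: str, previous_ref: str, current_ref: str) -> int:
    """Return the number of commits reachable from current_ref but not previous_ref."""
    result = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--count", f"{previous_ref}..{current_ref}"],
        check=True, capture_output=True, text=True,
    )
    return int(result.stdout.strip())

# Example usage with made-up refs:
# print(commits_between("aqa-tests", "v1.0.1-release", "v1.0.2-release"))
```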

smlambert commented 8 months ago

Related to test execution stats gathering: https://github.com/smlambert/aqastats

Related to differentiating between an infra issue and a TBD issue (one that needs more triage to figure out whether it is a product, test, or infra issue): there is a feature in the test pipeline code that supports creating an errorList and temporarily marking a machine offline if certain issues are reported to the console. This is one way to start differentiating between what is an obvious infra issue and what is still an undetermined issue that requires more triage to categorize. We have not enabled it at ci.adoptium.net yet, but it would be an interesting experiment. A new route / API could potentially be added to TRSS to pull this data if present.
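
For illustration only, a minimal sketch of the errorList idea described above; the real feature lives in the test pipeline code, and the patterns below are made-up examples rather than the actual errorList contents.

```python
# Illustrative sketch of the errorList concept -- not the actual pipeline feature.
# Classifies a failed run as an obvious infra issue or an undetermined issue that
# still needs triage, based on known error signatures in the console output.
import re

# Hypothetical example signatures; the real errorList would be maintained separately.
KNOWN_INFRA_PATTERNS = [
    r"No space left on device",
    r"java\.net\.UnknownHostException",
]

def classify_failure(console_log: str) -> str:
    """Return 'infra' if the console output matches a known infra signature,
    otherwise 'undetermined' (needs more triage: product, test, or infra)."""
    for pattern in KNOWN_INFRA_PATTERNS:
        if re.search(pattern, console_log):
            return "infra"
    return "undetermined"

# Example usage:
# print(classify_failure("FATAL: java.net.UnknownHostException: ci.example.org"))
```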

jiekang commented 8 months ago

Additional notes:

Related issue: https://github.com/adoptium/aqa-tests/issues/4278

smlambert commented 8 months ago

As discussed in PMC call today, I will create a new repo to encompass moving over the scorecards scripts (from smlambert/scorecards, and an adapted version of scripts from smlambert/aqastats) and new metrics we will design and intend to add for all Adoptium sub-projects as shown in the Adoptium project hierarchy below:

jiekang commented 7 months ago

A first draft on data to be collected per release:

Release: 
[...]
    Date: <date>
    Execution Time: <total time>

    Version:
    [...]
        SCM Ref: <scm_ref>
        Test Targets: <total, executed, passed, failed, disabled, skipped>
        Tests: <total, executed, passed, failed, disabled, skipped>
        Manual reruns: <total>
        Execution Time: <total time>

            Platform:
            [...]
            OS & Arch: <os> <arch>
                Test Targets: <total, executed, passed, failed, disabled, skipped>
                Tests: <total, executed, passed, failed, disabled, skipped>
                Manual reruns: <total>
                Machines Available (Dry Run): <count>
                Machines Available (Release): <count>
                Execution Time: <total time>

                Test Target:
                [...]
                    Name: <name>
                    Execution Time: <total time>

jiekang commented 7 months ago

Clarifying: I think there is another set of data that has been discussed for gathering that doesn't fit into the same bucket, but is definitely still under consideration. E.g. Test effectiveness, Related issue reporting (age, etc.), repository activity, contribution statistics, etc.

jiekang commented 7 months ago

Also, immediately after posting: I think the Platform and Version hierarchy should be swapped for the Machines Available data to make sense.

smlambert commented 7 months ago

:)

Appreciate your initial care and thoughts on this feature @jiekang ! Thank you!

jiekang commented 7 months ago

So with the hierarchy flipped it is:

Release: 
[...]
    Date: <date>
    Execution Time: <total time>

    Platform:
    [...]
    OS & Arch: <os> <arch>
        Test Targets: <total, executed, passed, failed, disabled, skipped>
        Tests: <total, executed, passed, failed, disabled, skipped>
        Manual reruns: <total>
        Machines Available (Dry Run): <count>
        Machines Available (Release): <count>
        Execution Time: <total time>

        Version:
        [...]
            SCM Ref: <scm_ref>
            Test Targets: <total, executed, passed, failed, disabled, skipped>
            Tests: <total, executed, passed, failed, disabled, skipped>
            Manual reruns: <total>
            Execution Time: <total time>

            Test Target:
            [...]
                Name: <name>
                Execution Time: <total time>
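
For illustration, one possible in-memory representation of the flipped hierarchy above as Python dataclasses; the field names mirror the draft and are a sketch, not a committed schema.

```python
# One possible representation of the flipped hierarchy above, for illustration only.
# Field names mirror the draft; this is a sketch, not a committed schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Counts:
    """total, executed, passed, failed, disabled, skipped."""
    total: int = 0
    executed: int = 0
    passed: int = 0
    failed: int = 0
    disabled: int = 0
    skipped: int = 0

@dataclass
class TestTarget:
    name: str
    execution_time: str  # total time, e.g. "1h 23m"

@dataclass
class Version:
    scm_ref: str
    test_targets: Counts = field(default_factory=Counts)
    tests: Counts = field(default_factory=Counts)
    manual_reruns: int = 0
    execution_time: str = ""
    targets: List[TestTarget] = field(default_factory=list)

@dataclass
class Platform:
    os: str
    arch: str
    test_targets: Counts = field(default_factory=Counts)
    tests: Counts = field(default_factory=Counts)
    manual_reruns: int = 0
    machines_available_dry_run: int = 0
    machines_available_release: int = 0
    execution_time: str = ""
    versions: List[Version] = field(default_factory=list)

@dataclass
class Release:
    date: str
    execution_time: str
    platforms: List[Platform] = field(default_factory=list)
```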

jiekang commented 4 months ago

Just noting the code is in development here:

https://github.com/jiekang/scorecard/tree/trss-statistics

It's now fully functional with a diff command to compare between two releases.
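
As a rough illustration of what such a comparison might compute (this is not the actual diff command in the trss-statistics branch, whose interface may differ), here is a sketch that compares failed-test counts per platform between two releases, using a deliberately simplified dict shape.

```python
# Illustrative sketch only -- not the actual `diff` command in the trss-statistics
# branch. Compares failed-test counts per platform between two releases, using a
# simplified, made-up shape: {"<os> <arch>": <failed count>}.
from typing import Dict, Iterator, Tuple

def diff_failed_counts(
    old_release: Dict[str, int], new_release: Dict[str, int]
) -> Iterator[Tuple[str, int, int, int]]:
    """Yield (platform, old_failed, new_failed, delta) for platforms in both releases."""
    for platform, new_failed in sorted(new_release.items()):
        if platform in old_release:
            old_failed = old_release[platform]
            yield platform, old_failed, new_failed, new_failed - old_failed

# Example with made-up numbers:
# for row in diff_failed_counts({"linux x64": 3, "aix ppc64": 2}, {"linux x64": 1}):
#     print(row)   # ('linux x64', 3, 1, -2)
```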

Remaining items:

smlambert commented 3 months ago

Adding ideas:

YosiElias commented 1 week ago

/assign

liavweiss commented 1 week ago

/assign