adoptium / aqa-tests

Home of test infrastructure for Adoptium builds
https://adoptium.net/aqavit
Apache License 2.0

EPIC: Provide AQA Test metrics per release #5121

Open · jiekang opened 8 months ago

jiekang commented 8 months ago

This issue tracks the efforts to provide more metrics on the AQAvit test runs on a per release basis.

The project as a whole currently tracks some useful release metrics via scorecards by Shelley:

https://github.com/adoptium/adoptium/wiki/Adoptium-Release-Scorecards
https://github.com/smlambert/scorecard

The release scorecard data is useful for understanding how well we are doing at meeting release targets and how that is trending across releases.

It would be nice to similarly provide data for test runs to help track the "health" of our test suite execution across releases (the health of the tests and their execution, which can relate to the underlying infrastructure). As a connected note, this is also a piece of the larger goal of reducing the burden on triage engineers, e.g. by highlighting machine-specific failures across releases in a different manner.

I imagine this to involve enhancing the existing Release Summary Report (RSR) which already contains most of the data (whether in the report itself or in links), and presenting it in a manner that connects the state across releases.

This proposal is open to all feedback. To start the discussion, I propose tracking AQAvit test execution data per release, formatted so that platform state across releases is easy to understand. This would contain:

smlambert commented 8 months ago

Thanks @jiekang ! Adding some comments and linking some relevant issues related to this:

> help track the "health" of our test suite

Noting an important distinction that some of the metrics under discussion will track the health of the underlying infrastructure and availability of machine resources rather than specifically the "health" of the test suites being run. We will aim to differentiate such information, in order to know where best to 'course-correct' and where to apply improvement efforts.

Related to this is the enhancement issue for RSR: https://github.com/adoptium/aqa-test-tools/issues/649

smlambert commented 8 months ago

Additional metrics worth tracking (not necessarily to be considered under this issue, just jotting some down so as not to lose them):

jiekang commented 8 months ago

> Thanks @jiekang ! Adding some comments and linking some relevant issues related to this:
>
> > help track the "health" of our test suite
>
> Noting an important distinction that some of the metrics under discussion will track the health of the underlying infrastructure and availability of machine resources rather than specifically the "health" of the test suites being run. We will aim to differentiate such information, in order to know where best to 'course-correct' and where to apply improvement efforts.

Yes, excellent point. There is some cross-boundary overlap as test execution success is sometimes closely linked to stable & consistent infrastructure configuration. I've updated the original comment to note this distinction as I think we should understand the health of the overall system.

Related to this is the enhancement issue for RSR: adoptium/aqa-test-tools#649

jiekang commented 8 months ago

> Additional metrics worth tracking (not necessarily to be considered under this issue, just jotting some down so as not to lose them):
>
>   • top-level test target execution times (including queue times waiting for machines), which help inform whether extra resources are needed (related: Assess test target execution time & define test schedule #2037)
>   • number of online, available test machines for a particular platform (measured twice, once during the dry run and once on the trigger of the release pipeline)
>     • for macOS, where Orka machines are spun up, it would be good to measure the maximum burst number of dynamic agents spun up and running during the release period (is that visible in the Orka dashboard, and is there a limit on how many we can spin up at once?)
>   • if a related issue is identified as the cause of a failure, report on the age of that issue
>   • we already gather the number of test targets excluded, but we can also gather the number of ProblemListed test cases as we enter a release period
>   • number of commits between the last aqa-tests release and the current aqa-tests release, measuring contributions and activity in the aqa-tests repository as a proxy for overall AQAvit project health (we could also measure the other 6 AQAvit repos, but aqa-tests is the central one, so it is a good measure of activity); the Eclipse Foundation also tracks the number of different companies contributing to a project, and while we do not want to duplicate their statistics, we should consider reporting those organization contribution statistics in the program plan for each of the sub-projects, Temurin, AQAvit, etc. (see https://projects.eclipse.org/projects/adoptium.aqavit/who)
>   • EPIC: Improve the contents and organization of release summary report aqa-test-tools#649 (comment)

Thanks for noting all these; I can see value in all of them!
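
As an illustrative aside on the commit-count metric quoted above: a minimal sketch, assuming git tags mark the two aqa-tests releases (the tag names below are hypothetical placeholders, not actual release refs), of how that number could be gathered. This is not part of the existing scorecard or aqastats tooling.

```python
# Minimal sketch only -- not part of the existing scorecard/aqastats tooling.
# Counts commits in a local aqa-tests clone between two release tags.
# The tag names used in the example are hypothetical placeholders.
import subprocess

def commits_between(repo_path: str, previous_ref: str, current_ref: str) -> int:
    """Return the number of commits reachable from current_ref but not previous_ref."""
    result = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--count", f"{previous_ref}..{current_ref}"],
        check=True, capture_output=True, text=True,
    )
    return int(result.stdout.strip())

# Example usage with made-up refs:
# print(commits_between("aqa-tests", "v1.0.1-release", "v1.0.2-release"))
```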

smlambert commented 8 months ago

Related to test execution stats gathering: https://github.com/smlambert/aqastats

Related to differentiating between an infra issue and a TBD issue (one that needs more triage to figure out whether it is a product, test, or infra issue): there is a feature in the test pipeline code that supports creating an errorList and temporarily marking a machine offline if certain issues are reported to the console. This is one way to start differentiating between what is an obvious infra issue and what is still an undetermined issue that requires more triage to categorize. We have not enabled it at ci.adoptium.net yet, but it would be an interesting experiment. A new route / API could potentially be added to TRSS to pull this data if present.
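
For illustration only, a minimal sketch of the errorList idea described above; the real feature lives in the test pipeline code, and the patterns below are made-up examples rather than the actual errorList contents.

```python
# Illustrative sketch of the errorList concept -- not the actual pipeline feature.
# Classifies a failed run as an obvious infra issue or an undetermined issue that
# still needs triage, based on known error signatures in the console output.
import re

# Hypothetical example signatures; the real errorList would be maintained separately.
KNOWN_INFRA_PATTERNS = [
    r"No space left on device",
    r"java\.net\.UnknownHostException",
]

def classify_failure(console_log: str) -> str:
    """Return 'infra' if the console output matches a known infra signature,
    otherwise 'undetermined' (needs more triage: product, test, or infra)."""
    for pattern in KNOWN_INFRA_PATTERNS:
        if re.search(pattern, console_log):
            return "infra"
    return "undetermined"

# Example usage:
# print(classify_failure("FATAL: java.net.UnknownHostException: ci.example.org"))
```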

jiekang commented 8 months ago

Additional notes:

Related issue: https://github.com/adoptium/aqa-tests/issues/4278

smlambert commented 8 months ago

As discussed in PMC call today, I will create a new repo to encompass moving over the scorecards scripts (from smlambert/scorecards, and an adapted version of scripts from smlambert/aqastats) and new metrics we will design and intend to add for all Adoptium sub-projects as shown in the Adoptium project hierarchy below:

jiekang commented 7 months ago

A first draft on data to be collected per release:

Release: 
[...]
    Date: <date>
    Execution Time: <total time>

    Version:
    [...]
        SCM Ref: <scm_ref>
        Test Targets: <total, executed, passed, failed, disabled, skipped>
        Tests: <total, executed, passed, failed, disabled, skipped>
        Manual reruns: <total>
        Execution Time: <total time>

            Platform:
            [...]
            OS & Arch: <os> <arch>
                Test Targets: <total, executed, passed, failed, disabled, skipped>
                Tests: <total, executed, passed, failed, disabled, skipped>
                Manual reruns: <total>
                Machines Available (Dry Run): <count>
                Machines Available (Release): <count>
                Execution Time: <total time>

                Test Target:
                [...]
                    Name: <name>
                    Execution Time: <total time>

jiekang commented 7 months ago

Clarifying: I think there is another set of data that has been discussed for gathering that doesn't fit into the same bucket, but is definitely still under consideration. E.g. Test effectiveness, Related issue reporting (age, etc.), repository activity, contribution statistics, etc.

jiekang commented 7 months ago

Also, immediately after posting: I think the Platform and Version hierarchy should be swapped for the Machines Available data to make sense.

smlambert commented 7 months ago

:)

Appreciate your initial care and thoughts on this feature @jiekang ! Thank you!

jiekang commented 7 months ago

So with the hierarchy flipped it is:

Release: 
[...]
    Date: <date>
    Execution Time: <total time>

    Platform:
    [...]
    OS & Arch: <os> <arch>
        Test Targets: <total, executed, passed, failed, disabled, skipped>
        Tests: <total, executed, passed, failed, disabled, skipped>
        Manual reruns: <total>
        Machines Available (Dry Run): <count>
        Machines Available (Release): <count>
        Execution Time: <total time>

        Version:
        [...]
            SCM Ref: <scm_ref>
            Test Targets: <total, executed, passed, failed, disabled, skipped>
            Tests: <total, executed, passed, failed, disabled, skipped>
            Manual reruns: <total>
            Execution Time: <total time>

            Test Target:
            [...]
                Name: <name>
                Execution Time: <total time>
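
For illustration, one possible in-memory representation of the flipped hierarchy above as Python dataclasses; the field names mirror the draft and are a sketch, not a committed schema.

```python
# One possible representation of the flipped hierarchy above, for illustration only.
# Field names mirror the draft; this is a sketch, not a committed schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Counts:
    """total, executed, passed, failed, disabled, skipped."""
    total: int = 0
    executed: int = 0
    passed: int = 0
    failed: int = 0
    disabled: int = 0
    skipped: int = 0

@dataclass
class TestTarget:
    name: str
    execution_time: str  # total time, e.g. "1h 23m"

@dataclass
class Version:
    scm_ref: str
    test_targets: Counts = field(default_factory=Counts)
    tests: Counts = field(default_factory=Counts)
    manual_reruns: int = 0
    execution_time: str = ""
    targets: List[TestTarget] = field(default_factory=list)

@dataclass
class Platform:
    os: str
    arch: str
    test_targets: Counts = field(default_factory=Counts)
    tests: Counts = field(default_factory=Counts)
    manual_reruns: int = 0
    machines_available_dry_run: int = 0
    machines_available_release: int = 0
    execution_time: str = ""
    versions: List[Version] = field(default_factory=list)

@dataclass
class Release:
    date: str
    execution_time: str
    platforms: List[Platform] = field(default_factory=list)
```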

jiekang commented 4 months ago

Just noting the code is in development here:

https://github.com/jiekang/scorecard/tree/trss-statistics

It's now fully functional with a diff command to compare between two releases.
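
As a rough illustration of what such a comparison might compute (this is not the actual diff command in the trss-statistics branch, whose interface may differ), here is a sketch that compares failed-test counts per platform between two releases, using a deliberately simplified dict shape.

```python
# Illustrative sketch only -- not the actual `diff` command in the trss-statistics
# branch. Compares failed-test counts per platform between two releases, using a
# simplified, made-up shape: {"<os> <arch>": <failed count>}.
from typing import Dict, Iterator, Tuple

def diff_failed_counts(
    old_release: Dict[str, int], new_release: Dict[str, int]
) -> Iterator[Tuple[str, int, int, int]]:
    """Yield (platform, old_failed, new_failed, delta) for platforms in both releases."""
    for platform, new_failed in sorted(new_release.items()):
        if platform in old_release:
            old_failed = old_release[platform]
            yield platform, old_failed, new_failed, new_failed - old_failed

# Example with made-up numbers:
# for row in diff_failed_counts({"linux x64": 3, "aix ppc64": 2}, {"linux x64": 1}):
#     print(row)   # ('linux x64', 3, 1, -2)
```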

Remaining items:

smlambert commented 3 months ago

Adding ideas:

YosiElias commented 1 week ago

/assign

liavweiss commented 1 week ago

/assign