bartlettroscoe opened 3 years ago
One proposed solution would be to add an argument called something like:

`--show-history-for-tests-without-issue-trackers-failed-in-last-x-days=<num-days>`

Then, any tests that failed at least once in the last `<num-days>` days would be listed with their test history in a table called something like:

Tests without issue trackers recently failed (at least once last `<num-days>` days, limited to `<max-rows>`): twoirf=`<twoirf>`
For example, for the Trilinos Secondary builds running with `--show-history-for-tests-without-issue-trackers-failed-in-last-x-days=7`, the email sent out would show a table like:
Site | Build Name | Test Name | Status | Details | Days since last Failed | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
---|---|---|---|---|---|---|---|---|
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Passed | Completed | 3 | 7 | 17 | |
cee-rhel7 | Trilinos-atdm-cee-rhel7_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_shared_dbg | KokkosCore_UnitTest_CudaTimingBased_MPI_1 | Missing | Missing | 4 | 2 | 15 | |
This table would be listed right below the table:

Tests without issue trackers Failed (limited to `<max-rows>`): twoif=`<twoif>`

so it would appear near the top of the email, in the summary paragraph and in the list of tables.
NOTE: The "Consecutive" column in the other test tables would be replaced with the column "Days since last Failed", which gives a link to the most recent failure of the test.
NOTE: In this table, "Failed" means tests that had CDash test status="Failed" and not status="Not Run". Therefore, the number in the column "Non-pass Last 30 Days" would be for tests with status="Failed" and status="Not Run". (Hopefully that will not be too confusing. If it is, we could replace "Failed" with "Non-pass" to be consistent. But since this table 'twoirf' says "Failed", perhaps that is not confusing?)
Also, it is possible that the table 'twoirf' could replace the table 'twoif' and show all tests that failed over the last `<num-days>` days, including the tests failing for the current/reference testing day. But that would mean that you would not see the column "Consecutive Non-pass Days", so you would lose that information (which is critical to see that a test is failing every day and is not random). Also, by losing the table 'twoif' we don't get an actual accounting of how many tests actually failed for the current/reference testing day, so it may not be a good idea to remove the table 'twoif'.
The implementation would really not be that complicated. What you would do is take the cdash/queryTests.php URL and replace `date=YYYY-MM-DD` with `begin=<begin>&end=YYYY-MM-DD`, where `<begin>` is `<num-days>` days before the reference date `YYYY-MM-DD`, and download the non-passing test data from CDash for that range (in addition to the cdash/queryTests.php data with `date=YYYY-MM-DD`).
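The URL rewrite described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the tool's actual code; the function name `expand_query_url_to_date_range` and the example URL are hypothetical.

```python
# Hypothetical sketch: rewrite a cdash/queryTests.php URL so it queries a
# date range (begin=<begin>&end=YYYY-MM-DD) instead of a single testing day
# (date=YYYY-MM-DD), with <begin> being <num-days> days before the end date.
import datetime
import re

def expand_query_url_to_date_range(query_url, num_days):
    # Find the single-day 'date=YYYY-MM-DD' field in the URL.
    m = re.search(r"date=(\d{4}-\d{2}-\d{2})", query_url)
    if not m:
        raise ValueError("URL has no date=YYYY-MM-DD field: " + query_url)
    end_date = datetime.date.fromisoformat(m.group(1))
    # <begin> is <num-days> days before the reference date YYYY-MM-DD.
    begin_date = end_date - datetime.timedelta(days=num_days)
    return query_url.replace(
        "date=" + m.group(1),
        "begin={0}&end={1}".format(begin_date, end_date))

print(expand_query_url_to_date_range(
    "https://some-cdash-site/cdash/queryTests.php?project=Trilinos&date=2021-01-21",
    7))
# -> ...begin=2021-01-14&end=2021-01-21
```

All other query parameters and filters in the URL are left untouched, which is what makes this approach cheap to implement.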
To get the list of tests using that cdash/queryTests.php URL (only showing Failed tests) to show in the table "twoirf", the tool would:

- Get a unique list of tests [`<site>`, `<buildname>`, `<testname>`] over those `<num-days>` days (using the existing tested function `getUniqueTestsListOfDicts()` that needs to be moved into `CDashQueryAnalyzeReport.py`, or a specialized version of that function).
- From that unique list of tests [`<site>`, `<buildname>`, `<testname>`], select those that failed at least once in the last `<num-days>` days but are not failing for the current/reference testing day. A few simple filters can do that easily. And it is that list that becomes the list of tests that the tool would show in the table "twoirf".
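The selection step above can be sketched with simple filters over the downloaded test dicts. This is an illustrative sketch only; the function name, dict keys (`site`, `buildName`, `testname`, `status`), and sample data are assumptions, not the tool's actual data model.

```python
# Hypothetical sketch: from the nonpassing-test dicts downloaded for the
# begin..end range, select the unique [site, buildname, testname] triples
# that failed at least once in the window but are NOT failing on the
# current/reference testing day (those belong in 'twoif', not 'twoirf').
def select_recently_failed_not_failing_today(range_test_dicts, today_failing_keys):
    selected = {}
    for t in range_test_dicts:
        key = (t["site"], t["buildName"], t["testname"])
        if t.get("status") == "Failed" and key not in today_failing_keys:
            selected.setdefault(key, t)  # keep one entry per unique test
    return sorted(selected.keys())

range_tests = [
    {"site": "ride", "buildName": "cuda-dbg", "testname": "UnitTest_A", "status": "Failed"},
    {"site": "ride", "buildName": "cuda-dbg", "testname": "UnitTest_A", "status": "Failed"},
    {"site": "cee",  "buildName": "gnu-opt",  "testname": "UnitTest_B", "status": "Failed"},
]
today_failing = {("cee", "gnu-opt", "UnitTest_B")}
print(select_recently_failed_not_failing_today(range_tests, today_failing))
# -> [('ride', 'cuda-dbg', 'UnitTest_A')]
```

Using a dict keyed on the triple keeps the operation linear in the number of test dicts, which matters for the worst-case scenarios mentioned below.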
NOTE: We need to limit the number of tests shown in the table "twoirf" to `<max-rows>`, or it could end up getting test history for, and showing, thousands of tests after a really bad day in the previous `<num-days>` days.
NOTE: I don't think we want to show "Not Run" tests in the table "twoirf", because that would pollute the table if one or more builds had massive build errors over the last `<num-days>` days. I think we are only interested in failing tests in that table. If there is a persistent build error that results in "Not Run" tests, then we will see that in the table "Tests without issue trackers Not Run: twoinr=???" for the current testing day.
NOTE: To build the URL for this list of tests robustly, and to also filter out known system failures (that are passed to the cdash/queryTests.php filter), I think we should first implement and use the new input arguments in #348. But that is not strictly needed, since we can filter out "not run" tests by just taking the input URL for cdash/queryTests.php and replacing `date=YYYY-MM-DD` with `begin=<begin>&end=YYYY-MM-DD`. In fact, that is the easier solution.
NOTE: To get the list of tests that failed in the last `<num-days>` days but not today, it may be easier and cheaper to split the list of failing tests that failed in the last 7 days from the second cdash/queryTests.php query using `begin=<begin>&end=YYYY-MM-DD`, because to do that, we only need to split that sublist based on the `status` test field (i.e. those with status=Failed go into the table 'twoif' and those with status=Pass or status=Missing go into the table 'twoirf'). Other implementation approaches may be used as well, but we need to be careful to keep down the algorithmic complexity of this operation for worst-case scenarios (or the tool will take a long time if there are thousands of failing tests, either for the current testing day or over the last `<num-days>` days).
NOTE: This will be somewhat hard to write tests for, since it will require some dummy test data for this use case. But hopefully that will not be too hard to manufacture for the current set of reference builds and tests used in the automated testing. (Writing system-level tests for this tool is always harder than writing the production code, but we get very strong tests by doing so.)
@bartlettroscoe: Thanks for looking into this. I like proposed solution 1 above. Thinking about this more now -- in an effort to reduce the size of the emails and provide a general solution, would it be possible to adjust the 'Failed' link under the 'Status' column of the existing tables to link to the most recent failure within the given begin and end dates of the input URL for cdash/queryTests.php? To improve usability, if the input URL spans more than one day, could we make the link text 'Failed on `<date>`' as well as append ' to `<end-date>`' to the subject line? Would this be difficult to implement?
@e10harvey, I am not entirely sure what is being suggested above so we should chat through this. Can we set up a short meeting offline?
What @e10harvey is suggesting is updating the table 'twoif' to include the most recent failure within the date range of `--show-history-for-tests-without-issue-trackers-failed-in-last-x-days=7`. If the `cdash_analyze_and_report.py` input arguments span more than one day, we could make the link text 'Failed on `<date>`' as well as append ' to `<end-date>`' to the subject line. Here is an example of what the table would look like:
Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
---|---|---|---|---|---|---|---|---|
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_CudaInterOpInit_MPI_1 | Failed on 2021-01-21 | Completed (Failed on 2021-01-21) | 3 | 7 | 17 | |
cee-rhel7 | Trilinos-atdm-cee-rhel7_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_shared_dbg | KokkosCore_UnitTest_CudaTimingBased_MPI_1 | Failed on 2021-01-17 | Completed (Failed on 2021-01-17) | 2 | 2 | 15 | |
@bartlettroscoe: I updated your comment and deleted a duplicate of that comment.
@e10harvey, it just occurred to me that the "Consecutive ??? Days" column is ill-defined in this new 'twoirf' table, since we are now mixing in tests that could be passing, missing, or not-run (instead of failing) for the current/reference testing day.
One idea to address your desire to provide a link to the most recent failure, and to address the ill-defined "Consecutive ??? Days" column, is to replace that "Consecutive ??? Days" column with "Days since last Failure", showing a number and a link to that most recent failure. For that, I updated "Proposed Solution 1" above with that replaced column and the correct link for this example. That would not be too hard to implement and would hopefully be pretty clear while still providing a very compact table.
Let's discuss.
@e10harvey, let's discuss later today, but some issues with your Proposed Solution 2 compared to Proposed Solution 1 are:

- By replacing the "Status" and "Details" fields with `Failed on YYYY-MM-DD` instead of the current status for the current testing day, the data in the test dict for the "Status" and "Details" fields is actually being corrupted, which changes the meaning of those fields. That will break some internal code. This would make any other analysis of these tests (either by this tool or another tool that works on exported data) unable to function if it needs the current status of these tests. (This could be addressed by adding new fields and new columns in this table and leaving the test dict "Status" and "Details" fields alone.)
- With `Failed on YYYY-MM-DD`, one actually can't tell if that test is currently passing, failing, missing, or not-run (unless it is failing on the current testing day, but giving the date YYYY-MM-DD does not make that clear unless you realize that that is the current reference testing day). (This is really related to the above, but at the user level and not just the data-structure level.)
- By replacing short fields like `Passed` and `Completed` with longer fields like `Failed on YYYY-MM-DD` and `Completed (Failed on YYYY-MM-DD)`, this will force many rows to wrap lines that may not have wrapped before and make the tables harder to read.
- If one knows the tool was run with `--show-history-for-tests-without-issue-trackers-failed-in-last-x-days=7`, then fine, but there is no hint at all in the produced table itself that that is what "failed" means. Alternatively, the table in "Proposed Solution 1" with the name "Tests without issue trackers recently failed (at least once last 7 days, limited to 300): twoirf=2" is unambiguous and does not rely on the user having to know whether the argument `--show-history-for-tests-without-issue-trackers-failed-in-last-x-days=7` was being used.

NOTE: Once one starts clicking on links (especially the history links), all of this information will become clear in either proposal, so it is a matter of keeping the data structures correct and having the data be accurate and clearly displayed in the generated emails.
The experience with https://github.com/trilinos/Trilinos/issues/8759, where someone thought that the random test failures were not occurring anymore (because the tests all happened to pass the last few Sundays, and they did not look at the "Nonpassing tests last 30 days" column), suggests that as part of this we should also consider adding an option:
--show-history-for-tests-with-issue-trackers-failed-in-last-x-days=<num-days>
Then, any test with an issue tracker that failed at least once in the last `<num-days>` days would be listed with its test history in a table called something like:

Tests with issue trackers recently failed (at least once last `<num-days>` days): twirf=`<twirf>`
And actually, this would be more useful to display in GitHub Issue comments for the Grover tool. For issue trackers that have tests failing regularly (i.e. every day), we might set `<num-days>` to something very short, like 3. But for tests that fail randomly, we might set `<num-days>=30`. That would make clear that tests for that issue tracker have failed in the last 30 days.
But to make this most effective, those test failures need to be very specific and need to take into account some expected regexes of test output (and therefore we need to implement the 'expected_fail_regex' field, see TrilinosATDMStatus/TODO.txt). This makes things more complicated, but I think it is really needed if people don't carefully look at the existing test-history tables for randomly failing tests.
@ZUUL42, @jwillenbring, @prwolfe, @william76: Please thumbs up one of the following comments to indicate your preference:
I would be most curious to see if @zuul42 or @prwolfe have a preference. I do personally like @bartlettroscoe's comment about displaying the more detailed info using Grover, but that is just because of my typical interaction with the failures.
And actually, for reporting by Grover to the GitHub Issue, you would need to be careful about missing test results from missing builds. If a build has not been reporting tests for several days but just happened to report the day before, we don't want to give the impression that the tests have been passing for the last `<num-days>` days just because they did not fail in the last `<num-days>` days.
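One way to guard against that is to count missing days explicitly when summarizing a test's history, so missing results are never silently reported as passing. This is an illustrative sketch under assumed names; the daily-status representation is hypothetical, not Grover's actual data model.

```python
# Hypothetical sketch: summarize a test's daily history over the last
# <num-days> days, counting days with no reported result as "Missing"
# rather than treating them as passing days.
def summarize_history(history):
    # history: list of per-day statuses (newest first), e.g. "Passed",
    # "Failed", or "Missing" for days where the build did not report.
    return {
        "Passed":  sum(1 for s in history if s == "Passed"),
        "Failed":  sum(1 for s in history if s == "Failed"),
        "Missing": sum(1 for s in history if s == "Missing"),
    }

print(summarize_history(["Passed", "Missing", "Missing", "Passed", "Failed"]))
# -> {'Passed': 2, 'Failed': 1, 'Missing': 2}
```

A report built from such a summary can then say "passed 2 of the last 5 days (2 days missing)" instead of the misleading "no failures in the last 5 days".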
CC: @e10harvey
This Story is to scope out (and possibly implement) an extension to the `cdash_analyze_and_report.py` tool that lists out all of the tests without issue trackers, with test history, for a set of tests that have failed at least once over some previous time period but are passing for the reference testing day (provided with the `--date` argument). For example, a test [`<site>`, `<buildname>`, `<testname>`] that does not yet have an issue tracker associated with it may be passing for the reference `--date=YYYY-MM-DD` but may have (randomly?) failed 3 times over the previous 7 days. The current implementation of `cdash_analyze_and_report.py` would not display any information about that test. If someone was not looking at the emails for every day in the previous 7 days, they would not notice this (random) test failure.

Therefore, this story is to add a feature to `cdash_analyze_and_report.py` that lists out the test [`<site>`, `<buildname>`, `<testname>`] with its test history if it failed in the last `X` days but not the current day. This would need to include tests that pass or are missing (perhaps because their builds are missing) for the reference testing day.

Motivating Customer: The Trilinos Framework team wants to move to a process where a set of "Secondary" builds are triaged less frequently. To do that, they want to get one email at the end of every week that also includes information on the tests that failed in the last week but not on the reference testing day at the end of the week.
Caused by: https://sems-atlassian-son.sandia.gov/jira/browse/SEPW-281.
Related to: