adoptium / aqa-test-tools

Home of Test Results Summary Service (TRSS) and PerfNext. These tools are designed to improve our ability to monitor and triage tests at the Adoptium project. The code is generic enough that it is extensible for use by any project that needs to monitor multiple CI servers and aggregate their results.
Apache License 2.0

allTestsInfo may be incorrect #860

Open sophia-guo opened 3 months ago

sophia-guo commented 3 months ago

@smlambert noticed that a recent release run shows that allTestsInfo may be incorrect for some builds.

Example: the jdk22 release mac extended.openjdk job should have around 16+39+38 tests, but TRSS shows only 3 (Pre and Post tests are not counted as tests).

Screenshot 2024-03-27 at 4 27 31 PM

https://trss.adoptium.net/allTestsInfo?buildId=65fb17d643ff67006ef89f92&limit=5&hasChildren=true

It happened to all jobs with rerun. Only the tests from the rerun are shown. The expected behaviour is that tests from the original run and the rerun are combined.
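To illustrate the expected behaviour, here is a minimal sketch. It is not TRSS code: it assumes each run is available as a mapping from test name to result, and the names `merge_runs`, `original`, and `rerun` (plus the test names) are hypothetical.

```python
def merge_runs(original, rerun):
    """Combine results so that a rerun overrides, but never erases, the original run."""
    merged = dict(original)  # start from every test in the original run
    merged.update(rerun)     # a rerun result replaces the original result for the same test
    return merged


# Hypothetical data: three tests ran originally, one failed and was rerun.
original = {"test_a": "PASSED", "test_b": "FAILED", "test_c": "PASSED"}
rerun = {"test_b": "PASSED"}

combined = merge_runs(original, rerun)
print(len(combined), combined["test_b"])
```

With this kind of merge, all three tests stay visible, whereas showing only `rerun` would report a single test.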

Screenshot 2024-03-27 at 4 44 15 PM

This might be related to a recent update in TKG: https://github.com/adoptium/TKG/issues/510.

llxia commented 3 months ago

> It happened to all jobs with rerun.

sanity.openjdk looks correct to me https://trss.adoptium.net/allTestsInfo?buildId=65fb17d643ff67006ef89f93&limit=5

image

smlambert commented 3 months ago

Indeed, which is why I raised this issue: to check what is different about the extended jobs compared with sanity. One difference is that there are 3 child jobs under extended.openjdk, whereas the sanity.openjdk results are parsed from a single console log.

sophia-guo commented 3 months ago

Actually, it seems hasChildren makes the difference. The problem happens when hasChildren=true. For extended.openjdk the link is https://trss.adoptium.net/allTestsInfo?buildId=65fb17d643ff67006ef89f92&limit=5&hasChildren=true. For sanity.openjdk, hasChildren=false and the link is https://trss.adoptium.net/allTestsInfo?buildId=65fb17d643ff67006ef89f93&limit=5&hasChildren=false. It might be worth checking how the hasChildren flag or variable is defined and changed.

llxia commented 3 months ago

From TRSS history, the Jenkins test build https://ci.adoptium.net/job/Test_openjdk22_hs_extended.openjdk_x86-64_mac_testList_0/3/ got created on Sep 22, 2023. The root build was https://ci.adoptium.net/job/build-scripts/job/openjdk22-pipeline/69 (no longer exists).

Then Test_openjdk22_hs_extended.openjdk_x86-64_mac_testList_0/3 got deleted from Jenkins. When the build was triggered again, Test_openjdk22_hs_extended.openjdk_x86-64_mac_testList_0/3 got re-created on Mar 19, 2024. However, TRSS still has the old build history, so it treats the new build as an update of the old record because the build already exists in the DB. As a result, Test_openjdk22_hs_extended.openjdk_x86-64_mac_testList_0/3 is referenced/linked to openjdk22-pipeline/69 in TRSS, not to the new build.

https://trss.adoptium.net/api/getData?buildName=Test_openjdk22_hs_extended.openjdk_x86-64_mac_testList_0&buildNum=3

_id: "650d1a6de1aaa4007424f149",
url: "https://ci.adoptium.net/",
buildName: "Test_openjdk22_hs_extended.openjdk_x86-64_mac_testList_0",
buildNameStr: "Test_openjdk22_hs_extended.openjdk_x86-64_mac_testList_0",
buildNum: 3,
rootBuildId: "650cd2f1e1aaa400742434fc",
parentId: "650d186de1aaa4007424e836",
type: "Test",
status: "Done",
...
timestamp: 1695342033261,
versions: { }

I think we had this situation before.
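As a quick check of the record above, here is a small sketch that decodes its `timestamp`. It assumes the value is epoch milliseconds (which matches the Sep 22, 2023 creation date); the variable names are only illustrative.

```python
from datetime import date, datetime, timezone

stored_timestamp_ms = 1695342033261  # `timestamp` from the getData record above
recreated_on = date(2024, 3, 19)     # date the Jenkins build was re-created

# Assumption: the stored value is milliseconds since the Unix epoch.
stored_on = datetime.fromtimestamp(stored_timestamp_ms / 1000, tz=timezone.utc).date()

# A stored record older than the re-created build points at the old build history.
is_stale = stored_on < recreated_on

print(stored_on.isoformat(), is_stale)
```

The stored date comes out as 22 September 2023 (UTC), well before the re-creation date, which is consistent with TRSS still serving the old record.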

smlambert commented 3 months ago

Aw sucky. I am not sure we can guarantee that jobs won't be deleted underneath Jenkins. I remember answering the question "can we delete these", and saying yes, but I had assumed the deletion would happen in the Jenkins GUI, which would have meant that the job ID count would not have been lost. They must have been removed by logging on to the Jenkins server and deleting the workspaces.

Should we consider using a key that includes the parent IDs (for the TRSS DB index)? sigh...
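A rough sketch of that idea, using an in-memory dict rather than the real TRSS DB. It assumes the current lookup is effectively keyed on `url`, `buildName`, and `buildNum`; that key, the `upsert` helper, and the new parent ID value are hypothetical, and only the old `parentId` comes from the record above.

```python
def upsert(db, key_fields, build):
    """Insert or update `build` in `db` using `key_fields` as the key; return True on update."""
    key = tuple(build[field] for field in key_fields)
    is_update = key in db
    if is_update:
        # Assumed behaviour: an update refreshes the status but keeps the old parent link.
        db[key]["status"] = build["status"]
    else:
        db[key] = dict(build)
    return is_update


NAME_KEY = ("url", "buildName", "buildNum")  # assumed current key
PARENT_KEY = NAME_KEY + ("parentId",)        # proposed key including the parent ID

old = {
    "url": "https://ci.adoptium.net/",
    "buildName": "Test_openjdk22_hs_extended.openjdk_x86-64_mac_testList_0",
    "buildNum": 3,
    "parentId": "650d186de1aaa4007424e836",
    "status": "Done",
}
new = dict(old, parentId="hypothetical-new-parent-id")  # re-created build, new parent

by_name = {}
upsert(by_name, NAME_KEY, old)
collided = upsert(by_name, NAME_KEY, new)      # treated as an update of the old record

by_parent = {}
upsert(by_parent, PARENT_KEY, old)
collided_with_parent = upsert(by_parent, PARENT_KEY, new)  # stored as a separate record

print(collided, len(by_name), collided_with_parent, len(by_parent))
```

One caveat with this approach: the parent ID has to be known at lookup time, so any query that only has the build name and number would need to handle multiple matches.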

sxa commented 3 months ago

> remember answering the question "can we delete these", and saying yes, but I had assumed the deletion would happen in the Jenkins GUI

Ah gotcha. Having read this, this issue is potentially related to the removal of the testList jobs as per https://github.com/adoptium/infrastructure/issues/2774#issuecomment-1954462286. That would make sense, and it would have reset the counters to zero as the jobs will have been regenerated on demand.

The issue was not the individual job runs (so removing those would have made no difference in terms of solving the problem) but the testList jobs themselves which needed to be regenerated as they were the cause of the parameter errors in the logs, which is why they were deleted as I wasn't aware of a way to refresh them all (They seemed to be causing warnings regardless of whether they were being invoked from what I could see). While in this case I did do the work directly on the filesystem to avoid a lot of clicking, I believe that deleting the job definition via the UI (which is what would have been necessary to force regen) would have had the same effect in terms of "losing" the last build number, since it would have removed all trace of the job including that information.

llxia commented 3 months ago

> the testList jobs themselves which needed to be regenerated as they were the cause of the parameter errors in the logs, which is why they were deleted as I wasn't aware of a way to refresh them all (They seemed to be causing warnings regardless of whether they were being invoked from what I could see)

Just to clarify: if a Jenkins test job gets regenerated, will the previously executed job history cause warnings? If so, that should be a Jenkins issue, since the previously executed job history should be static.