adoptium / temurin

Eclipse Temurin™ project assets
https://adoptium.net/temurin

General Retrospective for September and October 2024 Releases #54

Closed adamfarley closed 5 days ago

adamfarley commented 3 months ago

Summary

A retrospective for all efforts surrounding the titular releases.

All community members are welcome to contribute to the agenda via comments below.

This will be a virtual meeting after the release, with at least a week of notice in the #release Slack channel.

On the day of the meeting we'll review the agenda and add a list of actions at the end.

Invited: Everyone.

Time, Date, and URL

Time: 3-4pm UTC

Date: Monday the 18th of November

URL: https://meet.google.com/uwc-iwjn-rqm

Details

Retrospective Owner Tasks (in order):

TLDR

Add proposed agenda items as comments below.

andrew-m-leonard commented 2 months ago

The build repo release branches don't have mandatory PR review, probably because the settings regex does not match them...?

andrew-m-leonard commented 2 months ago

build repo code freeze check for the release branch was not enabled, but then I thought, do we really need it, especially if we get the release branch mandatory review fixed?

andrew-m-leonard commented 2 months ago

Currently the dryrun tag is the tag previous to the suspected actual GA tag, since it's not easy to "reset" the auto-trigger. Maybe we ought to fix that...?

fyi, a bit naff!, but to do a trigger "reset" (since I had to do one for a failed dryrun trigger!) As a Jenkins "Admin":

andrew-m-leonard commented 2 months ago

getTestDependency was failing on temurin-compliance due to no authentication: https://github.com/adoptium/aqa-tests/issues/5589. This was failing in the July release as well, but failure of this stage does not fail the job, which means we use the workspace cache, if we have one, and whatever may be there!

smlambert commented 2 months ago

re: https://github.com/adoptium/temurin/issues/54#issuecomment-2344011663

> This was failing in the July release as well, but failure of this stage does not fail the job.. which means we use the workspace cache, if we have one

I do not think anything in the dependencies list gets used by the TC jobs (though it could have an effect if we are using TC Grinder to verify AQAvit tests; most dependencies do not change often, so cached versions are fine).

andrew-m-leonard commented 2 months ago

TRSS needs new JDK versions added before release week; release-openjdk23-pipeline was missing.

SL/Sept12 - now added

andrew-m-leonard commented 2 months ago

We should be more accurate with our release process terminology: "Publish updates to the containers to dockerhub" should be "Publish docker images to dockerhub".

sophia-guo commented 1 month ago

When doing the triage, the tap files of the grinder should be attached to the triage issue, for example https://github.com/adoptium/aqa-tests/issues/5598. That way the job https://ci.adoptium.net/view/Test_grinder/job/TAP_Collection can collect both the tap files of the pipeline job and the tap files of the grinder.

sophia-guo commented 1 month ago

For TRSS, if the rerun job passes, the corresponding test job status should be set to pass, so there is no need to do the extra triage. For example, in https://trss.adoptium.net/resultSummary?parentId=66e2f744d24e1b006e88e097 the aarch64_mac extended.openjdk rerun passed, so extended.openjdk should be set to success.

@adamfarley says: This issue has been raised here.

sophia-guo commented 1 month ago

For AQA triage, use the auto-generated rerun links of the rerun test job, which are already prepopulated with either the failed test targets or the failed test cases. For example: https://ci.adoptium.net/job/Test_openjdk23_hs_extended.openjdk_x86-64_windows_rerun/19/

smlambert commented 1 month ago

> For TRSS, if the rerun job passes, the corresponding test job status should be set to pass, so there is no need to do the extra triage. For example, in https://trss.adoptium.net/resultSummary?parentId=66e2f744d24e1b006e88e097 the aarch64_mac extended.openjdk rerun passed, so extended.openjdk should be set to success.

A quick check to make when triaging: look at the rerun.tap file on the Jenkins job; if it's green, there is nothing to do.

We should also have a different chiclet icon for this "state" where rerun job passes. Suggest a yellow chiclet with a small green circle in top right corner for that state and so forth. Related issue: https://github.com/adoptium/aqa-test-tools/issues/912
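The rule discussed above (a passing rerun means no extra triage, but is still worth showing as its own state) could be sketched as follows. This is only an illustration: the `EffectiveStatus` enum and both function names are hypothetical and do not reflect TRSS's actual data model or API.

```python
from enum import Enum


class EffectiveStatus(Enum):
    """Hypothetical status values; not TRSS's real schema."""
    PASSED = "passed"
    PASSED_ON_RERUN = "passed_on_rerun"  # the proposed extra "state"
    FAILED = "failed"


def effective_status(original_passed, rerun_passed=None):
    """Combine the original job result with its rerun result.

    rerun_passed is None when no rerun job exists.
    """
    if original_passed:
        return EffectiveStatus.PASSED
    if rerun_passed:
        return EffectiveStatus.PASSED_ON_RERUN
    return EffectiveStatus.FAILED


def needs_triage(status):
    """Only jobs that failed and were not rescued by a rerun need triage."""
    return status is EffectiveStatus.FAILED


# Example: the original job failed, but its rerun passed.
status = effective_status(original_passed=False, rerun_passed=True)
print(status.value, needs_triage(status))
```

Keeping "passed on rerun" distinct from a plain pass would let a dashboard render it differently while still skipping it during triage.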

sophia-guo commented 1 month ago

Almost no test jobs were triggered by openjdk-pipeline or evaluation-openjdk-pipeline during the September release (i.e. EA builds triggered nightly or weekly), because we set a window of around 10 days before and 5 days after a release in which no nightly test jobs run: https://github.com/adoptium/ci-jenkins-pipelines/blob/master/pipelines/build/common/trigger_beta_build.groovy#L53-L79. That might be fine for the January, March, July and September releases, but may not be good for the October and April releases.

Due to the scheduling of releases in September and October, as well as in March and April, there is a potential overlap that could result in gaps in testing. Specifically, with releases in March and September, followed closely by April and October, there may be minimal time available for comprehensive testing between those consecutive releases. As a result, critical tests may be rushed or omitted, impacting the stability of those releases. For example, the reproducible comparison tests on Linux were updated on Sept 6th, and after that the test was only run once, with jdk24, by Oct 2.
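To make the size of the gap concrete, here is a minimal sketch of the window arithmetic. It assumes the "around 10 days before and 5 days after" figures quoted above; the real logic lives in trigger_beta_build.groovy and may differ. The two release dates are illustrative placeholders, and the function names are hypothetical.

```python
from datetime import date, timedelta

# Assumed from the comment above; the actual Groovy logic may differ.
DAYS_BEFORE = 10
DAYS_AFTER = 5


def in_quiet_window(day, release_day, before=DAYS_BEFORE, after=DAYS_AFTER):
    """Return True if `day` falls inside the no-nightly-tests window."""
    start = release_day - timedelta(days=before)
    end = release_day + timedelta(days=after)
    return start <= day <= end


def untested_days(start, end, release_days):
    """Count days in [start, end] covered by any release's quiet window."""
    count = 0
    day = start
    while day <= end:
        if any(in_quiet_window(day, r) for r in release_days):
            count += 1
        day += timedelta(days=1)
    return count


# Illustrative release dates for two back-to-back releases.
releases = [date(2024, 9, 17), date(2024, 10, 15)]
start, end = date(2024, 9, 1), date(2024, 10, 31)
total = (end - start).days + 1
print(f"{untested_days(start, end, releases)} of {total} days without nightly tests")
# prints "32 of 61 days without nightly tests"
```

Under these assumptions, more than half of the two-month span has no nightly test runs, which matches the concern about consecutive releases.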

andrew-m-leonard commented 1 month ago

> Almost no test jobs were triggered by openjdk-pipeline or evaluation-openjdk-pipeline during the September release (i.e. EA builds triggered nightly or weekly), because we set a window of around 10 days before and 5 days after a release in which no nightly test jobs run: https://github.com/adoptium/ci-jenkins-pipelines/blob/master/pipelines/build/common/trigger_beta_build.groovy#L53-L79. That might be fine for the January, March, July and September releases, but may not be good for the October and April releases.
>
> Due to the scheduling of releases in September and October, as well as in March and April, there is a potential overlap that could result in gaps in testing. Specifically, with releases in March and September, followed closely by April and October, there may be minimal time available for comprehensive testing between those consecutive releases. As a result, critical tests may be rushed or omitted, impacting the stability of those releases. For example, the reproducible comparison tests on Linux were updated on Sept 6th, and after that the test was only run once, with jdk24, by Oct 2.

To add some extra info: for example, the jdk-21.0.5+7 and +8 EA builds both landed during the Sept release "disabled test" period. jdk-21.0.5+6 EA was the last build run with tests prior to release, and jdk-21.0.5+9 came after.

smlambert commented 1 month ago

October release

andrew-m-leonard commented 1 month ago

October: Care needs taking when publishing binaries to check if a platform was rebuilt, for example both jdk17 macAarch64 and jdk17 pLinux were rebuilt, but binaries were still present on the original pipeline. Mac was initially published from the wrong one.

Can we remove bad build artifacts when we rebuild?

andrew-m-leonard commented 1 month ago

October: we forgot to publish JDK11 aarch64 mac even though it had been finished for several days.

andrew-m-leonard commented 1 month ago

status by platform document https://github.com/adoptium/temurin/issues/60 is not always being updated... I think we need to automate this, it's too easy to forget or update wrongly

andrew-m-leonard commented 1 month ago

Mistakes were made in selecting publish job links, meaning a platform didn't get published when we said it was, due to clicking on WindowsX64 rather than Windowsx32...
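One way such slips could be caught automatically is to compare what should have been published against what actually was. The sketch below is a hypothetical illustration only: the function name, the data shape and the version and platform strings are made up, and it does not call any real Adoptium or Jenkins API.

```python
def unpublished(expected, published):
    """Return expected (version, platform) pairs that are not yet published."""
    return sorted(set(expected) - set(published))


# Illustrative data; a real check would need to obtain these lists from
# the release checklist and from the published assets.
expected = [
    ("jdk17", "x64_windows"),
    ("jdk17", "x86-32_windows"),
    ("jdk11", "aarch64_mac"),
]
published = [("jdk17", "x64_windows")]

for version, platform in unpublished(expected, published):
    print(f"NOT PUBLISHED: {version} {platform}")
```

A check like this would flag both a publish job launched for the wrong platform and a platform that was simply forgotten.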

smlambert commented 1 month ago

aarch64 windows was added as a platform for jdk21 and jdk23, but there were several changes required for it to be ready.

This could have happened well ahead of the release period (as per the plan discussed in a past PMC meeting). It could also have been seen during a dry run, but no dry run was performed. (Were other checklist items also not completed? It seemed the release champion was not always present and, in that event, missed the opportunity to communicate that to others and ensure tasks were delegated.)

andrew-m-leonard commented 4 weeks ago

We need to invest resource in making the Installers publishing a lot better and automated. In its current form it mentally scars you !!

sophia-guo commented 3 weeks ago

https://github.com/adoptium/aqa-tests/issues/5692#issuecomment-2429722283

Some arm32 jdk8 tests used to work on non-container agents. It seems we don't have those any more: https://ci.adoptium.net/label/ci.role.test&&sw.os.linux&&hw.arch.aarch32/. If the tests can only pass on non-container agents, we might need to do a vendor exclude, since our Eclipse machine farm has limitations: https://github.com/adoptium/aqa-tests/blob/master/openjdk/excludes/vendors/eclipse/ProblemList_openjdk8.txt

andrew-m-leonard commented 3 weeks ago

I think this release has demonstrated the necessity of a dry-run. The issue with the "installers" and the new Azure VMs also possibly demonstrates the need for a dry-run installers upload?

sxa commented 2 weeks ago

NOTE: Proposal to move the releasing guide to a wiki in either the build or one of the top level repositories: https://github.com/adoptium/temurin-build/pull/3993

sxa commented 2 weeks ago

Memo to self: Discuss deadlock potential with x64 nodes during installer process.

adamfarley commented 5 days ago

Actions

Raise issues (@adamfarley)

Other actions:

Andrew:

Someone:

Other data:

adamfarley commented 5 days ago

Next retrospective - https://github.com/adoptium/temurin/issues/64