adoptium / temurin

Eclipse Temurin™ project assets
https://adoptium.net/temurin

General Retrospective for March/April 2024 Releases #28

Closed adamfarley closed 4 months ago

adamfarley commented 8 months ago

Summary

A retrospective for all efforts surrounding the titular releases.

All community members are welcome to contribute to the agenda via comments below.

This will be a virtual meeting after the release, with at least a week of notice in the #release Slack channel.

On the day of the meeting we'll review the agenda and add a list of actions at the end.

Invited: Everyone.

Time, Date, and URL

Time: 3pm BST, 10am EDT.

Date: Tuesday the 7th of May, 2024.

URL: https://eclipse.zoom.us/j/82423919203?pwd=jcs9cimNWYIflSqChjnT5U5Aj62sSx.1

Meeting ID: 824 2391 9203

Passcode: 339984

Details

Retrospective Owner Tasks (in order):

TLDR

Add proposed agenda items as comments below.

smlambert commented 6 months ago

Scary message when publishing the aarch64_mac binaries. I believe it did the right thing (pushing 31 artifacts to the releases repo), but it found 62 and reported UNSTABLE: https://ci.adoptium.net/job/build-scripts/job/release/job/refactor_openjdk_release_tool/8424/

dryrun did not indicate any issues to be aware of https://ci.adoptium.net/job/build-scripts/job/release/job/refactor_openjdk_release_tool/8423/

sophia-guo commented 6 months ago

https://github.com/adoptium/temurin/issues/28#issuecomment-2010423578

It happened to all platforms.

The status is UNSTABLE because the check is counting some of the -ea tagged artifacts, which means the check file needs updating. Maybe we should consider whether there is a better or automated way to check the numbers?
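
As a minimal sketch of a count check that ignores early-access artifacts: the `-ea` marker pattern, the function name, and the file names below are assumptions for illustration, not the logic of the actual release tool.

```python
import re

# Assumption: early-access artifacts carry "-ea" in their file name.
EA_MARKER = re.compile(r"-ea(?=[-_.+]|$)")

def check_artifact_count(names, expected):
    """Count non-EA artifacts and compare against the expected number.

    Returns (ok, found) so the caller can report the mismatch.
    """
    found = sum(1 for name in names if not EA_MARKER.search(name))
    return found == expected, found

# Hypothetical file names, for illustration only.
artifacts = [
    "example-jdk_aarch64_mac_22_36.tar.gz",
    "example-jdk_aarch64_mac_22_36.tar.gz.sig",
    "example-jdk_aarch64_mac_22_36-ea.tar.gz",
    "example-jdk_aarch64_mac_22_36-ea.tar.gz.sig",
]
ok, found = check_artifact_count(artifacts, expected=2)
print(ok, found)
```

Running the same function in the dry run would let a mismatch surface before the real publish.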

jerboaa commented 6 months ago

Something went wrong with publishing source tarballs for the Jan 2024 update. See https://github.com/adoptium/adoptium-support/issues/1003 for details. We should add some verification to make sure that the source tarball is there for every release.
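
A verification of this kind could be as small as the sketch below. It assumes the published asset names are already available as a list of strings and that source archives can be recognised by a `sources` marker in a `.tar.gz` name; both the marker and the example names are hypothetical.

```python
def has_source_archive(asset_names, marker="sources"):
    """Return True if at least one .tar.gz asset name contains the marker."""
    return any(
        marker in name and name.endswith(".tar.gz")
        for name in asset_names
    )

# Hypothetical asset lists, for illustration only.
complete = ["example-jdk_x64_linux.tar.gz", "example-jdk-sources.tar.gz"]
incomplete = ["example-jdk_x64_linux.tar.gz"]

print(has_source_archive(complete))
print(has_source_archive(incomplete))
```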

sophia-guo commented 6 months ago

Also, to publish the binary you can try the rerun link in the lower part of the page https://ci.adoptium.net/job/build-scripts/job/release-openjdk22-pipeline/5/, which is enabled for this release.

Dry runs are triggered by the pipeline job itself. If the dry run fails, the rerun link will be "Dry run RELEASE Publish temurin jdk-22+36 mac x64", which triggers a dry run. Otherwise the rerun link will be "RELEASE Publish temurin jdk-22+36 mac x64" and there is no need to do the dry run.

smlambert commented 6 months ago

> Also to publish the binary you can try the rerun link

That is what I clicked to run first the dry run (8423), then the release run (8424). I expected the release checks that we have in place to work, but I guess they do not take into account the presence of the EA artifacts.

By the way, I very much LOVE having the release links available, as I will never get the regex wrong again! Now we just need to update the checks to be a bit more specific and handle gracefully or remove the EA artifacts.

smlambert commented 6 months ago

With new feature release, need to ensure aqa-tests JCK configs updated (and likely shift to using a template that does not require duplicating configs for each new version).

smlambert commented 6 months ago

With new feature release, check if Version List on website needs updating? https://github.com/adoptium/adoptium.net/issues/2731

sophia-guo commented 6 months ago

> That is what I clicked to run first the dry run (8423), then the release run (8424).

I'm a little bit confused. The dry run was triggered by the pipeline and succeeded. Was 8423 a double check?

smlambert commented 6 months ago

> I'm a little bit confused. The dry run was triggered by the pipeline and succeeded. Was 8423 a double check?

Yes, and also because I did not find the output from the original dry run quickly

sophia-guo commented 6 months ago

Would it be helpful to add the dry-run build links?

smlambert commented 6 months ago

Downside of everyone and their dog using Grinders for all kinds of good work is that it is more difficult to spot the ones launched to complete release triage. (no action required on this comment, good bookkeeping can cover it, or we advise folks doing dev work to use Grinder_Dev, or some such).

smlambert commented 6 months ago

> Would it be helpful to add the dry-run build links?

I think the dryruns for publishing were there to help deal with the great potential for human error, which is essentially removed by the addition of the prepopulated 'quick links' found at the bottom of the parent pipeline job.

If the dryruns did some of the verification that happens during the actual publish (checking the right number of artifacts exist), and subsequently fail or report if there was an issue, they would have a purpose. Without the verification checks happening in the dry run, there seems little need to run an automated dry run.

smlambert commented 6 months ago

The release has been chugging along smoothly enough to allow development work to continue alongside it. The unfreezing of master branches in the Temurin project is also a good thing.

smlambert commented 6 months ago

Triggering the pipeline ahead of the -ga tag was a good call. The upstream -ga tag did not show up until much later (~full day) than we triggered the pipeline (Wed/20th).

smlambert commented 6 months ago

https://github.com/adoptium/aqa-tests/issues/5156#issuecomment-2025508317 - for follow-up AQAvit actions

smlambert commented 6 months ago

Release notes are not being served up via the API (so they are not showing up on the website). Slack msg: am I missing a step that I cannot find instructions for?

smlambert commented 6 months ago

Why is EclipseMirror job on the TC Jenkins server?
Think we can hook that checklist action into a post build job, along with many other tasks. (related: https://github.com/adoptium/ci-jenkins-pipelines/issues/610#issuecomment-2027151112)

smlambert commented 6 months ago

If we are clearing out / deleting jobs on Jenkins, let's do it via the Jenkins API (which keeps the JobID history), not by deleting the workspace directly on the server (where JobID history is not kept and the ID count starts again, causing repeated/duplicated IDs). Duplicated IDs then lead to a problem with seeing the new runs on TRSS, which uses those jobIDs as indices in the DB.

sxa commented 6 months ago

> Why is EclipseMirror job on the TC Jenkins server? Think we can hook that checklist action into a post build job, along with many other tasks. (related: adoptium/ci-jenkins-pipelines#610 (comment))

Because it contains a secret credential that we were only allowed to have on the eclipse-managed server.

sxa commented 6 months ago

> If we are clearing out / deleting jobs on Jenkins, let's do it via the Jenkins API (which keeps the JobID history), not by deleting the workspace directly on the server (where JobID history is not kept and the ID count starts again, causing repeated/duplicated IDs). Duplicated IDs then lead to a problem with seeing the new runs on TRSS, which uses those jobIDs as indices in the DB.

See comment on https://github.com/adoptium/aqa-test-tools/issues/860#issuecomment-2034589098 - I don't believe that deleting the job definition via any means would have met the requirements here and retained those identifiers.

sxa commented 5 months ago

[Update Made]

The guide to creating new mirror releases should include archiving the non-u release when the u mirror is created.

sxa commented 5 months ago

[Update made]

The steps in https://github.com/adoptium/temurin-build/wiki/Creating-new-jdkNNu-(updates)-repro-mirror-from-the-jdkNN-release-mirror#how for manually doing the initial population of the mirror should not be required based on this Slack thread, so the docs should reflect that (and possibly put the manual steps in a <summary> section). Note also that after the mirror is created it can take up to three hours for the permissions to start working if you need to do it manually. Also, as per the thread, you may see this error the first time you run a new mirror job to populate the repository, which will cause the job to fail. It seems to go through OK on a second attempt to run the mirror job:

+ git rebase skara/master master
fatal: no such branch/commit 'master'
Build step 'Execute shell' marked build as failure

sxa commented 5 months ago

[UPDATE MADE]

Also the doc on creating the generator should explicitly state that while the title of the job should have the u as appropriate, the JAVA_VERSION should NOT have the u as it gets added later and will result in this: hudson.remoting.ProxyException: java.nio.file.NoSuchFileException: /home/jenkins/workspace/build-scripts/utils/evaluation-pipeline_jobs_generator_jdk22u/pipelines/jobs/configurations/jdk22uu_pipeline_config.groovy

ALSO: Memo to self: release-pipeline-generator kicks off regen of the top level release-openjdkXX-pipeline jobs before initiating each of the versioned ones underneath it (sequentially) so they don't need to be done separately
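
The doubled `u` is easy to reproduce in a small sketch. The two functions below are hypothetical stand-ins, assuming only what the comment states: that the generator appends the `u` itself when building the configuration file name.

```python
def config_file_name(java_version):
    """Hypothetical stand-in: the generator always appends 'u' itself."""
    return f"{java_version}u_pipeline_config.groovy"

def normalise_java_version(java_version):
    """Strip one trailing 'u' so it ends up in the file name exactly once."""
    return java_version[:-1] if java_version.endswith("u") else java_version

# Passing the version with the 'u' reproduces the doubled suffix.
print(config_file_name("jdk22u"))
# Normalising first gives the intended name.
print(config_file_name(normalise_java_version("jdk22u")))
```

A guard like `normalise_java_version` is one option; documenting that JAVA_VERSION must not carry the `u` is the other.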

sxa commented 5 months ago

[PR link]

Add a note to the checklist or RELEASING.md to indicate that we generally use the same AQA branch for March+April and for Sept+Oct (subject to updating the JDKxx_BRANCH name for the "new" March/Sept release).

sxa commented 5 months ago

Noting that on my initial attempt to set the dryrun tags I made it jdk-17.0.11+7-dryrun-ga (i.e. including the +7 build identifier). This will cause problems, and such a tag will need to be deleted prior to running the pipelines or you'll get something like this from openjdk_pipeline.groovy:

[INFO] Resolved jdk-17.0.11-dryrun-ga to upstream build tag jdk-17.0.11+6jdk-17.0.11+6-dryrun-ga
[Pipeline] echo
[ERROR] scmReference does not match with any JDK branch in testenv.properties in aqa-tests release branch. Please update aqa-tests v1.0.1-release release branch. Set the current build result to FAILURE!

Also noting that if the pipeline does fail after being triggered, the workspace/tracking file on the jenkins worker node will need to be manually updated or you won't be able to re-trigger as the job uses that for its status and will not re-trigger the same underlying tag twice unless it's manually fixed.

EDIT: I'm not sure why but the jdk-17.0.11+6-dryrun-ga seemed to re-appear and caused the same issue when I did a second dry-run. Same happened for other releases I'd done it for. EDIT: It was because the jenkins workspace machine still had a cache of the old tag so we were pushing it on every update
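
A pre-flight check on the tag name could catch the stray build identifier before any pipeline runs. This is a sketch only: the pattern is an assumption based on the `jdk-17.0.11-dryrun-ga` form quoted above and does not cover other tag styles (for example jdk8u).

```python
import re

# Assumed shape: "jdk-" + dotted version + "-dryrun-ga", with no "+N" build part.
DRYRUN_TAG = re.compile(r"jdk-\d+(?:\.\d+)*-dryrun-ga")

def is_valid_dryrun_tag(tag):
    """Return True only if the whole tag matches the assumed dry-run shape."""
    return DRYRUN_TAG.fullmatch(tag) is not None

print(is_valid_dryrun_tag("jdk-17.0.11-dryrun-ga"))
print(is_valid_dryrun_tag("jdk-17.0.11+7-dryrun-ga"))
```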

sxa commented 5 months ago

The new u release (jdk22u) does not have any tags which can be used as an equivalent of the -dryrun-ga release, therefore we need to put in a fix to allow the dryrun process to run on jdk22u (and the same for subsequent STS releases). At the moment I'm going to use 22.0.0+0 because that will never be used, but we should consider what to do for future versions, since it will not be as simple to insert something similar between 22.0.1.x and 22.0.2.y. Ref: https://github.com/adoptium/mirror-scripts/pull/50

_EDIT: Noting that we also require a corresponding jdk-22.0.0+0_adopt tag to be created, but NOT on the same commit, otherwise you hit the issue from the previous comment. If you have to retag because you made it the same, be sure that the mirror jobs do not have the cached version of the old tag, as it will cause a failure._

EDIT: We got a failure in the create_installer_windows job which said SOURCE Dir not found / failed (longer snippet below). From @andrew-m-leonard: "I believe that will be because the tag “jdk-22.0.0+0” does not meet the expected format for jdk-22.0.1, we are building jdk22u HEAD which is 22.0.1 which is what the version string will be and is what the installer build expects…"

looking for .\SourceDir\OpenJDK-Latest\hotspot\x64\jdk-22.0.1+null
SOURCE Dir not found / failed
Listing directory :
F:\workspace\workspace\build-scripts\release\create_installer_windows\wix\SourceDir\OpenJDK22
F:\workspace\workspace\build-scripts\release\create_installer_windows\wix\SourceDir\OpenJDK22\hotspot
F:\workspace\workspace\build-scripts\release\create_installer_windows\wix\SourceDir\OpenJDK22\hotspot\x64
F:\workspace\workspace\build-scripts\release\create_installer_windows\wix\SourceDir\OpenJDK22\hotspot\x64\jdk-22.0.0+0
F:\workspace\workspace\build-scripts\release\create_installer_windows\wix\SourceDir\OpenJDK22\hotspot\x64\jdk-22.0.0+0\bin

sxa commented 5 months ago

It seems that the code behind the *-openjdk22-pipeline groovy scripts prefers to pick up a non-u configuration. I added 22u in the configurations dir and hadn't removed 22, so the dryruns triggered the jdk22- jobs using the jdk22 repository instead of the right one. Fixed by removing the jdk22 files from the configurations dir in this PR. We should consider whether preferring the non-u version is correct (it's hard to envision a scenario where it would be, IMHO), and we should definitely cover this in the wiki page to ensure that new u releases are done with a rename instead of creating new configurations alongside the non-u versions.
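
For discussion, the alternative lookup order (u first, non-u as fallback) can be sketched as follows. `pick_config` is a hypothetical helper, not the groovy code in question; the file name pattern is taken from the jdk22uu error in an earlier comment.

```python
def pick_config(available, major):
    """Prefer the jdkNNu configuration and fall back to jdkNN."""
    for candidate in (f"jdk{major}u", f"jdk{major}"):
        name = f"{candidate}_pipeline_config.groovy"
        if name in available:
            return name
    return None

# Both files present, as in the situation described above.
both = {"jdk22_pipeline_config.groovy", "jdk22u_pipeline_config.groovy"}
print(pick_config(both, 22))
```

With this order, leaving the old non-u file in place would not redirect the jobs to the jdk22 repository, although renaming remains the cleaner process.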

sxa commented 5 months ago

[PR link - code freeze won't work with +NN so .NN is the correct one to use]

Should we use vYYYY.MM.NN or vYYYY.MM+NN for the build branches? The Releasing guide is ambiguous (which resulted in me ending up with both at one point).
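
If the vYYYY.MM.NN form is the one adopted, a naming check along these lines could remove the ambiguity. It is a sketch that assumes a four-digit year and two-digit month and sequence number; the function name is hypothetical.

```python
import re

# Assumed shape: "v" + 4-digit year + "." + 2-digit month + "." + 2-digit number.
BUILD_BRANCH = re.compile(r"v\d{4}\.\d{2}\.\d{2}")

def is_valid_build_branch(name):
    """Return True if the whole branch name matches vYYYY.MM.NN."""
    return BUILD_BRANCH.fullmatch(name) is not None

print(is_valid_build_branch("v2024.04.01"))
print(is_valid_build_branch("v2024.04+01"))
```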

sxa commented 5 months ago

Mirror scripts, if left to their own devices to populate a new u repository, do not include the README.JAVASE marker (and potentially not any other patches) in the dev/release branches.

EDIT: Fixed by https://github.com/adoptium/mirror-scripts/pull/51

sxa commented 5 months ago

Code freeze template for slack needs to be adjusted to indicate that not all build repositories are affected by it.

sxa commented 5 months ago

We should look at the criteria we use for disabling tests on the automated regular runs to ensure they do not cause problems with the dry runs: https://github.com/adoptium/ci-jenkins-pipelines/issues/1007

(Noting that this cycle was a bit special as I was trying to run all five releases in parallel to check for capacity and timings)

sxa commented 5 months ago

The sign_verification job is set to run on a test machine, and can therefore incur delays during a release cycle while such a machine becomes available (this could be a while if it's the first released version and it gets queued up behind test jobs for other versions). There are delays of 4-5 hours for some jobs at the time of writing. This job also seems to run outside a pipeline "block" in the jdkXX--pipeline job, so it may not have been too obvious previously that it was holding things up.

sxa commented 5 months ago

The release checklist has sections for the JDK8, 11 and "XX" releases - we should explicitly add 17 and 21 LTS releases in there too.

andrew-m-leonard commented 5 months ago

> The sign_verification job is set to run on a test machine, and can therefore incur delays during a release cycle while such a machine becomes available (this could be a while if it's the first released version and it gets queued up behind test jobs for other versions). There are delays of 4-5 hours for some jobs at the time of writing. This job also seems to run outside a pipeline "block" in the jdkXX--pipeline job, so it may not have been too obvious previously that it was holding things up.

We should move the signVerify() call to before the aqaTests are kicked off, so that it is done first.

adamfarley commented 5 months ago

We should make sure that Aqa-Test triage issues ([example](https://github.com/adoptium/aqa-tests/issues/5213)) have the "comments" links (in the description) deleted if the platform table is being copied from elsewhere.

This is to prevent cases where comment links take a person to the issue for a previous release.

sxa commented 5 months ago

[PR link]

The checklist currently has this:

> Calculate the "expected" openjdk build tags for the releases being published, and update all the JDKnn_BRANCH values in the testenv.properties

This will need to be updated in light of the updates that allow the correct tag to be autodetected. At present you do still need to have it set to the dryrun-ga ones for the dry-run process, but then switch to -ga for release time.

sxa commented 5 months ago

[PR link]

Checklist updates:

sxa commented 5 months ago

We should clean up the old git-hg jobs from https://ci.adoptium.net/job/git-mirrors/ as they are only likely to cause confusion now. Do we even need the adoptium subfolder?

sxa commented 5 months ago

Add an item to the process to ensure that the next release cycle is listed in the Adoptium calendar so that the Eclipse Foundation (EF) is aware.

sxa commented 5 months ago

[PR link]

For jdk8u/arm32 in the releasing guide:

andrew-m-leonard commented 5 months ago

The new ready-made Publish link is very useful.

sophia-guo commented 5 months ago

No Orka machine was triggered for the label hw.arch.x86&&(sw.os.osx||sw.os.mac)&&ci.role.perf&&!sw.os.osx.10_14, which is required for the jdk22 perf tests.

[hw.arch.x86&&(sw.os.osx||sw.os.mac)&&ci.role.perf&&!sw.os.osx.10_14](https://ci.adoptium.net/label/hw.arch.x86&&%28sw.os.osx%7C%7Csw.os.mac%29&&ci.role.perf&&!sw.os.osx.10_14)

This passed on the 13th and earlier on an Orka system, but no Orka machine was available for this release.

smlambert commented 5 months ago

Is there a naming convention that one should follow for build branches for release?

Also, are we freezing the release branches or still freezing master?

sxa commented 5 months ago

@smlambert Naming is covered in a previous comment for discussion as the doc is currently ambiguous

sxa commented 5 months ago

Notes on blog post production:

EDIT: Slack message from Shelley indicates that it's a manual lift from the appropriate page under the upstream advisories page for now so that should be used for the guides

sxa commented 5 months ago

Ensure we define processes for the installers for:

smlambert commented 5 months ago

Should we be creating an issue or PR for the "next" release? (Checklist says PR, but we created an issue for this one)

I originally put PR, but think we should change it to issue, as it does what we need it to do (be a placeholder for 'new and noteworthy' notes between release periods) and also does not tie the originator of the PR into being the next blog post author (pen named PMC).

sxa commented 5 months ago

> I originally put PR, but think we should change it to issue

Noting that I'm planning to avoid creating either just now until we do the retrospective and have the discussion on this (I'm writing this to remind me that I need to do it when we're going through these comments ;-) )

sxa commented 5 months ago

"Full update" on the website had to be forced to pick up the new release notes (slack thread). Do we know why? Is there anything we can fix or document for future understanding?

sxa commented 5 months ago

Release notes seem to repeatedly have problems staging. We should attempt to understand why and resolve it.