AdoptOpenJDK / TSC


Retrospective for October 2020 releases #181

Closed adamfarley closed 3 years ago

adamfarley commented 3 years ago

Topics for the retrospective should include:

aahlenst commented 3 years ago

Why were the nightly builds left running during the release (e.g. JDK15)? Could they be paused during release week until we're sure all of the pipelines are complete?

Isn't mentioned in https://github.com/AdoptOpenJDK/openjdk-build/blob/master/RELEASING.md as far as I can see. So we all forgot about it. We need checklists.

adamfarley commented 3 years ago

Why were the nightly builds left running during the release (e.g. JDK15)? Could they be paused during release week until we're sure all of the pipelines are complete?

Isn't mentioned in https://github.com/AdoptOpenJDK/openjdk-build/blob/master/RELEASING.md as far as I can see. So we all forgot about it. We need checklists.

Agreed. Or automation. Or an automated checklist. Let's discuss during the retrospective meeting.

karianna commented 3 years ago

We should only run the main 3 platforms first for both OpenJ9 and Hotspot then run pipelines for the secondary platforms.

We need to ensure we have enough hardware to cover a full release with weekly tests for all platforms.

sxa commented 3 years ago

Why were the nightly builds left running during the release (e.g. JDK15)? Could they be paused during release week until we're sure all of the pipelines are complete?

Isn't mentioned in https://github.com/AdoptOpenJDK/openjdk-build/blob/master/RELEASING.md as far as I can see. So we all forgot about it. We need checklists.

I switched off *testing* of the nightlies via the default checkboxes in the openjdkxx-pipeline jobs (since that's the bit that's generally disruptive), but it seemed to get re-enabled somehow - maybe by a pipeline regeneration? ~I queried what the best way to do it this time would be in https://adoptopenjdk.slack.com/archives/C09NW3L2J/p1602789550128600~

The entire build can be stopped by adjusting the triggerSchedule in pipelines/jobs/configurations/jdk*.groovy; alternatively, to switch off just the tests, the lines in the jdk*_pipeline_config.groovy files need to be modified so that the test fields are set to false.
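
Roughly speaking (field and variable names here are illustrative rather than verbatim from the repo), the two knobs look something like this:

```groovy
// Hypothetical excerpt from pipelines/jobs/configurations/jdk11u.groovy:
// an empty triggerSchedule stops the nightly pipeline being scheduled at all.
triggerSchedule_nightly = ''        // e.g. 'TZ=UTC\n05 18 * * 1,3,5' when enabled

// Hypothetical excerpt from a jdk*_pipeline_config.groovy build configuration:
// setting the test field to false skips the test jobs while still producing builds.
x64LinuxHotspot = [
        os  : 'linux',
        arch: 'x64',
        test: false                 // previously 'default' or a list of test targets
]
```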

sxa commented 3 years ago

We need to ensure we have enough hardware to cover a full release with weekly tests for all platforms.

If we're going down that route we should implement a formal platform tier proposal (which could lead to interesting discussions, but I'm guessing you're thinking about x86 win/mac/linux as primaries for now?). Playing devil's advocate, is there a specific problem you see that means those should be kicked off first? Obviously the others aren't competing for the same resources (unless we push the OpenJ9 XL ones out of "primary")

sxa commented 3 years ago

Retrospective item: I feel a lot of discussions over the last 18 hours seem to have happened outside the #release channel in Slack. We need to make sure the current status of release-related activity is kept in one place (including initiation of any calls) to make sure we're all up to date and pulling in the same direction.

karianna commented 3 years ago

Despite commenting out the default weekly map there were still instances in the jdk_pipeline_config.groovy files which stacked up weekly tests on platforms that didn't have enough hardware to support the run (e.g. Java 11 aarch64).

The default map is comprehensive and we should use that going forward and simply get our infra support up.

A secondary concern is that we should be more explicit that we're using a default weekly map in the jdk_pipeline_config.groovy files - a naive engineer may get confused on seeing an empty map in most cases.
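
As a sketch of what "being explicit" could look like (the entry and field names here are purely illustrative and may not match the real configuration files):

```groovy
// Hypothetical jdk*_pipeline_config.groovy entry. Rather than a silent empty map,
// a comment records that the shared default weekly test map applies.
aarch64Linux = [
        os          : 'linux',
        arch        : 'aarch64',
        // Empty on purpose: the default weekly test map from the shared pipeline
        // configuration is used for this platform.
        weekly_tests: [:]
]
```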

andrew-m-leonard commented 3 years ago

"Handover situations": If builds go on for several days for whatever reasons, it is not necessarily the case the same person will be handling a given release. We need to make such handovers easier, rather than trying to figure out from numerous slack messages in various channels. A more focussed/managed release checklist with status ? (@smlambert I know you've mentioned this previously)

smlambert commented 3 years ago

re: https://github.com/AdoptOpenJDK/TSC/issues/181#issuecomment-715402339 - yes @andrew-m-leonard, see https://github.com/AdoptOpenJDK/TSC/issues/178 for a WIP checklist that is intended to make it more obvious what has already occurred and by whom.

adamfarley commented 3 years ago

Issue: Job generation doesn't appear to be reliably thread-safe, especially the concurrent test job generation we do at the end of a build.

Evidence: Groovy's struggle to load the same library in multiple concurrent threads (runTests() in openjdk_build_pipeline.groovy), and the non-fatal "No suitable checks publisher found" issue that springs up in many test runs ([Slack thread](https://adoptopenjdk.slack.com/archives/CLCFNV2JG/p1603464619103400)).

Potential solution: If there's a way to launch jobs in a non-blocking way, we could loop over the job-generation step for each test job we want to run after a build (in a single thread), and then "check" for job results in a second loop. Once we have "results" for each test job we generated, the second loop breaks out and we continue.
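
As a rough sketch of that two-loop idea in Jenkins Pipeline Groovy (assuming a Pipeline Build Step plugin recent enough to support waitForStart, and with testTargets / generateTestJob standing in for our real job-generation code):

```groovy
def handles = [:]

// Loop 1: generate and kick off each test job serially, without blocking on it.
testTargets.each { target ->
    generateTestJob(target)                          // hypothetical helper
    handles[target] = build(job: "Test_${target}",   // hypothetical job name
                            wait: false,
                            waitForStart: true)      // returns a handle once the job has started
}

// Loop 2: poll the handles until every launched job has a result.
while (handles.values().any { it.getResult() == null }) {
    sleep(time: 60, unit: 'SECONDS')
}
handles.each { target, run -> echo "${target}: ${run.getResult()}" }
```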

smlambert commented 3 years ago

re: https://github.com/AdoptOpenJDK/TSC/issues/181#issuecomment-715417094 - what are you intending to solve? Is it meant to address the question of how the test jobs being unable to launch somehow didn't cause a build failure?

If so, perhaps some background:

But maybe I misunderstand what your comment is targeting...

smlambert commented 3 years ago

Why the Windows 64bit build here failed after complaining: "warning: failed to remove openj9/test/functional: Directory not empty".

This is a known, long-standing, problematic issue that appears to have triggered the raising of many infra issues in the past, where Jenkins jobs are unable to clean out the previous workspace (or their own at the end of their run) and other jobs fail with AccessDeniedExceptions.

All of these issues relate to the same core issue:

https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1573
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1396
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1527
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1419
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1410
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1394
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1379
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1376
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1339
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1328
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1310
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1086
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/962
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/810
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/784
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/736
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/706
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/477
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/417
https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/23

We should find a way to address the issue with more than the temporary approach of rebooting a machine to clear out old workspaces, as we will continually be plagued by it until a more proactive solution is applied.

adamfarley commented 3 years ago

re: #181 (comment) - what are you intending to solve?

The problems in "Evidence", which could perhaps be renamed to "Symptoms". Now we have two issues that could be traced back to us trying to use concurrency and build generation together. I was spitballing a simplistic way for us to achieve multiple concurrent jobs, while generating them in a serial manner (possibly avoiding the non-thread-safe(?) build generation).

The test jobs failing to run is a symptom. The fact that their failure didn't cause the build to fall over is either a non-issue or a separate issue.

adamfarley commented 3 years ago

re: #181 (comment) - We should find a way to address the issue with more than the temporary approach of rebooting a machine to clear out old workspaces, as we will continually be plagued by it until a more proactive solution is applied.

Seems reasonable. I recall a while back there was a discussion over nuking the workspace at the start of every run, by default. Do you remember why we opted not to?

smlambert commented 3 years ago

Can you find the discussion and indicate what is meant by 'nuking'? There appear to be a great many comments in the open/closed infra issues listed above.

From a test pipeline perspective, the best nuking we can do is to call cleanWs(). We used to do so at the start of each test run.

Then, due to pressure not to take up space on machines, we moved it to the end of every run, https://github.com/AdoptOpenJDK/openjdk-tests/issues/314.

We could call cleanWs() both at the start and end of each run (taking a small hit in added execution minutes), but the core issue is that the cleanWs() call sometimes fails to work when run on Windows machines, no matter when or how frequently you call it.
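
A minimal sketch of that start-and-end cleanup in Scripted Pipeline, assuming the Workspace Cleanup plugin (which provides cleanWs()) is available on the agent:

```groovy
node('test') {                    // 'test' is a placeholder node label
    cleanWs()                     // clear anything left behind by a previous run
    try {
        checkout scm              // fetch sources and run the test targets as usual
        // ... run tests here ...
    } finally {
        cleanWs()                 // tidy up afterwards so we don't hog disk space
    }
}
```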

All of this is perhaps a non-issue if we spin up fresh machines on the fly, but we are not really there (and not sure if that is in our infra goals or not).

sxa commented 3 years ago

Shenandoah was not enabled for the JDK11u release - fix in https://github.com/AdoptOpenJDK/openjdk-build/pull/2177

sxa commented 3 years ago

Seems reasonable. I recall a while back there was a discussion over nuking the workspace at the start of every run, by default. Do you remember why we opted not to?

If the directories are somehow locked in a way that means they cannot be deleted, that won't achieve anything.

We should find a way to address the issue with more than the temporary approach of rebooting a machine to clear out old workspaces, as we will continually be plagued by it until a more proactive solution is applied.

I think @Willsparker has dealt with more of these recently (so may have an idea of how to fix it properly, but let's go into that in a separate infra issue) and has been able to diagnose some of the locked workspaces, but I agree it is probably our most common recurring issue and we need to understand and resolve it, and try to write some automated mitigation going forward.

sxa commented 3 years ago

The apt installers for 8u272 suffered a gap in update time which affects end users - infra#1647

sxa commented 3 years ago

Getting into the realm of solutions here already, but this is something I've been doing in the infrastructure repo and I think we should roll out to at least the build one.

Both of these support the following:

I think this would make the merging process less error-prone and avoid "fire and forget" PRs going in without being verified, which we seem to have had quite a lot of in recent months. I'm loath to add an extra "verify" step to the workflow, but maybe we do need to say that something shouldn't be moved to "Done" until it has been confirmed.

sxa commented 3 years ago

Also I suggest we split out the "promote release" issues into HotSpot and OpenJ9 ones for various reasons:

Willsparker commented 3 years ago

FYI, https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1573 is where I'm looking at the Windows workspace based issues. The main issue is leftover java.exe processes stopping Jenkins from deleting workspaces.
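
For illustration only (not the actual fix), a diagnostic step like the following could be dropped in before the cleanup on Windows agents to trace a locked directory back to the process still holding it:

```groovy
// Hypothetical diagnostic snippet: list any leftover java.exe processes before the
// workspace cleanup, so failures can be correlated with the process holding the lock.
if (!isUnix()) {
    bat 'tasklist /FI "IMAGENAME eq java.exe" /V'
}
cleanWs()    // the cleanup that currently fails when such processes linger
```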

andrew-m-leonard commented 3 years ago

Proposal: "Dry-run" Release How about on the monday before the tuesday release, we do a "dry-run" Release run-through, without the obvious "Publish" at the end ?

adamfarley commented 3 years ago

re: #181 (comment) - Can you find the discussion and indicate what is meant by 'nuking'?

I think I meant just running cleanWs() at the start of each run, though as Stewart says:

re: #181 comment - If the directories are somehow locked in a way that it cannot be deleted, that won't achieve anything.

So perhaps one way forward is to run cleanWs() at the start and end of each run, as Shelley suggests, and to answer every instance of issues like "locked folder" with a fix in cleanWs() that makes it more effective.

adamfarley commented 3 years ago

re: #181 (comment) - Proposal: "Dry-run" release. How about, on the Monday before the Tuesday release, we do a "dry-run" release run-through, without the obvious "Publish" step at the end?

Seems reasonable. We should also aim to cut down build/test/etc repo contributions during the "dry-run & release" period, so we can avoid new issues sneaking in after the dry-run but before the release.

sxa commented 3 years ago

How about, on the Monday before the Tuesday release, we do a "dry-run" release run-through, without the obvious "Publish" step at the end?

We could also do it as soon as we enforce build repo lockdown, which varies but is usually on the Thursday/Friday before release. That way nothing else should be going in. Of course it depends how quickly we think we can fix things if they are faulty :-)

sxa commented 3 years ago

The Docker release of arm32v7 had not appeared by today (10th November) despite being shipped about 14 days ago.

mbien commented 3 years ago

A high-level suggestion from an observer: a possible mitigation for these kinds of issues would be to have Adopt build the RC builds too. OpenJDK has RC builds at least a month before release. If Adopt built them too (as if they were a release), potential issues could be noticed much earlier and likely solved before release. This might cause a boring release week though ;)

sxa commented 3 years ago

@mbien Haha a boring release week sounds like bliss! I think we will end up doing some sort of pre-release trial. A month is possibly a bit too far for us because a lot can happen in the month before GA when we're on a three month release cycle (and it's quite rare that a code issue from openjdk trips us up) :-)

Thanks for the input

sxa commented 3 years ago

Issue with macos packaging missing JREs: https://github.com/AdoptOpenJDK/homebrew-openjdk/issues/495#issuecomment-729771679

sxa commented 3 years ago

Multiple issues relating to the 11.0.9.1+1 version string which we had to address both in the build repository and the API:

sxa commented 3 years ago

Summary of everything above (a.k.a. an easy-to-use agenda for the meeting to be held on Monday at 1400 GMT/UTC). The initials of the person who raised each item in the conversations above are in [].

One-off things (likely don't need much discussion)

Issues:

Questions:

References:

adamfarley commented 3 years ago

Meeting Results:

(Note: See the next comment for a concise list of Actions)

One-off things (likely don't need much discussion)

Questions:

References:

Releasing document in the build repository
WIP release checklist document based on the releasing doc

adamfarley commented 3 years ago

Actions list:

Adam Farley:

George & Stewart

George Adams

Stewart Addison

Andrew Leonard

adamfarley commented 3 years ago

Since the actions for this will be chased independently by their respective owners, this issue will now be closed.

Thank you everyone for participating.

sxa commented 3 years ago

@adamfarley I feel this probably shouldn't be closed until we have issues covering them, otherwise the work looks complete, as there are no outstanding issues for several of these items with owners.

adamfarley commented 3 years ago

Will reopen if you think that will encourage folks to follow up.

My thought was that it'd be easier to close this and simply copy the actions into January's retrospective issue, reviewing the results then.

I think the right way forward is to reopen it as you suggest, and to copy the actions once people have had a chance to update them with links to their issues.

adamfarley commented 3 years ago

Note: Any unresolved actions have been folded into the next retrospective for review. Link.

If any have been unintentionally missed, feel free to add them.