Closed tjwatson closed 9 months ago
POC Review recording: https://ibm.box.com/s/7ok2a4v7rfqb6lm6zwp3ne6prlm8t2tj Comments:
I've opened the following story under this feature for the EJB Container app start settings that should be considered for CRIU, as mentioned above from the POC Review.
CRIU : EJBContainer settings for deferred EJB initialization and bean pool preloading
POC Review - Part 2 - recording: https://ibm.box.com/s/gwfjl8mkv14bh6uyg6kmk3dw5v7e0l0z Comments
POC Review - Part 3 - recording: https://ibm.box.com/s/zq1ks3la1evzvpqjv803d9x0xo8n5gcl No issues noted. There will be a part 4.
POC Review - Part 4 - Comments
- Monitoring - update regarding usage metering (defer until after restore)? -> needs to be somewhere in the UFO; validate monitoring is working correctly/lazily and doesn't inhibit first request performance on restore
- Beta - Elevated privileges - Needs a follow-up offline discussion
- Automated Testing - In-Container: Consider using MicroShed
- Platform / Cloud Considerations - want checkpoint to be integrated into docker build flow (currently must be during docker run due to need for elevated privileges that docker build won't have) https://github.com/CRaC/docs#quarkus ; s2i (RH) & cloud native buildpacks (Pivotal) need to be considered
- A11y - needs consideration due to command line additions
- The server script should handle finding a saved image but not finding the CRIU package, failing gracefully
This is already handled by the server script. The following is currently displayed:
bin/server: 1328: criu: not found
CWWKE0957I: Restoring the checkpoint server process failed. Check the .../logs/checkpoint/restore.log log to determine why the checkpoint process was not restored. Launching the server without using the checkpoint image.
Launching <server name> (Open Liberty 22.0.0.4/wlp-1.0.63.202203290952) on Eclipse OpenJ9 VM, version 11.0.15-internal+0-adhoc.jenkins.BuildJDK11x86-64linuxcriuPersonal (en_US)
[AUDIT ] CWWKE0001I: The server <server-name> has been launched.
...
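The fallback shown in the output above (try the restore, and launch normally if it fails) can be modelled roughly as follows. This is only an illustrative Java sketch, not the actual `bin/server` script logic; the class, enum, and method names are hypothetical, and the restore command is passed in as a stand-in that returns an exit code.

```java
import java.util.function.IntSupplier;

// Illustrative model only: all names here are hypothetical, not Liberty code.
class RestoreFallback {
    enum LaunchMode { RESTORED, NORMAL_LAUNCH }

    /**
     * Tries to restore from a checkpoint image when one exists.
     * Any non-zero exit code or exception from the restore command
     * results in a normal server launch instead of a hard failure.
     */
    static LaunchMode start(boolean checkpointImageExists, IntSupplier restoreCommand) {
        if (checkpointImageExists) {
            int exitCode;
            try {
                exitCode = restoreCommand.getAsInt();
            } catch (RuntimeException e) {
                // Assumption: treat a missing restore tool like "command not found".
                exitCode = 127;
            }
            if (exitCode == 0) {
                return LaunchMode.RESTORED;
            }
            System.out.println("Restore failed (exit code " + exitCode
                    + "); launching without the checkpoint image.");
        }
        return LaunchMode.NORMAL_LAUNCH;
    }

    public static void main(String[] args) {
        // Simulate a saved image with no usable restore tool installed.
        LaunchMode mode = start(true, () -> 127);
        System.out.println("Launch mode: " + mode);
    }
}
```

The point of the sketch is that a missing or failing restore tool is treated as a recoverable condition, which matches the "failing gracefully" request in the review comment.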
- Should the package command include a checkpoint image? Is the image even portable?
Opened issue #20649 to track
POC Review recording: https://ibm.box.com/s/7ok2a4v7rfqb6lm6zwp3ne6prlm8t2tj Comments:
- Work with Tracy Burroughs about how the EJB feature startup might change the defaults to behave better in a restored environment, although there was concern that these changes might cause EJB applications to fail to start
Tracy has opened an issue to track: #20368
- be more clear that this is a Linux only thing
UFO updated
- Could some features defer starting until after the restore (or maybe defer part of it) for features that do things at startup that would need to be done even after a checkpoint-restore
Opened issue #20653 to track
- There was a discussion at the end about use for development vs. production. If we've got a decision about that, maybe be clear what it is :-)
I updated the UFO to make it clear this is not intended to improve the developer experience.
- Further thought on what happens if an image restarts in the same pod (reused, not copied from the docker image every time).
The workarea is now backed up even in a container, as of commit 864a402daca097f00f18a4c32b40a01faadbb251. This allows a container that launches a checkpoint process to be restarted and launch the checkpoint process again.
- Could some features defer starting until after the restore (or maybe defer part of it) for features that do things at startup that would need to be done even after a checkpoint-restore
Opened another issue #20660 for conditionally enabling components once the process is restored.
- Should we detect and report that we are ignoring differences in things like heap size between the checkpoint image and the current environment?
We have issue #18543 open to track that. Nothing was decided on the UFO review on how to handle that. Options:
The issue with 1 is that it takes extra work on restore to determine something changed when the common scenario is nothing changed.
The issue with 2 is that there is no notification mechanism for the things we can reify at restore time (e.g. system properties). So nothing will know they updated. Perhaps similar to system env which we do allow to change on restore.
3 is the simplest, but requires good documentation on the limitations and may confuse users who do not know about the limitation.
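To make the "detect and report" question concrete, the comparison could look roughly like the sketch below. It assumes the settings of interest were recorded as simple key/value pairs at checkpoint time; the class name, method name, and setting keys are hypothetical and are not taken from issue #18543 or from Liberty.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: compares settings recorded at checkpoint with those seen at restore.
class SettingsDrift {
    /**
     * Returns an entry for every setting that was recorded at checkpoint time
     * and has a different value in the restore environment.
     */
    static Map<String, String> diff(Map<String, String> atCheckpoint,
                                    Map<String, String> atRestore) {
        Map<String, String> changes = new TreeMap<>();
        for (Map.Entry<String, String> saved : atCheckpoint.entrySet()) {
            String current = atRestore.get(saved.getKey());
            if (current != null && !current.equals(saved.getValue())) {
                changes.put(saved.getKey(), saved.getValue() + " -> " + current);
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        // Example values are made up for illustration.
        Map<String, String> saved = Map.of("maxHeap", "512m", "locale", "en_US");
        Map<String, String> requested = Map.of("maxHeap", "1g", "locale", "en_US");
        diff(saved, requested).forEach((key, change) ->
                System.out.println("WARNING: " + key + " differs from the checkpoint image: " + change));
    }
}
```

Even a comparison this small has to run on every restore, which is the cost raised above for the common case where nothing changed.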
UFO Review May 15, 2023
Need to schedule a follow-on to finish up the rest of the UFO.
Serviceability Approval Comment - Please answer the following questions for serviceability approval:
UFO -- does the UFO identify the most likely problems customers will see and identify how the feature will enable them to diagnose and solve those problems without resorting to raising a PMR? Have these issues been addressed in the implementation? - Answer: The most likely customer issues are described in the UFO under the Serviceability heading starting on page 104.
Test and Demo -- As part of the serviceability process we're asking feature teams to test and analyze common problem paths for serviceability and demo those problem paths to someone not involved in the development of the feature (eg. L2, test team, or another development team).
a) What problem paths were tested and demonstrated?
b) Who did you demo to?
c) Do the people you demo'd to agree that the serviceability of the demonstrated problem scenarios is sufficient to avoid PMRs for any problems customers are likely to encounter, or that L2 should be able to quickly address those problems without need to engage L3? Answer: we have discussed with L2 (jjurece@us.ibm.com, vikramt@us.ibm.com). A demo and skill transfer to be scheduled for the 3rd week in June
SVT -- SVT team is often the first team to try new features and often encounters problems setting up and using them. Note that we're not expecting SVT to do full serviceability testing -- just to sign-off on the serviceability of the problem paths they encountered.
a) Who conducted SVT tests for this feature? Answer: Tam Dinh
b) Do they agree that the serviceability of the problems they encountered is sufficient to avoid PMRs, or that L2 should be able to quickly address those problems without need to engage L3? Answer: yes we agree based on SVT early involvement in beta tests
Which L2 / L3 queues will handle PMRs for this feature? Ensure they are present in the contact reference file and in the queue contact summary, and that the respective L2/L3 teams know they are supporting it. Ask Don Bourne if you need links or more info. Answer: The L3 team will be was-squad-osgi, the L2 queue will be WAS L2: ADM
Does this feature add any new metrics or emit any new JSON events? If yes, have you updated the JMX metrics reference list / Metrics reference list / JSON log events reference list in the Open Liberty docs? Answer: No new metrics or JSON events.
UFO Review May 24, 2023
Left off at chart 72 (hooks for Transactions)...to be continued.
David is completing the InstantOn topic at https://github.com/OpenLiberty/docs/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc . Approving this feature.
Approve proceeding to gold driver without FAT, SVT or globalization approval.
It is amazing that the SVT team (@tam512, @Jonathan-Maciel, @bconey) was able to finish all the testing planned for the InstantOn feature in such a short time. Thanks for putting in all the extra effort and weekend work.
Testing was done on Open Liberty and WebSphere Liberty full and kernel images with Java 11 and Java 17. We tested checkpoint and restore on 3 Fyre VMs (AMD RHEL 9.0, AMD Ubuntu 22.04 and Intel Ubuntu 22.04). We deployed checkpoint application images on Amazon Elastic Kubernetes Service (EKS) and Azure Kubernetes Service (AKS) using the Open Liberty operator and the WebSphere Liberty operator. We also tested Knative and ran a 24-hour stress test.
I’ve removed the FAT approval from the feature after some internal discussions. We can discuss exact thresholds for granting approval again later, but for now Tom and his team have the action of getting tests up and running in more of our automated daily and personal builds. There should be a subset of existing SOE platforms that meet all of the requirements for InstantOn to be runnable (Linux, x86-64, Semeru, etc.), and Tom and company will likely be needing help to make the appropriate updates/installations to those environments for what they need. Once those tasks are done, we’ll have coverage in the main automated builds as well as the SOE. Beyond that, I’m not aware of anything that would prevent approval being re-granted.
any update on the outstanding FAT work?
Test machines are coming on line to run the InstantOn FATs in the SOE this weekend. If all goes well then I think we should get approval.
Any update on the outstanding FAT work and closing out this feature?
Still carrying this as technical debt. Is there a target for completion?
There have been some recent meetings regarding this, and ownership of the work has been assigned. However, I'm not aware of a target completion for it.
Do we have an update on target completion?
This work item has been long delivered and made publicly available, and as such I'm closing it.
Description
Use CRIU as the mechanism to achieve InstantOn startup in Liberty running on Linux. This feature will look to expand on the experiments done in the blog https://openliberty.io/blog/2020/02/12/faster-startup-Java-applications-criu.html
To achieve InstantOn, Open Liberty will work with the Open J9 team to define Open J9 APIs that make native Linux calls to the CRIU support available on Linux. Liberty will provide an SPI that allows features to hook into the checkpoint operation in order to prepare the system for a checkpoint and to hook into the restore operation to fix up the system (if necessary) when restoring the server from a checkpoint.
Collaboration with Open J9 will be necessary to make sure Open J9 provides the necessary APIs required by Liberty to do a checkpoint and a restore operation. Some things in the system will require Open J9 to fix them up to allow for proper behavior after restore. Examples include providing access to the current environment values when restoring and making sure objects like SecureRandom and Timers behave properly on restore.
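The prepare/restore hook idea described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `Hook` interface, `HookRegistry`, `TokenService`, and the `REGION` variable are hypothetical and are not the Liberty SPI or the Open J9 API. The description expects Open J9 itself to fix up objects like SecureRandom; the sketch handles it at the component level only to show what a restore-time fix-up means.

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a checkpoint/restore hook mechanism; not the Liberty SPI.
class CheckpointHooksSketch {
    /** A component implements this to take part in checkpoint and restore. */
    interface Hook {
        default void prepare() {}                                // called before the checkpoint
        default void restore(Map<String, String> currentEnv) {}  // called after the restore
    }

    /** Keeps the registered hooks and calls them in registration order. */
    static final class HookRegistry {
        private final List<Hook> hooks = new ArrayList<>();

        void register(Hook hook) { hooks.add(hook); }

        void runPrepare() {
            for (Hook hook : hooks) hook.prepare();
        }

        void runRestore(Map<String, String> currentEnv) {
            for (Hook hook : hooks) hook.restore(currentEnv);
        }
    }

    /** Example component holding state that should not survive a checkpoint unchanged. */
    static final class TokenService implements Hook {
        private SecureRandom random = new SecureRandom();
        private String region = "unset";

        @Override
        public void prepare() {
            random = null; // do not carry the random generator's state into the image
        }

        @Override
        public void restore(Map<String, String> currentEnv) {
            random = new SecureRandom();                         // fresh generator per restored process
            region = currentEnv.getOrDefault("REGION", "unset"); // re-read the environment at restore
        }

        boolean ready() { return random != null; }
        String region() { return region; }
        int nextToken() { return random.nextInt(); }
    }

    public static void main(String[] args) {
        HookRegistry registry = new HookRegistry();
        TokenService service = new TokenService();
        registry.register(service);

        registry.runPrepare();                 // just before the checkpoint would be taken
        // ... the process image would be saved and later restored here ...
        registry.runRestore(System.getenv());  // fix up using the restore-time environment

        System.out.println("ready=" + service.ready() + ", region=" + service.region());
    }
}
```

Passing the environment into `restore` reflects the requirement that restored code sees the current environment values and not the ones captured at checkpoint time.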
Documents
When available, add links to required feature documents. Use "N/A" to mark particular documents which are not required by the feature.
Aha: Externally raised RFE ([Aha]())
UFO: Checkpoint feature UFO
FTS: InstantOn Feature Test Summary
Beta Blog: Link to Beta Blog Post GH Issue
GA Blog: Link to GA Blog Post GH Issue
Process Overview
Prioritization
Design
Implementation
Legal and Translation
Beta
GA
Other Deliverables
General Instructions
The process steps occur roughly in the order as presented. Process steps occasionally overlap.
Each process step has a number of tasks which must be completed or must be marked as not applicable ("N/A").
Unless otherwise indicated, the tasks are the responsibility of the Feature Owner or a Delegate of the Feature Owner.
If you need assistance, reach out to the OpenLiberty/release-architect.
Important: Labels are used to trigger particular steps and must be added as indicated.
Prioritization (Complete Before Development Starts)
The (OpenLiberty/chief-architect) and area leads are responsible for prioritizing the features and determining which features are being actively worked on.
Prioritization
[x] Feature added to the "New" column of the Open Liberty project board
[x] Priority assigned
Design (Complete Before Development Starts)
Design preliminaries determine whether a formal design, which will be provided by an Upcoming Feature Overview (UFO) document, must be created and reviewed. A formal design is required if the feature requires any of the following: UI, Serviceability, SVT, Performance testing, or non-trivial documentation/ID.
Design Preliminaries
ID Required, if non-trivial documentation needs to be created by the ID team.
ID Required - Trivial, if no design will be performed and only trivial ID updates are needed.
Design
Design Review Request
Design Approved
No Design
No Design Approval Request
No Design Approved
FAT Documentation
[x] "Feature Test Summary" child task created
Implementation
A feature must be prioritized before any implementation work may begin to be delivered (inaccessible/no-ship). However, a design focused approach should still be applied to features, and developers should think about the feature design prior to writing and delivering any code.
Besides being prioritized, a feature must also be socialized (or No Design Approved) before any beta code may be delivered. All new Liberty content must be inaccessible in our GA releases until it is Feature Complete by either marking it kind=noship or beta fencing it.
Code may not GA until this feature has obtained the "Design Approved" or "No Design Approved" label, along with all other tasks outlined in the GA section.
Feature Development Begins
In Progress label
Legal and Translation
In order to avoid last minute blockers and significant disruptions to the feature, the legal items need to be done as early in the feature process as possible, either in design or as early into the development as possible. Similarly, translation is to be done concurrently with development. Both MUST be completed before Beta or GA is requested.
Legal (Complete before Feature Complete Date)
Translation (Complete 1 week before Feature Complete Date)
Innovation (Complete 1 week before Feature Complete Date)
[x] Consider whether any aspects of the feature may be patentable. If any identified, disclosures have been submitted.
Beta
In order to facilitate early feedback from users, all new features and functionality should first be released as part of a beta release.
Beta Code
kind=beta, ibm:beta, ProductInfo.getBetaEdition()
target:beta and the appropriate target:YY00X-beta (where YY00X is the targeted beta version).
release:YY00X-beta (where YY00X is the first beta version that included the functionality).
Beta Blog (Complete 1.5 weeks before beta eGA)
[x] Beta blog issue created and populated using the Open Liberty BETA blog post template.
GA
A feature is ready to GA after it is Feature Complete and has obtained all necessary Focal Point Approvals.
Feature Complete
target:ga and the appropriate target:YY00X (where YY00X is the targeted GA version).
Focal Point Approvals (Complete by Feature Complete Date)
These occur only after GA of this feature is requested (by adding a target:ga label). GA of this feature may not occur until all approvals are obtained.
All Features
focalApproved:externals
@OpenLiberty/demo-approvers Demo scheduled for EOI [Iteration Number] to this issue.
focalApproved:demo
focalApproved:fat
focalApproved:globalization
Design Approved Features
focalApproved:accessibility
focalApproved:id
focalApproved:performance
focalApproved:sve
focalApproved:ste
focalApproved:svt
Remove Beta Fencing (Complete by Feature Complete Date)
GA Blog (Complete by Feature Complete Date)
Post GA
[x] Replace target:YY00X label with the appropriate release:YY00X. (OpenLiberty/release-manager)
Other Deliverables
[x] Standalone Feature Blog Post A blog post specifically about your feature or N/A. (OpenLiberty/release-architect)
[ ] OL Guides OL Guides assessment is complete or N/A. (OpenLiberty/guide-assessment)
[x] Dev Experience Developer Experience & Tools work is complete or N/A. (OpenLiberty/dev-experience-assessment)