Liberty checkpoint feature - InstantOn restore server process

tjwatson commented 3 years ago

Description

Use CRIU as the mechanism to achieve InstantOn startup in Liberty running on Linux. This feature will look to expand on the experiments done in the blog https://openliberty.io/blog/2020/02/12/faster-startup-Java-applications-criu.html

To achieve InstantOn, Open Liberty will work with the Open J9 team to define an Open J9 API that makes native Linux calls to the CRIU support available on Linux. Liberty will provide an SPI that allows features to hook into the checkpoint operation in order to prepare the system for a checkpoint, and to hook into the restore operation to fix up the system (if necessary) when restoring the server from a checkpoint.

Collaboration with Open J9 will be necessary to make sure Open J9 provides the APIs that Liberty requires to do a checkpoint and a restore operation. Some things in the system will require Open J9 to fix them up to allow for proper behavior after restore: for example, providing access to the current environment values when restoring, and making sure objects like SecureRandom and Timers behave properly on restore.
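As a rough illustration of the kind of hook SPI described above, the following Java sketch shows hooks with a prepare step and a restore step. All names (SketchCheckpointHook, SketchHookRegistry, SketchRandomHolder) are hypothetical and are not the actual Liberty SPI or Open J9 API. Running the restore hooks in reverse registration order is only a choice made for this sketch.

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;

// Hypothetical hook contract for illustration; not the actual Liberty SPI.
interface SketchCheckpointHook {
    // Called before the process image is saved.
    default void prepare() {}

    // Called after the process has been restored from the image.
    default void restore() {}
}

// Hypothetical registry that drives the hooks around a checkpoint.
final class SketchHookRegistry {
    private final List<SketchCheckpointHook> hooks = new ArrayList<>();

    void register(SketchCheckpointHook hook) {
        hooks.add(hook);
    }

    // Runs prepare hooks in registration order.
    void prepareAll() {
        for (SketchCheckpointHook hook : hooks) {
            hook.prepare();
        }
    }

    // Runs restore hooks in reverse registration order (an assumption of this sketch).
    void restoreAll() {
        for (int i = hooks.size() - 1; i >= 0; i--) {
            hooks.get(i).restore();
        }
    }
}

// Example hook: drops its SecureRandom before the checkpoint and creates a
// fresh one on restore, so processes restored from the same image do not
// share generator state.
final class SketchRandomHolder implements SketchCheckpointHook {
    private volatile SecureRandom random = new SecureRandom();

    SecureRandom get() {
        return random;
    }

    @Override
    public void prepare() {
        random = null;
    }

    @Override
    public void restore() {
        random = new SecureRandom();
    }
}
```

A feature would register its hook once at startup; the runtime would then call prepareAll() just before the checkpoint and restoreAll() right after the restore.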

Documents

When available, add links to required feature documents. Use "N/A" to mark particular documents which are not required by the feature.

Aha: Externally raised RFE ([Aha]())
UFO: Checkpoint feature UFO
FTS: InstantOn Feature Test Summary
Beta Blog: Link to Beta Blog Post GH Issue
GA Blog: Link to GA Blog Post GH Issue
Process Overview
Prioritization
Design
Implementation
Legal and Translation
Beta
GA
- Focal Point Approvals
Other Deliverables

General Instructions

The process steps occur roughly in the order as presented. Process steps occasionally overlap.

Each process step has a number of tasks which must be completed or must be marked as not applicable ("N/A").

Unless otherwise indicated, the tasks are the responsibility of the Feature Owner or a Delegate of the Feature Owner.

If you need assistance, reach out to the OpenLiberty/release-architect.

Important: Labels are used to trigger particular steps and must be added as indicated.

Prioritization (Complete Before Development Starts)

The (OpenLiberty/chief-architect) and area leads are responsible for prioritizing the features and determining which features are being actively worked on.

Prioritization

[x] Feature added to the "New" column of the Open Liberty project board
- Epics can be added to the board in one of two ways:
- From this issue, use the "Projects" section to select the appropriate project board.
- From the appropriate project board click "Add card" and select your Feature Epic issue
[x] Priority assigned
- Attend the Liberty Backlog Prioritization meeting
Design (Complete Before Development Starts)

Design preliminaries determine whether a formal design, which will be provided by an Upcoming Feature Overview (UFO) document, must be created and reviewed. A formal design is required if the feature requires any of the following: UI, Serviceability, SVT, Performance testing, or non-trivial documentation/ID.

Design Preliminaries

[x] UI requirements identified. (Owner and UI focal point)
[x] ID requirements identified. (Owner and ID focal point)
- Refer to Documenting Open Liberty.
- Feature Owner adds label ID Required, if non-trivial documentation needs to be created by the ID team.
- ID adds label ID Required - Trivial, if no design will be performed and only trivial ID updates are needed.
[x] Serviceability Requirements Identified. (Owner and Serviceability focal point)
[x] SVT Requirements Identified. (Owner and SVT focal point)
[x] Performance testing requirements identified. (Owner and Performance focal point)

Design

[x] POC Design / UFO review requested.
- Owner adds label Design Review Request
[x] POC Design / UFO review scheduled.
- Follow the instructions in POC-Forum repo
[x] POC Design / UFO review completed.
- UFO review recording: part 1
- UFO review recording: part 2
- UFO review recording: part 3
- UFO review recording: part 4
[x] POC / UFO Review follow-ons completed.
[x] Design / UFO approved. (OpenLiberty/chief-architect) or N/A
- (OpenLiberty/chief-architect) adds label Design Approved
- Add the public link to the UFO in Box to the Documents section.
- The UFO must always accurately reflect the final implementation of the feature. Any changes must be first approved. Afterwards, update the UFO by creating a copy of the original approved slide(s) at the end of the deck and prepend "OLD" to the title(s). A single updated copy of the slide(s) should take the original's place, and have its title(s) prepended with "UPDATED".

No Design

[ ] No Design requested.
- Owner adds label No Design Approval Request
[ ] No Design / No UFO approved. (OpenLiberty/chief-architect) or N/A
- Approver adds label No Design Approved

FAT Documentation

[x] "Feature Test Summary" child task created
- Use the Feature Test Summary Template
- Add FTS issue link to the Documents section.
Implementation

A feature must be prioritized before any implementation work may begin to be delivered (inaccessible/no-ship). However, a design focused approach should still be applied to features, and developers should think about the feature design prior to writing and delivering any code.
Besides being prioritized, a feature must also be socialized (or No Design Approved) before any beta code may be delivered. All new Liberty content must be inaccessible in our GA releases until it is Feature Complete by either marking it kind=noship or beta fencing it.
Code may not GA until this feature has obtained the "Design Approved" or "No Design Approved" label, along with all other tasks outlined in the GA section.

Feature Development Begins

[x] Add the In Progress label

Legal and Translation

In order to avoid last minute blockers and significant disruptions to the feature, the legal items need to be done as early in the feature process as possible, either in design or as early into the development as possible. Similarly, translation is to be done concurrently with development. Both MUST be completed before Beta or GA is requested.

Legal (Complete before Feature Complete Date)

[x] Changed or new open source libraries are cleared and approved, or N/A. (Legal Release Services/Cass Tucker/Release PM).
[x] Licenses and Certificates of Originality (COOs) are updated, or N/A.

Translation (Complete 1 week before Feature Complete Date)

[x] PII updates are merged, or N/A. Note timing with translation shipments.

Innovation (Complete 1 week before Feature Complete Date)

[x] Consider whether any aspects of the feature may be patentable. If any identified, disclosures have been submitted.
Beta

In order to facilitate early feedback from users, all new features and functionality should first be released as part of a beta release.

Beta Code

[x] Beta fence the functionality
- kind=beta, ibm:beta, ProductInfo.getBetaEdition()
[x] Beta development complete and feature ready for inclusion in a beta release
- Add label target:beta and the appropriate target:YY00X-beta (where YY00X is the targeted beta version).
[x] Feature delivered into beta
- (OpenLiberty/release-manager) adds label release:YY00X-beta (where YY00X is the first beta version that included the functionality).

Beta Blog (Complete 1.5 weeks before beta eGA)

[x] Beta blog issue created and populated using the Open Liberty BETA blog post template.
- Add a link to the beta blog issue in the Documents section.
- Note: This is for inclusion into the overall beta release blog post. If, in addition, you'd also like to create a dedicated blog post about your feature, then follow the "Standalone Feature Blog Post" instructions under the Other Deliverables section.
GA

A feature is ready to GA after it is Feature Complete and has obtained all necessary Focal Point Approvals.

Feature Complete

[x] Feature implementation and tests completed.
- [x] All PRs are merged.
- [x] All epic and child issues are closed.
- [x] All stop ship issues are completed.
[x] Legal: all necessary approvals granted.
[x] Translation: All messages translated or sent for translation for upcoming release
[x] GA development complete and feature ready for inclusion in a GA release
- Add label target:ga and the appropriate target:YY00X (where YY00X is the targeted GA version).
- Inclusion in a release requires the completion of all Focal Point Approvals.

Focal Point Approvals (Complete by Feature Complete Date)

These occur only after GA of this feature is requested (by adding a target:ga label). GA of this feature may not occur until all approvals are obtained.

All Features

[x] APIs/Externals Externals have been reviewed or N/A. (OpenLiberty/externals-approvers)
- Approver adds label focalApproved:externals
[x] Demo Demo is scheduled for an upcoming EOI or N/A. (OpenLiberty/demo-approvers)
- Add comment @OpenLiberty/demo-approvers Demo scheduled for EOI [Iteration Number] to this issue.
- Approver adds label focalApproved:demo.
[ ] FAT All Tests complete and running successfully in SOE or N/A. (OpenLiberty/fat-approvers)
- Approver adds label focalApproved:fat.
[x] Globalization Translation and TVT are complete or N/A. (OpenLiberty/globalization-approvers)
- Approver adds label focalApproved:globalization.

Design Approved Features

[x] Accessibility Accessibility testing completed or N/A. (OpenLiberty/accessibility-approvers)
- Approver adds label focalApproved:accessibility.
[x] ID Documentation is complete or N/A. (OpenLiberty/id-approvers)
- Approver adds label focalApproved:id.
- NOTE: If only trivial documentation changes are required, you may reach out to the ID Feature Focal to request an ID Required - Trivial label. Unlike features with a regular ID requirement, those with the ID Required - Trivial label do not have a hard requirement for a Design/UFO.
[x] Performance Performance testing is complete or N/A. (OpenLiberty/performance-approvers)
- Approver adds label focalApproved:performance.
[x] Serviceability Serviceability has been addressed or N/A. (OpenLiberty/serviceability-approvers)
- Approver adds label focalApproved:sve.
[x] STE Skills Transfer Education chart deck is complete or N/A. (OpenLiberty/ste-approvers)
- Approver adds label focalApproved:ste.
[x] SVT System Verification Test is complete or N/A. (OpenLiberty/svt-approvers)
- Approver adds label focalApproved:svt.

Remove Beta Fencing (Complete by Feature Complete Date)

[x] Beta guards are removed, or N/A
- Only after all necessary Focal Point Approvals have been granted.

GA Blog (Complete by Feature Complete Date)

[x] GA Blog issue created and populated using the Open Liberty GA release blog post template.
- Add a link to the GA Blog issue in the Documents section.

Post GA

[x] Replace target:YY00X label with the appropriate release:YY00X. (OpenLiberty/release-manager)
Other Deliverables
[x] Standalone Feature Blog Post A blog post specifically about your feature or N/A. (OpenLiberty/release-architect)
- This should be strongly considered for larger or more prominent features.
- Follow instructions in the blogs repo.
[ ] OL Guides OL Guides assessment is complete or N/A. (OpenLiberty/guide-assessment)
[x] Dev Experience Developer Experience & Tools work is complete or N/A. (OpenLiberty/dev-experience-assessment)

follis commented 2 years ago

POC Review recording: https://ibm.box.com/s/7ok2a4v7rfqb6lm6zwp3ne6prlm8t2tj Comments:

Work with Tracy Burroughs about how the EJB feature startup might change the defaults to behave better in a restored environment, although there was concern that these changes might cause EJB applications to fail to start
be more clear that this is a Linux only thing
Could some features defer starting until after the restore (or maybe defer part of it) for features that do things at startup that would need to be done even after a checkpoint-restore
There was a discussion at the end about use for development vs. production. If we've got a decision about that, maybe be clear what it is :-)

tkburroughs commented 2 years ago

I've opened the following story under this feature for the EJB Container app start settings that should be considered for CRIU, as mentioned above from the POC Review.

CRIU : EJBContainer settings for deferred EJB initialization and bean pool preloading

follis commented 2 years ago

POC Review - Part 2 - recording: https://ibm.box.com/s/gwfjl8mkv14bh6uyg6kmk3dw5v7e0l0z Comments

The server script should handle finding a saved image but not finding the CRIU package, failing gracefully
Should the package command include a checkpoint image? Is the image even portable?
Should we detect and report that we are ignoring differences in things like heap size between the checkpoint image and the current environment?
Further thought on what happens if an image restarts in the same pod (reused, not copied from the docker image every time).
A thing from Ian about understanding more about how this works on core OS. Not sure I captured that properly..

follis commented 2 years ago

POC Review - Part 3 - recording: https://ibm.box.com/s/zq1ks3la1evzvpqjv803d9x0xo8n5gcl No issues noted.. There will be a part 4.

mbroz2 commented 2 years ago

POC Review - Part 4 - Comments

Monitoring - update regarding usage metering (defer until after restore)? -> needs to be somewhere in the UFO
Validate that monitoring is working correctly/lazily and doesn't inhibit first request performance on restore
Beta - Elevated privileges - needs a follow-up offline discussion
Automated Testing - In-Container: consider using MicroShed
Platform / Cloud Considerations - want checkpoint to be integrated into the docker build flow (currently must be during docker run due to the need for elevated privileges that docker build won't have)
https://github.com/CRaC/docs#quarkus
s2i (RH) & cloud native buildpacks (Pivotal) need to be considered
A11y - needs consideration due to command line additions

tjwatson commented 2 years ago

The server script should handle finding a saved image but not finding the CRIU package, failing gracefully

This is already handled by the server script. The following is currently displayed:

bin/server: 1328: criu: not found
CWWKE0957I: Restoring the checkpoint server process failed. Check the .../logs/checkpoint/restore.log log to determine why the checkpoint process was not restored. Launching the server without using the checkpoint image.
Launching <server name> (Open Liberty 22.0.0.4/wlp-1.0.63.202203290952) on Eclipse OpenJ9 VM, version 11.0.15-internal+0-adhoc.jenkins.BuildJDK11x86-64linuxcriuPersonal (en_US)
[AUDIT   ] CWWKE0001I: The server <server-name> has been launched.
...
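The fallback shown in the output above can be sketched as follows. The real logic lives in the bin/server script; this Java version only illustrates the pattern of trying a restore and launching normally when the restore cannot run. SketchRestoreLauncher, its Outcome enum and the command passed in are hypothetical, and the sketch does not model how criu is actually invoked.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical illustration of "try restore, else launch normally".
final class SketchRestoreLauncher {
    enum Outcome { RESTORED, LAUNCHED_NORMALLY }

    // Runs the given restore command. If the command cannot be started
    // (for example, the binary is not installed) or exits non-zero,
    // the normal launch path is taken instead.
    static Outcome restoreOrLaunch(List<String> restoreCommand, Runnable normalLaunch) {
        try {
            Process process = new ProcessBuilder(restoreCommand).inheritIO().start();
            if (process.waitFor() == 0) {
                return Outcome.RESTORED;
            }
        } catch (IOException e) {
            // The restore command could not be started; fall through to a normal launch.
        } catch (InterruptedException e) {
            // Preserve the interrupt status, then fall through to a normal launch.
            Thread.currentThread().interrupt();
        }
        normalLaunch.run();
        return Outcome.LAUNCHED_NORMALLY;
    }
}
```

Passing the command in as a parameter keeps the sketch independent of any particular restore tool or its options.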

tjwatson commented 2 years ago

Should the package command include a checkpoint image? Is the image even portable?

Opened issue #20649 to track

tjwatson commented 2 years ago

POC Review recording: https://ibm.box.com/s/7ok2a4v7rfqb6lm6zwp3ne6prlm8t2tj Comments:

Work with Tracy Burroughs about how the EJB feature startup might change the defaults to behave better in a restored environment, although there was concern that these changes might cause EJB applications to fail to start

Tracy has opened an issue to track: #20368

be more clear that this is a Linux only thing

UFO updated

Could some features defer starting until after the restore (or maybe defer part of it) for features that do things at startup that would need to be done even after a checkpoint-restore

Opened issue #20653 to track

There was a discussion at the end about use for development vs. production. If we've got a decision about that, maybe be clear what it is :-)

I updated the UFO to make it clear this is not intended to improve the developer experience.

tjwatson commented 2 years ago

Further thought on what happens if an image restarts in the same pod (reused, not copied from the docker image every time).

The workarea is now backed up even in a container, as of commit 864a402daca097f00f18a4c32b40a01faadbb251. This allows a container that launches a checkpoint process to be restarted and to launch the checkpoint process again.

tjwatson commented 2 years ago

Could some features defer starting until after the restore (or maybe defer part of it) for features that do things at startup that would need to be done even after a checkpoint-restore

Opened another issue #20660 for conditionally enabling components once the process is restored.
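One way to picture deferring work until after restore is a small gate that either runs a task immediately or queues it. This is only a sketch of the idea, not the design tracked in the issues above; SketchDeferredStarter and its methods are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical helper: runs a task immediately in a normal launch, or
// holds it until the process has been restored when checkpointing.
final class SketchDeferredStarter {
    private final boolean checkpointing;
    private final Queue<Runnable> deferred = new ArrayDeque<>();
    private boolean restored;

    SketchDeferredStarter(boolean checkpointing) {
        this.checkpointing = checkpointing;
    }

    // Queues the task if a checkpoint is still pending; otherwise runs it now.
    synchronized void startOrDefer(Runnable task) {
        if (checkpointing && !restored) {
            deferred.add(task);
        } else {
            task.run();
        }
    }

    // Called once after restore: runs queued tasks in the order they were added.
    synchronized void onRestore() {
        restored = true;
        Runnable task;
        while ((task = deferred.poll()) != null) {
            task.run();
        }
    }
}
```

With checkpointing set to false, the helper is a pass-through, so the same startup code serves both a normal launch and a checkpoint launch.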

tjwatson commented 2 years ago

Should we detect and report that we are ignoring differences in things like heap size between the checkpoint image and the current environment?

We have issue #18543 open to track that. Nothing was decided in the UFO review on how to handle that. Options:

1. Issue warnings if we detect that things like bootstrap.properties, jvm.options or command line options changed from checkpoint to restore.
2. Try to process what we can on restore, like bootstrap.properties and maybe command line options, but issue warnings for things we cannot properly process on restore.
3. Do nothing, but document the limitations.

The issue with 1 is that it takes extra work on restore to determine that something changed, when the common scenario is that nothing changed.

The issue with 2 is that there is no notification mechanism for the things we can reify at restore time (e.g. system properties), so nothing will know they were updated. This is perhaps similar to the system env, which we do allow to change on restore.

3 is the simplest, but it requires good documentation of the limitations and may confuse users who do not know about them.
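To make the first option concrete, here is a minimal sketch of detecting changed launch inputs by comparing digests recorded at checkpoint with digests computed at restore. It is not taken from the Liberty code base: SketchLaunchInputs and its methods are hypothetical, and reading bootstrap.properties, jvm.options or the command line into strings is left to the caller.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for comparing launch inputs between checkpoint and restore.
final class SketchLaunchInputs {

    // Maps each input name (for example a file name) to a SHA-256 digest of its content.
    static Map<String, String> fingerprint(Map<String, String> inputs) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> entry : inputs.entrySet()) {
            result.put(entry.getKey(), sha256(entry.getValue()));
        }
        return result;
    }

    // Returns, in sorted order, the names whose digest differs or that
    // are present on only one side.
    static List<String> changed(Map<String, String> atCheckpoint, Map<String, String> atRestore) {
        Set<String> names = new TreeSet<>(atCheckpoint.keySet());
        names.addAll(atRestore.keySet());
        List<String> changed = new ArrayList<>();
        for (String name : names) {
            if (!Objects.equals(atCheckpoint.get(name), atRestore.get(name))) {
                changed.add(name);
            }
        }
        return changed;
    }

    private static String sha256(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(content.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(hash);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

The fingerprint map would be stored alongside the checkpoint image. The restore-time cost mentioned above is the work of re-reading and re-hashing each input, even in the common case where nothing changed.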

follis commented 1 year ago

UFO Review May 15, 2023

What should we do if server start from a restored environment fails? Failing to start may just cause whatever is managing the container to try again and again, but just starting normally could result in an undetected error (we would report it, but who looks if the server starts?).

Need to schedule a follow-on to finish up the rest of the UFO.

donbourne commented 1 year ago

Serviceability Approval Comment - Please answer the following questions for serviceability approval:

UFO -- does the UFO identify the most likely problems customers will see and identify how the feature will enable them to diagnose and solve those problems without resorting to raising a PMR? Have these issues been addressed in the implementation? - Answer: The most likely customer issues are described in the UFO under the Serviceability heading starting on page 104.
Test and Demo -- As part of the serviceability process we're asking feature teams to test and analyze common problem paths for serviceability and demo those problem paths to someone not involved in the development of the feature (eg. L2, test team, or another development team).
a) What problem paths were tested and demonstrated? b) Who did you demo to? c) Do the people you demo'd to agree that the serviceability of the demonstrated problem scenarios is sufficient to avoid PMRs for any problems customers are likely to encounter, or that L2 should be able to quickly address those problems without need to engage L3? Answer: we have discussed with L2 (jjurece@us.ibm.com, vikramt@us.ibm.com). A demo and skill transfer to be scheduled for the 3rd week in June
SVT -- SVT team is often the first team to try new features and often encounters problems setting up and using them. Note that we're not expecting SVT to do full serviceability testing -- just to sign-off on the serviceability of the problem paths they encountered. a) Who conducted SVT tests for this feature? Answer: Tam Dinh b) Do they agree that the serviceability of the problems they encountered is sufficient to avoid PMRs, or that L2 should be able to quickly address those problems without need to engage L3? Answer: yes we agree based on SVT early involvement in beta tests
Which L2 / L3 queues will handle PMRs for this feature? Ensure they are present in the contact reference file and in the queue contact summary, and that the respective L2/L3 teams know they are supporting it. Ask Don Bourne if you need links or more info. Answer: The L3 team will be was-squad-osgi, the L2 queue will be WAS L2: ADM
Does this feature add any new metrics or emit any new JSON events? If yes, have you updated the JMX metrics reference list / Metrics reference list / JSON log events reference list in the Open Liberty docs? Answer: No new metrics or JSON events.

follis commented 1 year ago

UFO Review May 24, 2023

Consider whether the Drop: ALL should be part of the user experience
slide 45 - CheckpointHook - fix the @see CheckpointFactory if needed
Make clear that the various hooks described in the API/SPI section are intended to be externals eventually, but not yet
properly document cases where MP technologies read stuff far too early
Chart 69 - clarify this is HTTPSession (not security stuff)
Chart 71 - typo 'KeyStoreConfiguraitonFactory'

Left off at chart 72 (hooks for Transactions)...to be continued.

chirp1 commented 1 year ago

David is completing the InstantOn topic at https://github.com/OpenLiberty/docs/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc . Approving this feature.

NottyCode commented 1 year ago

Approve proceeding to gold driver without FAT, SVT or globalization approval.

mtamboli commented 1 year ago

It is amazing that the SVT team (@tam512 , @Jonathan-Maciel , @bconey) was able to finish all the testing planned for the InstantOn feature in such a short time. Thanks for putting in all the extra effort and weekend work.

Testing was done on Open Liberty and WebSphere Liberty full and kernel images with Java 11 and Java 17. We tested checkpoint and restore on 3 Fyre VMs (AMD RHEL 9.0, AMD Ubuntu 22.04 and Intel Ubuntu 22.04). We deployed checkpoint application images on Amazon Elastic Kubernetes Service (EKS) and Azure Kubernetes Service (AKS) using the Open Liberty operator and the WebSphere Liberty operator. We also tested Knative and ran a 24-hour stress test.

ayoho commented 1 year ago

I’ve removed the FAT approval from the feature after some internal discussions. We can discuss exact thresholds for granting approval again later, but for now Tom and his team have the action of getting tests up and running in more of our automated daily and personal builds. There should be a subset of existing SOE platforms that meet all of the requirements for InstantOn to be runnable (Linux, x86-64, Semeru, etc.), and Tom and company will likely be needing help to make the appropriate updates/installations to those environments for what they need. Once those tasks are done, we’ll have coverage in the main automated builds as well as the SOE. Beyond that, I’m not aware of anything that would prevent approval being re-granted.

LifeIsGood524 commented 1 year ago

any update on the outstanding FAT work?

tjwatson commented 1 year ago

Test machines are coming on line to run the InstantOn FATs in the SOE this weekend. If all goes well then I think we should get approval.

LifeIsGood524 commented 1 year ago

Any update on the outstanding FAT work and closing out this feature?

LifeIsGood524 commented 11 months ago

Still carrying this as technical debt. Is there a target for completion?

mbroz2 commented 11 months ago

There have been some recent meetings regarding this, and ownership of the work has been assigned. However, I'm not aware of a target completion for it.

LifeIsGood524 commented 9 months ago

Do we have an update on target completion?

mbroz2 commented 9 months ago

This work item has been long delivered and made publicly available, and as such I'm closing it.
