OpenLiberty / open-liberty

Open Liberty is a highly composable, fast to start, dynamic application server runtime environment
https://openliberty.io
Eclipse Public License 2.0

Liberty checkpoint feature - InstantOn restore server process #16384

Closed tjwatson closed 9 months ago

tjwatson commented 3 years ago

Description

Use CRIU as the mechanism to achieve InstantOn startup in Liberty running on Linux. This feature will look to expand on the experiments done in the blog https://openliberty.io/blog/2020/02/12/faster-startup-Java-applications-criu.html

To achieve InstantOn, Open Liberty will work with the Open J9 team to define Open J9 APIs that make native Linux calls to the CRIU support available on Linux. Liberty will provide an SPI that allows features to hook into the checkpoint operation in order to prepare the system for a checkpoint, and to hook into the restore operation to fix up the system (if necessary) when restoring the server from a checkpoint.

Collaboration with Open J9 will be necessary to make sure Open J9 provides the APIs required by Liberty to do a checkpoint and a restore operation. Some things in the system will require Open J9 to fix them up to allow for proper behavior after restore. Examples include access to the current environment values when restoring, and making sure objects like SecureRandom and Timers behave properly on restore.
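
As a rough illustration of the hook idea described above, here is a minimal sketch of a prepare/restore contract. All type and method names (`CheckpointHookSketch`, `CheckpointCoordinatorSketch`, `SecureRandomHolderSketch`) are hypothetical and are not the Liberty SPI or the Open J9 API; the `Runnable` passed to `checkpoint` is only a stand-in for whatever call actually takes the checkpoint.

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;

// Hypothetical hook contract (an assumption for illustration, not the real SPI).
interface CheckpointHookSketch {
    default void prepare() {}   // called before the checkpoint is taken
    default void restore() {}   // called after the process has been restored
}

// Hypothetical coordinator that drives registered hooks around a checkpoint.
final class CheckpointCoordinatorSketch {
    private final List<CheckpointHookSketch> hooks = new ArrayList<>();

    void register(CheckpointHookSketch hook) {
        hooks.add(hook);
    }

    // Runs prepare hooks in registration order, then the checkpoint action,
    // then restore hooks in reverse registration order.
    void checkpoint(Runnable takeCheckpoint) {
        for (CheckpointHookSketch hook : hooks) {
            hook.prepare();
        }
        takeCheckpoint.run(); // stand-in for the real checkpoint call
        for (int i = hooks.size() - 1; i >= 0; i--) {
            hooks.get(i).restore();
        }
    }
}

// Example hook: discard the SecureRandom before the checkpoint so its state is
// not captured in the image, and create a fresh instance on restore.
final class SecureRandomHolderSketch implements CheckpointHookSketch {
    private volatile SecureRandom random = new SecureRandom();

    // May return null between prepare() and restore().
    SecureRandom get() {
        return random;
    }

    @Override
    public void prepare() {
        random = null;
    }

    @Override
    public void restore() {
        random = new SecureRandom();
    }
}
```

Running restore hooks in reverse order is one possible design choice, so that components are fixed up in the opposite order from which they were quiesced; the source does not say which ordering Liberty uses.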


Documents

When available, add links to required feature documents. Use "N/A" to mark particular documents which are not required by the feature.

General Instructions

The process steps occur roughly in the order as presented. Process steps occasionally overlap.

Each process step has a number of tasks which must be completed or must be marked as not applicable ("N/A").

Unless otherwise indicated, the tasks are the responsibility of the Feature Owner or a Delegate of the Feature Owner.

If you need assistance, reach out to the OpenLiberty/release-architect.

Important: Labels are used to trigger particular steps and must be added as indicated.


Prioritization (Complete Before Development Starts)

The (OpenLiberty/chief-architect) and area leads are responsible for prioritizing the features and determining which features are being actively worked on.

Prioritization

Design preliminaries determine whether a formal design, which will be provided by an Upcoming Feature Overview (UFO) document, must be created and reviewed. A formal design is required if the feature requires any of the following: UI, Serviceability, SVT, Performance testing, or non-trivial documentation/ID.

Design Preliminaries

Design

No Design

FAT Documentation

A feature must be prioritized before any implementation work may begin to be delivered (inaccessible/no-ship). However, a design focused approach should still be applied to features, and developers should think about the feature design prior to writing and delivering any code.
Besides being prioritized, a feature must also be socialized (or No Design Approved) before any beta code may be delivered. All new Liberty content must be inaccessible in our GA releases until it is Feature Complete by either marking it kind=noship or beta fencing it.
Code may not GA until this feature has obtained the "Design Approved" or "No Design Approved" label, along with all other tasks outlined in the GA section.

Feature Development Begins

Legal and Translation

In order to avoid last minute blockers and significant disruptions to the feature, the legal items need to be done as early in the feature process as possible, either in design or as early into the development as possible. Similarly, translation is to be done concurrently with development. Both MUST be completed before Beta or GA is requested.

Legal (Complete before Feature Complete Date)

Translation (Complete 1 week before Feature Complete Date)

Innovation (Complete 1 week before Feature Complete Date)

In order to facilitate early feedback from users, all new features and functionality should first be released as part of a beta release.

Beta Code

Beta Blog (Complete 1.5 weeks before beta eGA)

A feature is ready to GA after it is Feature Complete and has obtained all necessary Focal Point Approvals.

Feature Complete

Focal Point Approvals (Complete by Feature Complete Date)

These occur only after GA of this feature is requested (by adding a target:ga label). GA of this feature may not occur until all approvals are obtained.

All Features

Design Approved Features

Remove Beta Fencing (Complete by Feature Complete Date)

GA Blog (Complete by Feature Complete Date)

Post GA

follis commented 2 years ago

POC Review recording: https://ibm.box.com/s/7ok2a4v7rfqb6lm6zwp3ne6prlm8t2tj Comments:

tkburroughs commented 2 years ago

I've opened the following story under this feature for the EJB Container app start settings that should be considered for CRIU, as mentioned above from the POC Review.

CRIU : EJBContainer settings for deferred EJB initialization and bean pool preloading

follis commented 2 years ago

POC Review - Part 2 - recording: https://ibm.box.com/s/gwfjl8mkv14bh6uyg6kmk3dw5v7e0l0z Comments

follis commented 2 years ago

POC Review - Part 3 - recording: https://ibm.box.com/s/zq1ks3la1evzvpqjv803d9x0xo8n5gcl No issues noted. There will be a part 4.

mbroz2 commented 2 years ago

POC Review - Part 4 - Comments

  • Monitoring - update regarding usage metering (defer until after restore)? -> needs to be somewhere in the UFO
  • Monitoring - validate monitoring is working correctly/lazily and doesn't inhibit first request performance on restore
  • Beta - Elevated privileges - Needs a follow-up offline discussion
  • Automated Testing - In-Container: Consider using MicroShed
  • Platform / Cloud Considerations - want checkpoint to be integrated into docker build flow (currently must be during docker run due to need for elevated privileges that docker build won't have) https://github.com/CRaC/docs#quarkus
  • Platform / Cloud Considerations - s2i (RH) & cloud native buildpacks (Pivotal) need to be considered
  • A11y - needs consideration due to command line additions

tjwatson commented 2 years ago
  • The server script should handle finding a saved image but not finding the CRIU package, failing gracefully

This is already handled by the server script. The following is currently displayed:

bin/server: 1328: criu: not found
CWWKE0957I: Restoring the checkpoint server process failed. Check the .../logs/checkpoint/restore.log log to determine why the checkpoint process was not restored. Launching the server without using the checkpoint image.
Launching <server name> (Open Liberty 22.0.0.4/wlp-1.0.63.202203290952) on Eclipse OpenJ9 VM, version 11.0.15-internal+0-adhoc.jenkins.BuildJDK11x86-64linuxcriuPersonal (en_US)
[AUDIT   ] CWWKE0001I: The server <server-name> has been launched.
...
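
The fallback behavior shown in the log above can be sketched roughly as follows. This is only an illustration in Java: the real logic lives in the bin/server shell script, and the class, enum, and parameter names here are hypothetical.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of "try restore, otherwise launch normally".
final class LaunchWithFallbackSketch {
    enum Mode { RESTORED, NORMAL }

    // attemptRestore returns true when the checkpoint image was restored.
    // It may also throw, for example when the criu binary cannot be found.
    static Mode launch(boolean imageExists, BooleanSupplier attemptRestore, Runnable normalLaunch) {
        if (imageExists) {
            try {
                if (attemptRestore.getAsBoolean()) {
                    return Mode.RESTORED;
                }
            } catch (RuntimeException e) {
                // Treat any failure to restore the same way as a false result.
            }
            System.err.println("Restore failed; launching the server without using the checkpoint image.");
        }
        normalLaunch.run();
        return Mode.NORMAL;
    }
}
```

The point of the sketch is that a failed restore is not fatal: the server still starts, just without the startup benefit of the checkpoint image.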
tjwatson commented 2 years ago
  • Should the package command include a checkpoint image? Is the image even portable?

Opened issue #20649 to track

tjwatson commented 2 years ago

POC Review recording: https://ibm.box.com/s/7ok2a4v7rfqb6lm6zwp3ne6prlm8t2tj Comments:

  • Work with Tracy Burroughs about how the EJB feature startup might change the defaults to behave better in a restored environment, although there was concern that these changes might cause EJB applications to fail to start

Tracy has opened an issue to track: #20368

  • be more clear that this is a Linux only thing

UFO updated

  • Could some features defer starting until after the restore (or maybe defer part of it) for features that do things at startup that would need to be done even after a checkpoint-restore

Opened issue #20653 to track

  • There was a discussion at the end about use for development vs. production. If we've got a decision about that, maybe be clear what it is :-)

I updated the UFO to make it clear this is not intended to improve the developer experience.

tjwatson commented 2 years ago
  • Further thought on what happens if an image restarts in the same pod (reused, not copied from the docker image every time).

The workarea is now backed up even in a container, as of commit 864a402daca097f00f18a4c32b40a01faadbb251. This allows a container that launches a checkpoint process to be restarted and launch the checkpoint process again.

tjwatson commented 2 years ago
  • Could some features defer starting until after the restore (or maybe defer part of it) for features that do things at startup that would need to be done even after a checkpoint-restore

Opened another issue #20660 for conditionally enabling components once the process is restored.

tjwatson commented 2 years ago
  • Should we detect and report that we are ignoring differences in things like heap size between the checkpoint image and the current environment?

We have issue #18543 open to track that. Nothing was decided on the UFO review on how to handle that. Options:

  1. Issue warnings if we detect stuff like bootstrap.properties, jvm.options or command line options changed from checkpoint to restore.
  2. Try to process stuff that we can on restore, like bootstrap.properties, maybe command line options but issue warnings for things we cannot properly process on restore.
  3. Do nothing, but document the limitations.

The issue with 1 is that it takes extra work on restore to determine something changed when the common scenario is nothing changed.

The issue with 2 is that there is no notification mechanism for the things we can reify at restore time (e.g. system properties), so nothing will know they were updated. This is perhaps similar to the system env, which we do allow to change on restore.

3 is the simplest, but it requires good documentation on the limitations and may confuse users who do not know about them.
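
To make option 1 a bit more concrete, here is a minimal sketch of one way to detect changed files, assuming a digest of each file is recorded at checkpoint and compared on restore. The class and method names are hypothetical, and nothing here reflects a decision on #18543. It also shows the cost mentioned above: every tracked file is read again on restore, even when nothing changed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: fingerprint config files at checkpoint, compare at restore.
final class ConfigFingerprintSketch {

    // Records a SHA-256 digest per file; a missing file maps to an empty digest.
    static Map<Path, byte[]> snapshot(List<Path> files) {
        Map<Path, byte[]> result = new LinkedHashMap<>();
        for (Path file : files) {
            result.put(file, digest(file));
        }
        return result;
    }

    // Returns the files whose current digest differs from the recorded one.
    static List<Path> changedSince(Map<Path, byte[]> atCheckpoint) {
        List<Path> changed = new ArrayList<>();
        for (Map.Entry<Path, byte[]> entry : atCheckpoint.entrySet()) {
            if (!Arrays.equals(entry.getValue(), digest(entry.getKey()))) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }

    private static byte[] digest(Path file) {
        if (!Files.exists(file)) {
            return new byte[0];
        }
        try {
            return MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A caller would take the snapshot of files such as bootstrap.properties and jvm.options just before the checkpoint, then log a warning for each path returned by `changedSince` after restore.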

follis commented 1 year ago

UFO Review May 15, 2023

Need to schedule a follow-on to finish up the rest of the UFO.

donbourne commented 1 year ago

Serviceability Approval Comment - Please answer the following questions for serviceability approval:

  1. UFO -- does the UFO identify the most likely problems customers will see and identify how the feature will enable them to diagnose and solve those problems without resorting to raising a PMR? Have these issues been addressed in the implementation? - Answer: The most likely customer issues are described in the UFO under the Serviceability heading starting on page 104.

  2. Test and Demo -- As part of the serviceability process we're asking feature teams to test and analyze common problem paths for serviceability and demo those problem paths to someone not involved in the development of the feature (eg. L2, test team, or another development team).
    a) What problem paths were tested and demonstrated? b) Who did you demo to? c) Do the people you demo'd to agree that the serviceability of the demonstrated problem scenarios is sufficient to avoid PMRs for any problems customers are likely to encounter, or that L2 should be able to quickly address those problems without need to engage L3? Answer: We have discussed with L2 (jjurece@us.ibm.com, vikramt@us.ibm.com). A demo and skill transfer are to be scheduled for the 3rd week in June.

  3. SVT -- SVT team is often the first team to try new features and often encounters problems setting up and using them. Note that we're not expecting SVT to do full serviceability testing -- just to sign-off on the serviceability of the problem paths they encountered. a) Who conducted SVT tests for this feature? Answer: Tam Dinh b) Do they agree that the serviceability of the problems they encountered is sufficient to avoid PMRs, or that L2 should be able to quickly address those problems without need to engage L3? Answer: yes we agree based on SVT early involvement in beta tests

  4. Which L2 / L3 queues will handle PMRs for this feature? Ensure they are present in the contact reference file and in the queue contact summary, and that the respective L2/L3 teams know they are supporting it. Ask Don Bourne if you need links or more info. Answer: The L3 team will be was-squad-osgi; the L2 queue will be WAS L2: ADM.

  5. Does this feature add any new metrics or emit any new JSON events? If yes, have you updated the JMX metrics reference list / Metrics reference list / JSON log events reference list in the Open Liberty docs? Answer: No new metrics or JSON events.

follis commented 1 year ago

UFO Review May 24, 2023

Left off at chart 72 (hooks for Transactions)...to be continued.

chirp1 commented 1 year ago

David is completing the InstantOn topic at https://github.com/OpenLiberty/docs/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc . Approving this feature.

NottyCode commented 1 year ago

Approve proceeding to gold driver without FAT, SVT or globalization approval.

mtamboli commented 1 year ago

It is amazing that the SVT team (@tam512, @Jonathan-Maciel, @bconey) was able to finish all the testing planned for the InstantOn feature in such a short time. Thanks for putting in all the extra effort and weekend work.

Testing was done on Open Liberty and WebSphere Liberty full and kernel images with Java 11 and Java 17. We tested checkpoint and restore on 3 Fyre VMs (AMD RHEL 9.0, AMD Ubuntu 22.04 and Intel Ubuntu 22.04). We deployed checkpoint application images on Amazon Elastic Kubernetes Service (EKS) and Azure Kubernetes Service (AKS) using the Open Liberty operator and the WebSphere Liberty operator. We also tested Knative and ran a 24-hour stress test.

ayoho commented 1 year ago

I’ve removed the FAT approval from the feature after some internal discussions. We can discuss exact thresholds for granting approval again later, but for now Tom and his team have the action of getting tests up and running in more of our automated daily and personal builds. There should be a subset of existing SOE platforms that meet all of the requirements for InstantOn to be runnable (Linux, x86-64, Semeru, etc.), and Tom and company will likely be needing help to make the appropriate updates/installations to those environments for what they need. Once those tasks are done, we’ll have coverage in the main automated builds as well as the SOE. Beyond that, I’m not aware of anything that would prevent approval being re-granted.

LifeIsGood524 commented 1 year ago

any update on the outstanding FAT work?

tjwatson commented 1 year ago

Test machines are coming on line to run the InstantOn FATs in the SOE this weekend. If all goes well then I think we should get approval.

LifeIsGood524 commented 1 year ago

Any update on the outstanding FAT work and closing out this feature?

LifeIsGood524 commented 11 months ago

Still carrying this as technical debt. Is there a target for completion?

mbroz2 commented 11 months ago

There have been some recent meetings regarding this, and ownership of the work has been assigned. However, I'm not aware of a target completion date for it.

LifeIsGood524 commented 9 months ago

Do we have an update on target completion?

mbroz2 commented 9 months ago

This work item was delivered and made publicly available long ago, and as such I'm closing it.