OpenLiberty / docs

See Open Liberty documentation on https://openliberty.io/docs/

Documentation for Liberty Checkpoint - InstantOn restore server process #5983

Closed dazavala closed 1 year ago

dazavala commented 2 years ago

For epic https://github.com/OpenLiberty/open-liberty/issues/16384. Opened in accordance with point 6 of Documenting Open Liberty.

Create topics in openliberty.io for the new checkpoint server command and for InstantOn development.

The discovery for documentation requirements was held Nov 11, and openliberty.io requirements are summarized below. InstantOn also requires guide documentation that will be published on openliberty.io.

Documentation for InstantOn in-container usage may require updates to the DockerHub Open Liberty and IBM Container Repository documentation. Separate issues will be opened for DockerHub and ICR.

dazavala commented 2 years ago

Summary of openliberty.io user documentation discovery (Nov 11)

dmuelle commented 1 year ago

Currently targeting GA with 23.0.0.6.

malincoln commented 1 year ago

@tam512 please add details regarding testing with Docker 23.0.0 that would need to be doc'd per discussion on scrum call. Thanks. cc'ing @tjwatson

tam512 commented 1 year ago

docker build does not have a --cap-add flag, so we cannot build a checkpoint image the convenient way, with RUN checkpoint.sh applications in the Containerfile or Dockerfile, as documented in the 23.0.0.2-beta blog. To build a checkpoint image with docker, we need to use the 3 steps (build, checkpoint, commit) as documented in the first beta blog.

tjwatson commented 1 year ago

There is more investigation needed to see about using docker buildx to do more advanced things during a container build. One thing is to be able to specify the necessary capabilities to do a checkpoint during the container build in one step. I did look at doing this and it seemed the checkpoint would succeed, but the resulting image could not restore properly. This needs more investigation, but right now it is not a priority for the initial release.

I recommended that we document the 3 step process for building an InstantOn application image in general, for both podman and docker builds. The 3 step process will work for both. Then we can have another section that describes the single step process that uses the checkpoint.sh script in a Dockerfile, but limit that documentation to indicate it is only supported with podman builds.
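
For reference, the 3 step process (build, checkpoint, commit) could be sketched roughly as follows. This is only an illustration, not the documented procedure: the checkpoint container name, the WLP_CHECKPOINT environment variable, and the use of --privileged for the checkpoint run are assumptions that do not come from this thread, and the run helper with its DRY_RUN switch is a hypothetical convenience so the commands can be inspected before anything is executed.

```shell
#!/bin/sh
# Sketch of the three-step InstantOn image build: build, checkpoint, commit.
# Assumed (not from this thread): container name, WLP_CHECKPOINT, --privileged.

ENGINE=${ENGINE:-docker}   # set ENGINE=podman to use Podman instead
DRY_RUN=${DRY_RUN:-1}      # 1 = only print the commands, 0 = execute them

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_instanton_image() {
    app_image=$1            # for example liberty-app
    instanton_image=$2      # for example liberty-app-instanton
    container=${app_image}-checkpoint-container

    # Step 1: build the ordinary application image.
    run "$ENGINE" build -t "$app_image" . || return 1

    # Step 2: run the image once so that the server checkpoints and stops.
    run "$ENGINE" run --name "$container" --privileged \
        --env WLP_CHECKPOINT=afterAppStart "$app_image" || return 1

    # Step 3: commit the stopped container, which now holds the checkpoint,
    # as the InstantOn image, then remove the temporary container.
    run "$ENGINE" commit "$container" "$instanton_image" || return 1
    run "$ENGINE" rm "$container"
}
```

Calling build_instanton_image liberty-app liberty-app-instanton with the default DRY_RUN=1 only prints the four commands. Because none of the steps rely on extra capabilities during the image build itself, the same sequence applies to both engines.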

tam512 commented 1 year ago

Need to document the system calls that are necessary to do a restore. --security-opt seccomp=unconfined is the easy way to grant all system calls, but explicitly granting only the required system calls is better than opening up access to all of them.
@tjwatson can help with this.
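
To make the two options concrete, here is a small sketch that prints (but does not run) a restore command in both variants. The image name and the profile path ./instanton-seccomp.json are hypothetical, and the capability list is an assumption rather than a confirmed minimum. The list of system calls that such a profile would need to allow is not reproduced here.

```shell
#!/bin/sh
# Print a container run command for restoring an InstantOn image.
# $3 is either "unconfined" or the path to a seccomp profile JSON file.
restore_command() {
    engine=$1      # docker or podman
    image=$2       # for example liberty-app-instanton (assumed name)
    seccomp=$3     # "unconfined" or a profile path (hypothetical)
    echo "$engine run --rm" \
        "--cap-add=CHECKPOINT_RESTORE --cap-add=SETPCAP" \
        "--security-opt seccomp=$seccomp $image"
}

# Easy option: lift the seccomp filter entirely.
restore_command docker liberty-app-instanton unconfined

# Tighter option: a custom profile that allows only the required calls.
restore_command docker liberty-app-instanton ./instanton-seccomp.json
```

The second form keeps the default deny behavior for everything the profile does not list, which is the reason it is preferable to seccomp=unconfined.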

tam512 commented 1 year ago

Document troubleshooting for checkpoint failure on SELinux per issue 24522. @ymanton please help with this. Thanks!

dmuelle commented 1 year ago

Draft docs ready for technical review:

Questions

tjwatson commented 1 year ago
  • The checkpoint command isn't mentioned in the main documentation at all. What is the strategy for this command? How would anyone know how to integrate it with the information in the main topic? I notice Dave's comment about needing to decide whether or not it will be external. If the decision is yes, I think we need to provide more information about use case, how to, etc.

For in container usage, the checkpoint command is an implementation detail behind the checkpoint.sh script that gets run when doing an InstantOn application container build. We could reference it from the section Building an InstantOn application image (https://docs-draft-openlibertyio.mqj6zf7jocq.us-south.codeengine.appdomain.cloud/docs/latest/instanton.html#build), after the two bullets that describe the two options for building InstantOn. Something like this:

Both options use the xref...server-checkpoint.adoc[checkpoint] server command to perform the checkpoint during InstantOn application container image build.

  • The documentation refers to both CRIU and InstantOn in a seemingly interchangeable manner in some sections. eg, "CRIU cannot perform a checkpoint..." or "InstantOn makes a checkpoint..." Does the user need to worry about CRIU beyond how it is explained as an enabling technology in the introduction? Could subsequent references to checkpoint/restore processes just say "InstantOn" ?

For the most part we can replace CRIU with InstantOn I think. There are some exceptions.

Some rewording will be needed if you do this. For example:

Other public cloud Kubernetes services might also work if they have the prerequisites to allow InstantOn to restore the InstantOn application process.

Seems awkward to have two InstantOn in that sentence. Perhaps something like this:

Other public cloud Kubernetes services might also work if they have the prerequisites to allow the InstantOn application process to restore.

  • Should 23.0.0.6 be explicitly documented as a prerequisite minimum OL version in the prereq section? It is mentioned but not listed among other host prereqs. Same question with Semeru Java 11/17

The host prerequisites section was intended to list the prereqs required to run our InstantOn enabled Open Liberty container images, that is, the images described in the paragraph before the bulleted list. That paragraph indicates that the images with InstantOn support start at 23.0.0.6 and are based on the Semeru runtime. It also mentions the Java 11/17 Semeru versions there. It would seem redundant to me to indicate these again as host system requirements. Besides, I don't see these as host system prereqs because they are just what is in the images themselves, not what is on the actual host system running the images.

dmuelle commented 1 year ago

For in container usage, the checkpoint command is an implementation detail behind the checkpoint.sh script that gets run when doing an InstantOn application container build.

Is there any use case where someone would run it from the CLI? Or need the doc to customize the script somehow? I don't think we necessarily need to mention it on the main page, but the command documentation could be misleading if we don't make it clear that the command is only used by the script (if that's the case). OTOH if it's purely an implementation detail, should we be documenting it externally at all?

tjwatson commented 1 year ago

Is there any use case where someone would run it from the CLI? Or need the doc to customize the script somehow? I don't think we necessarily need to mention it on the main page, but the command documentation could be misleading if we don't make it clear that the command is only used by the script (if that's the case). OTOH if it's purely an implementation detail, should we be documenting it externally at all?

For this release we can consider it an implementation detail of the InstantOn container build for checkpoint. With that in mind I think we could just omit the command page for checkpoint for now.

dmuelle commented 1 year ago

Thanks @tjwatson - I made the following updates per your responses:

  • Fast startup with InstantOn
  • InstantOn system calls
  • InstantOn limitations and known issues

Let me know if any further edits are needed. When you're satisfied with the drafts, you can add the technical reviewed label to this issue and I'll send it for ID peer review to prepare for publishing with 23.0.0.6. Thanks

tam512 commented 1 year ago

@tjwatson I have some questions about the Fast startup with InstantOn documentation

  1. Supported processors and Runtime and host build system prerequisites

Currently, the only supported processor is X86-64/AMD64. Other processors are expected to be supported in later releases of Open Liberty InstantOn.

We tested checkpoint and restore on VM with the following Intel processor, but we only claim support on AMD64?

cat /proc/cpuinfo | grep -i model
model : 85
model name : Intel Xeon Processor (Skylake, IBRS)

  2. Runtime and host build system prerequisites

Currently, InstantOn is supported with the IBM Semeru Java version 11.0.9+ and IBM Semeru Java version 17.0.7+

When testing with the Java 11 image, I see Java version 11.0.19+7, as follows. I just want to confirm whether we support InstantOn on Java 11.0.9+ or 11.0.19+.

Launching defaultServer (WebSphere Application Server 23.0.0.6/wlp-1.0.78.cl230620230608-1100) on Eclipse OpenJ9 VM, version 11.0.19+7 (en_US)

  3. Do we need to mention that beforeAppStart or afterAppStart checkpoint location is not case sensitive?

  4. Deploying an InstantOn application to Kubernetes services

When testing on AKS and EKS, I also have securityContext allowPrivilegeEscalation: true but I do not see it listed in the doc, so do we need it?

tam512 commented 1 year ago

Regarding (4), we need allowPrivilegeEscalation: true when deploying checkpoint application images on AKS and EKS; otherwise we get the following error:

CRIU needs to have the CAP_SYS_ADMIN or the CAP_CHECKPOINT_RESTORE capability: 
setcap cap_checkpoint_restore+eip /opt/criu/criu
CWWKE0964E: Restoring the checkpoint server process failed. Check the /logs/checkpoint/restore.log log to determine why the checkpoint process was not restored. The server did not launch because checkpoint restore recovery is disabled.

tjwatson commented 1 year ago

We tested checkpoint and restore on VM with the following Intel processor, but we only claim support on AMD64?

The doc always refers to both X86-64 and AMD64 with a slash (e.g. X86-64/AMD64). By and large, you can think of the two as aliases to each other. Technically speaking AMD provided the architectural design of AMD64 which was originally an extension of the x86 architecture. It then began being referred to as X86-64 also. Both Intel and AMD implement chipsets that follow the architectural design (see https://en.wikipedia.org/wiki/X86-64 for more context).
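
As a quick illustration of why the vendor does not matter, a host check can look at the machine architecture instead of the CPU model name. The sketch below is a hypothetical helper, not part of any Open Liberty tooling. It assumes the architecture is reported as x86_64 (what uname -m prints on Linux for both Intel and AMD 64-bit processors) or amd64 (the spelling some other tools use).

```shell
#!/bin/sh
# Returns 0 when the given machine name denotes the X86-64/AMD64 architecture.
is_x86_64() {
    case $1 in
        x86_64|amd64) return 0 ;;
        *)            return 1 ;;
    esac
}

if is_x86_64 "$(uname -m)"; then
    echo "X86-64/AMD64 host"
else
    echo "not an X86-64/AMD64 host"
fi
```

On the Intel Xeon VM mentioned above, uname -m reports x86_64, so the check passes regardless of the processor vendor.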

When testing with the Java 11 image, I see Java version 11.0.19+7, as follows. I just want to confirm whether we support InstantOn on Java 11.0.9+ or 11.0.19+.

You are correct, looks like 11.0.9+ was a typo and should be 11.0.19+
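
To show how the corrected minimums apply to a version string like the one in the server log, here is a small sketch. The function name is hypothetical, and it encodes only the two minimums stated in this thread (11.0.19+ and 17.0.7+). Any other major version is treated as unknown and rejected, which is an assumption of this sketch and not a statement about actual support.

```shell
#!/bin/sh
# Returns 0 when a version such as "11.0.19+7" meets the minimums quoted in
# this thread: 11.0.19 for Java 11 and 17.0.7 for Java 17.
instanton_java_ok() {
    version=${1%%+*}            # drop the build suffix: 11.0.19+7 -> 11.0.19
    old_ifs=$IFS
    IFS=.
    set -- $version             # split on dots into $1 $2 $3
    IFS=$old_ifs
    major=${1:-0}
    minor=${2:-0}
    patch=${3:-0}

    case $major in
        11) min_patch=19 ;;
        17) min_patch=7 ;;
        *)  return 1 ;;         # other majors: unknown to this sketch
    esac

    # A non-zero minor is newer than any x.0.y release of the same major.
    [ "$minor" -gt 0 ] || [ "$patch" -ge "$min_patch" ]
}

if instanton_java_ok "11.0.19+7"; then
    echo "11.0.19+7 meets the minimum"
fi
```

With these rules 11.0.19+7 passes, while a string such as 11.0.9+11 does not, which matches the typo correction above.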

Do we need to mention that beforeAppStart or afterAppStart checkpoint location is not case sensitive?

We could, I don't personally think it is required to document that though.

When testing on AKS and EKS, I also have securityContext allowPrivilegeEscalation: true but I do not see it listed in the doc, so do we need it?

Good point, we should have that documented. For completeness, can you show us your complete securityContext sections? Or better yet, the delta of the deployment yaml you use for InstantOn vs a "normal" liberty application deployment.

tam512 commented 1 year ago

This is the complete securityContext when testing with InstantOn. Without InstantOn, we do not need to specify a securityContext section.

spec:
  ...........

  securityContext:
    allowPrivilegeEscalation: true
    privileged: false
    runAsNonRoot: true
    capabilities:
      add:
      - CHECKPOINT_RESTORE
      - SETPCAP
      drop:
      - ALL
dmuelle commented 1 year ago

@ramkumar-k-9286 - this issue is ready for peer review:

  • Fast startup with InstantOn
  • InstantOn system calls
  • InstantOn limitations and known issues

ramkumar-k-9286 commented 1 year ago

Peer Review (Fast startup with InstantOn)

InstantOn is not intended to be used outside of a container image build. -> (Acrolinx Suggestion) Do not use InstantOn outside of a container image build.


This configuration ensures that the resources in the lower layers of the image do not change from the time the checkpoint is taken to the time the image is started with InstantOn. -> (Acrolinx - Accessibility) This configuration ensures that the resources in the underlying layers of the image do not change from the time the checkpoint is taken to the time the image is started with InstantOn.


Which of these options you choose depends on the kind of code your application must run. -> (acrolinx suggestion) Which of these options you choose depends on the code your application must run.


The following examples assume you are using Docker to build an application image that is named liberty-app. --> The following examples assume that you are using Docker to build an application image that is named liberty-app.


Jakarta EE and MicroProfile applications might contain application code that gets run as the application is started, such as the following examples: -> Should we be adding links for Jakarta EE and MicroProfile here? 2nd mention of both after the short desc.


Add periods for the following bulleted list. Similar list before and after have periods.


In some cases, the application code that runs as the application starts might not be suited for performing an InstantOn checkpoint. -> (acrolinx suggestion) Sometimes the application code that runs as the application starts might not be suited for performing an InstantOn checkpoint.


Reading configuration that is expected to change when the application is deployed, for example configuration from MicroProfile Config. -> A reading configuration that is expected to change when the application is deployed. For example, configuration from MicroProfile Config.


This option might result in slower restore times because it must run more code before the application is ready to service incoming requests. -> This option might result in slower restore times because it must run more code before the application is ready to service any incoming requests.


For more information about limitations with early startup code annd possible workarounds, see InstantOn limitations and known issues. -> For more information about limitations with early startup code and possible workarounds, see InstantOn limitations and known issues.


Starting with Open Liberty version 23.0.0.6, all X86-64/AMD64 UBI Open Liberty container images include the prerequisites for InstantOn to checkpoint and restore Open Liberty application processes. -> Starting with Open Liberty version 23.0.0.6, all X86-64/AMD64 UBI Open Liberty container images include the prerequisites for InstantOn to checkpoint and restoring Open Liberty application processes.


Currently, InstantOn is supported with the IBM Semeru Java version 11.0.19+ and IBM Semeru Java version 17.0.7+. InstantOn is expected to support new versions of IBM Semeru Java as they are released. -> Currently, InstantOn is supported by IBM Semeru Java version 11.0.19+ and IBM Semeru Java version 17.0.7+. InstantOn is expected to support new versions of IBM Semeru Java as they are released.


CHECKPOINT_RESTORE - This capability was added in Linux 5.9 to separate out checkpoint/restore functions from the overloaded SYS_ADMIN capability. -> CHECKPOINT_RESTORE - This capability was added in Linux 5.9 to separate checkpoint/restore functions from the overloaded SYS_ADMIN capability.


The following examples assume you are using Docker to build an application image that is named liberty-app. -> The following examples assume that you are using Docker to build an application image that is named liberty-app.


Starting a container with the liberty-app-instanton container image shows a much faster startup time than the original liberty-app image. -> Starting a container with the liberty-app-instanton container image shows a faster startup time than the original liberty-app image.


If restoration of the InstantOn application process fails, Open Liberty launches the server without using the InstantOn checkpoint process. -> If restoration of the InstantOn application process fails, Open Liberty starts the server without using the InstantOn checkpoint process.


In such cases, the Open Liberty application starts as if no InstantOn checkpoint process layer exists, which takes significantly longer than a successfully restored InstantOn process. -> In such cases, the Open Liberty application starts as if no InstantOn checkpoint process layer exists, which takes longer than a successfully restored InstantOn process.


ramkumar-k-9286 commented 1 year ago

Peer Review (InstantOn system calls)

No comments.

ramkumar-k-9286 commented 1 year ago

Peer Review (InstantOn limitations and known issues)

For more information about InstantOn prerequisties, see Runtime and host build system prerequisites. --> For more information about InstantOn prerequisites, see Runtime and host build system prerequisites.

--

If this @Inject annotation of the configuration is contained in a CDI bean that is created and used before the checkpoint is performed, the value of "theDefault" is injected. -> Should theDefault be in " " ?


This configuration allows the values to be updated with environment variable values or other configuration mechanisms, as described in Configuring microservices running in Kubernetes. -> link not working


InstantOn supports only a subset of Open Liberty features, as described in Open Liberty InstantOn supported features. -> Link working - but not redirected to #supported-feature on the linked page.


When an InstantOn application container image is run the bootstrap.properties file is not read. -> When an InstantOn application container image is run, the bootstrap.properties file is not read.

For example, you might use environment variables or other configuration mechanisms, as described Configuring microservices running in Kubernetes. -> link not working


If Yama is configured with one of the following modes, InstantOn cannot checkpoint or restore the application process in running containers:

2 - admin-only attach

3 - no attach -> If Yama is configured with one of the following modes, InstantOn cannot checkpoint or restore the application process in running containers:

2 - admin-only attach

3 - no attach


For InstantOn checkpoint and restore to work, Yama must be configured with one of the following modes:

0 - classic ptrace permissions

1 - restricted ptrace -> For InstantOn checkpoint and restore to work, Yama must be configured with one of the following modes:

0 - classic ptrace permissions

1 - restricted ptrace
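
As an aside on the Yama modes listed above, a host can be checked with a short script like the following sketch. It assumes the standard Linux location /proc/sys/kernel/yama/ptrace_scope for the current mode. The function name and the messages are hypothetical.

```shell
#!/bin/sh
# Returns 0 for the Yama modes that allow InstantOn checkpoint and restore.
yama_mode_ok() {
    case $1 in
        0|1) return 0 ;;   # 0 = classic ptrace permissions, 1 = restricted ptrace
        *)   return 1 ;;   # 2 = admin-only attach, 3 = no attach, or unknown
    esac
}

scope_file=/proc/sys/kernel/yama/ptrace_scope
if [ -r "$scope_file" ]; then
    mode=$(cat "$scope_file")
    if yama_mode_ok "$mode"; then
        echo "Yama mode $mode: checkpoint and restore are possible"
    else
        echo "Yama mode $mode: checkpoint and restore are blocked"
    fi
else
    echo "No Yama ptrace_scope setting found on this host"
fi
```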


As described in Required Linux system calls, CRIU requires a number of Linux system calls to restore the application process. -> As described in Required Linux system calls, CRIU requires several Linux system calls to restore the application process.


Amazon Elastic Kubernetes Service (EKS) and Azure Kubernetes Service (AKS) -> These links are provided multiple times on the same page. Is that required?


dmuelle commented 1 year ago

Thanks for reviewing @ramkumar-k-9286 - all suggestions implemented except:

Add periods for the following bulleted list. Similar list before and after have periods.

  • A servlet that uses the loadOnStartup attribute
  • An EJB that uses the @Startup annotation
  • A CDI bean that uses @Observes @Initialized(ApplicationScoped.class) annotations

Also are these 2 separate items? @Observes and @Initialized(ApplicationScoped.class) - because there seems to be a space after @Observes

The previous list had a mix of sentences and fragments, which requires periods for the items. This list is only fragments, so no periods are needed. W/r/t the annotations- they are separate annotations, but used in conjunction in this context.

Reading configuration that is expected to change when the application is deployed, for example configuration from MicroProfile Config. -> A reading configuration that is expected to change when the application is deployed. For example, configuration from MicroProfile Config.

This is correct as is. All items in this list begin with verb phrases that describe app scenarios. "Reading configuration that is expected to change..." is a verb phrase, not a compound noun.

Starting with Open Liberty version 23.0.0.6, all X86-64/AMD64 UBI Open Liberty container images include the prerequisites for InstantOn to checkpoint and restoring Open Liberty application processes.

present participle (restoring) doesn't agree with infinitive verb clause (to checkpoint...). "Checkpoint and restore.." is a compound verb phrase used throughout the doc, with precedent in the Linux source doc.

Currently, InstantOn is supported by IBM Semeru Java version 11.0.19+ and IBM Semeru Java version 17.0.7+. InstantOn is expected to support new versions of IBM Semeru Java as they are released.

"with" is more correct in this context, as the intention is to say that iOn is supported when used in conjunction with Java 11/17.

Configuring microservices running in Kubernetes -> this link goes to the guides, which aren't available from the docs draft site. I confirmed it works from the main draft site.

Let me know if any further changes are needed. Thanks

  • Fast startup with InstantOn
  • InstantOn system calls
  • InstantOn limitations and known issues

ramkumar-k-9286 commented 1 year ago

Review (Fast startup with InstantOn)

You can use these steps with either Podman and Docker to build an Instanton application image. -> You can use these steps with either Podman or Docker to build an InstantOn application image.


In addition to the features that are enabled in the convenience features, InstantON also supports the following features: -> In addition to the features that are enabled in the convenience features, InstantOn also supports the following features:


LINK:https://jakarta.ee/[Jakarta EE] and LINK:https://microprofile.io/[MicroProfile] applications might contain application code that gets run as the application is started, such as the following examples: -> doc to be fixed - link showing up.


dmuelle commented 1 year ago

Thanks for catching those @ramkumar-k-9286 - all fixed

https://docs-draft-openlibertyio.mqj6zf7jocq.us-south.codeengine.appdomain.cloud/docs/latest/instanton.html

dmuelle commented 1 year ago

Content is on vNext and will publish with 23.0.0.6.