Documentation for CRIU Project

dsouzai commented 2 years ago

This issue tracks the topics that need documentation, either as markdown in the project or as blogs over at https://blog.openj9.org/.

The following is a list, separated into high level categories, of some of the work that probably deserves mention in one or more blogs.

Basic Implementation

Java API

Security Considerations and Changes

https://github.com/eclipse-openj9/openj9/issues/12490
Reinitialization of security objects

Rootless CRIU

https://github.com/eclipse-openj9/openj9/issues/14265

Container Engine Considerations

Docker

Changes in Docker needed to pass CAP_CHECKPOINT_RESTORE
Running with --security-opt systempaths=unconfined --security-opt apparmor=unconfined to ensure Docker mounts the proc filesystem as read/write.

Podman

OCP

Testing Environment Challenges

https://github.com/eclipse-openj9/openj9/issues/14016

RAS

https://github.com/eclipse-openj9/openj9/issues/14237

dsouzai commented 2 years ago

fyi @vijaysun-omr @tajila

vijaysun-omr commented 2 years ago

Thanks for getting this started @dsouzai. I agree that the list of items that you linked offers a good view into the areas we need to address. The specifics of what the hooks are being designed to handle (e.g. Random, Timers, environment variables and JCE) and how each is handled could be further sub-topics under the "hook" main topic.

Taking a step back, I think we may need to establish the goal of the documentation clearly. In my mind, the goal is to ease a user of OpenJ9 or Java into this very different world of CRIU and snapshot/restore slowly so that they do not have to read N documents strewn all over the web about CRIU, the operating system details etc. In other words, we probably want to have a sequence and flow to the documentation that is along the lines of what a good discussion ought to follow: "why", "what", "when" and then finally the "how" of the topic.

Most of the items in your starting list above go in the "how" part of the flow. In the "how" section, we need information clearly called out about "prerequisites" (supported OS versions, supported platforms etc.) and we also need to have a page on "limitations" as well as "what will behave differently", so that a user can find the information they are interested in directly rather than derive it from the other articles that we have.

The "what" section probably needs thought about the topics we would mention there, i.e. we are offering an API and hooks for frameworks or other applications to use, but that isn't going to be usable on its own, i.e. higher level code has to call into our API. This point could perhaps be made with a simple example program that calls into our APIs to generate a snapshot, together with the command for restoring from it. You could use the same example program to make various further points, i.e. why you need hooks or different arguments/versions of OS etc. as you make it more and more complicated, i.e. point out problems and then introduce the solution (link to the article that describes it) and how it allows you to proceed further with your small example.
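To make the "why you need hooks" point concrete, here is a minimal sketch of the kind of running example described above. It does not call the OpenJ9 API: `FakeCheckpoint`, `TokenSource`, `restoredValue` and the hook method names are hypothetical stand-ins invented for illustration, and the checkpoint and restore are simulated inside one process. It only shows that two restores from the same snapshot repeat the same "random" value unless a post-restore hook reseeds the generator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class HookSketch {
    /** Hypothetical stand-in for a checkpoint facility; NOT the OpenJ9 API. */
    static final class FakeCheckpoint {
        private final List<Runnable> postRestoreHooks = new ArrayList<>();

        void registerPostRestoreHook(Runnable hook) {
            postRestoreHooks.add(hook);
        }

        /** Simulates a checkpoint immediately followed by a restore. */
        void checkpointAndRestore() {
            // A real implementation would freeze the process to disk here
            // and resume it later; this sketch only runs the hooks.
            postRestoreHooks.forEach(Runnable::run);
        }
    }

    /** Application state whose random seed would be captured in a snapshot. */
    static final class TokenSource {
        private Random random;

        TokenSource(long seed) {
            random = new Random(seed);
        }

        void reseed(long seed) {
            random = new Random(seed);
        }

        int next() {
            return random.nextInt();
        }
    }

    /**
     * Simulates one restore of the same snapshot and returns the first value
     * the application draws afterwards. freshSeed stands for entropy that is
     * only available at restore time (an assumption of this sketch).
     */
    static int restoredValue(boolean withHook, long freshSeed) {
        TokenSource source = new TokenSource(42L); // state frozen in the snapshot
        FakeCheckpoint checkpoint = new FakeCheckpoint();
        if (withHook) {
            checkpoint.registerPostRestoreHook(() -> source.reseed(freshSeed));
        }
        checkpoint.checkpointAndRestore();
        return source.next();
    }

    public static void main(String[] args) {
        // Without a hook, every restore of the snapshot repeats the same value.
        System.out.println(restoredValue(false, 1L) == restoredValue(false, 2L));
        // With a reseeding hook, different restores diverge.
        System.out.println(restoredValue(true, 1L) != restoredValue(true, 2L));
    }
}
```

The same skeleton could be grown step by step (timers, environment variables and so on), with each new problem linking to the article that describes its fix.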

vijaysun-omr commented 2 years ago

If you guys feel this sort of a flow would be good to have, then we may be able to divide up the work such that we focus on the main flow that works with the "why", "what" and the running example, and in parallel ask some of the folks who implemented the different features/hooks/fixups/prerequisites to describe their own problem/solution via standalone articles.

vijaysun-omr commented 2 years ago

A different organization could be a set of "how to" articles, but I'll let you express your preferences.

dsouzai commented 2 years ago

I think the standard layout of writing about the motivation (why), followed by a description of the capability (what), followed by a simple example is probably a good first entry to have, especially as a landing point. This entry would be quite broadly consumable by all levels of the software stack. It could also talk about when we expect to have this available for people to try out.

After that, it probably makes sense to have a single entry for a broad description of the implementation of this capability (how) but without getting too deep into the details. This would make it as consumable as possible because it's a high level description (to the extent possible) of how the JVM implements this, while possibly alluding to the ultimate goal, which is to have this work in the container workflow and as such follow the standard best practices.

The next logical step then is to focus on the various efforts to get rootless checkpoint/restore, from changes required both within the JVM and the environment (CRIU, Docker, etc.).

I think only after these three does it make sense to have the various low level entries because at the very least, the previous three entries will have provided, to the broadest possible audience, the "why", "what", "how" and anyone then interested in the nitty gritty details is free to read the various blogs/entries that pop up over time. These could be, I suppose, organized as a set of "how to" articles, talking about the problem and how we decided to tackle it.

vijaysun-omr commented 2 years ago

Okay, that sounds like a practical approach. We need to get the broad articles done before the beta, and then the rest can roll in afterwards or in parallel (at least they are not as critical). @tajila there may be some text from disclosure documents and such that we have written internally that perhaps could be used to serve as a starting point for these broad articles?
