OpenLiberty / open-liberty

Open Liberty is a highly composable, fast to start, dynamic application server runtime environment
https://openliberty.io
Eclipse Public License 2.0
1.16k stars 597 forks source link

InstantOn failure reports a file /logs/checkpoint/restore.log that doesn't exist #23664

Open kgibm opened 1 year ago

kgibm commented 1 year ago

Describe the bug

Running an InstantOn image in OpenShift without proper privileges shows the CWWKE0957I error but the referenced /logs/checkpoint/restore.log file does not exist.

Steps to Reproduce

  1. oc create deployment libertydiag --image=quay.io/ibm/libertydiag:instanton
  2. Find the pod:
    $ oc get pods
    NAME                           READY   STATUS    RESTARTS   AGE
    libertydiag-596cc64bdd-rdj62   1/1     Running   0          12m
  3. Print pod logs and observe CWWKE0957I:
    $ oc logs libertydiag-596cc64bdd-rdj62 | grep CWWKE0957I
    CWWKE0957I: Restoring the checkpoint server process failed. Check the /logs/checkpoint/restore.log log to determine why the checkpoint process was not restored. Launching the server without using the checkpoint image.
  4. However, listing the specified directory does not show restore.log:
    $ oc exec libertydiag-596cc64bdd-rdj62 -- ls -l /logs/checkpoint/
    total 8
    -rw-------. 1 default root 1657 Dec  5 21:13 checkpoint.log
    -rw-r-----. 1 default root   54 Dec  5 21:13 stats-dump

Expected behavior

/logs/checkpoint/restore.log should exist with details of the InstantOn restore failure.

Diagnostic information:

Additional context

@tjwatson fyi

tjwatson commented 1 year ago

I don't think this should have a release bug for something that is still in beta.

Did you have any additional message about criu not having permissions?

tjwatson commented 1 year ago

@sebratton Could you have a look. The same thing should happen if you run in-container without specifying the necessary --add-cap options.

kgibm commented 1 year ago

Did you have any additional message about criu not having permissions?

$ oc logs libertydiag-596cc64bdd-rdj62

/opt/ol/wlp/bin/server: line 1458: /opt/ol/wlp/output/defaultServer/workarea/checkpoint/restoreTime: Permission denied
/opt/ol/wlp/bin/server: line 1474: /opt/ol/wlp/output/defaultServer/workarea/checkpoint/.env.properties: Permission denied
/opt/ol/wlp/bin/server: line 1413: /usr/sbin/criu: Operation not permitted
CWWKE0957I: Restoring the checkpoint server process failed. Check the /logs/checkpoint/restore.log log to determine why the checkpoint process was not restored. Launching the server without using the checkpoint image.
JVMSHRC226E Error opening shared class cache file
JVMSHRC336E Port layer error code = -102
JVMSHRC337E Platform error message: Permission denied
JVMSHRC840E Failed to start up the shared cache.
JVMSHRC686I Failed to startup shared class cache. Continue without using it as -Xshareclasses:nonfatal is specified
Launching defaultServer (Open Liberty 22.0.0.13-beta/wlp-1.0.71.cl221220221107-1900) on Eclipse OpenJ9 VM, version 17.0.5-ea+8 (en_US)
kgibm commented 1 year ago

Confirmed similar results with podman:

  1. Launch container:

    $ podman run --rm quay.io/ibm/libertydiag:instanton
    
    /opt/ol/wlp/bin/server: line 1413: /usr/sbin/criu: Operation not permitted
    CWWKE0957I: Restoring the checkpoint server process failed. Check the /logs/checkpoint/restore.log log to determine why the checkpoint process was not restored. Launching the server without using the checkpoint image.
    Launching defaultServer (Open Liberty 22.0.0.13-beta/wlp-1.0.71.cl221220221107-1900) on Eclipse OpenJ9 VM, version 17.0.5-ea+8 (en_US) [...]
  2. In new window:
    $ podman exec -it $(podman ps -q) ls -l /logs/checkpoint/
    total 8
    -rw-------. 1 default root 1657 Dec  5 21:13 checkpoint.log
    -rw-r-----. 1 default root   54 Dec  5 21:13 stats-dump
tjwatson commented 1 year ago

/opt/ol/wlp/bin/server: line 1458: /opt/ol/wlp/output/defaultServer/workarea/checkpoint/restoreTime: Permission denied /opt/ol/wlp/bin/server: line 1474: /opt/ol/wlp/output/defaultServer/workarea/checkpoint/.env.properties: Permission denied

These are weird, we should always have write access to the work area. Is there something that makes the container file system read-only here? I would think that would cause "normal" liberty application containers to fail.

/opt/ol/wlp/bin/server: line 1413: /usr/sbin/criu: Operation not permitted

This is the error I was expecting from trying to invoke criu without the necessary capabilities.

kgibm commented 1 year ago

Is there something that makes the container file system read-only here?

Probably related to folder permissions. The user running the container is 1000830000 but the checkpoint folder is owned by default:root. I'm not sure where 1000830000 is coming from as I didn't specify it.

$ oc exec libertydiag-596cc64bdd-rdj62 -- whoami
1000830000
$ oc exec libertydiag-596cc64bdd-rdj62 -- grep -e default -e 1000830000 /etc/passwd
default:x:1001:0::/home/default:/usr/sbin/nologin
1000830000:x:1000830000:0:1000830000 user:/:/sbin/nologin
$ oc exec libertydiag-596cc64bdd-rdj62 -- ls -l /opt/ol/wlp/output/defaultServer/workarea/
total 20
drwxr-x---. 1 default    root    22 Dec  5 21:13 checkpoint
-rw-r-----. 1 1000830000 root 15004 Dec 12 20:29 equinox.log
drwxrwx---. 1 default    root  4096 Dec 12 20:31 org.eclipse.osgi
drwxr-x---. 3 1000830000 root   117 Dec 12 20:30 platform

Running the vanilla container shows it running with the default user:

$ podman run --rm quay.io/ibm/libertydiag:instanton whoami
default
tjwatson commented 1 year ago

@jgawor do you know what is going on here with the users on OCP?