OpenLiberty / ci.docker

Eclipse Public License 1.0
42 stars 59 forks source link

InstantOn app image fails to restore on OCP if deploy with /logs persistence #550

Open tam512 opened 3 weeks ago

tam512 commented 3 weeks ago

when we deploy InstantOn app image to OCP using the following statefulSet settings

apiVersion: liberty.websphere.ibm.com/v1
kind: WebSphereLibertyApplication
metadata:
    name: quicksec-jakarta
    labels:
      app: quicksec_wlo
spec:
  license:
    accept: true
  applicationImage: my_instanton_app_image
  statefulSet: 
    storage:
      size: 2Gi
      mountPath: "/logs"
..............

the server will fail to start with the following errors:

mkdir: cannot create directory ‘/logs/checkpoint’: Permission denied
/opt/ibm/wlp/bin/server: line 208: /logs/console.log: Permission denied
tail: cannot open '/logs/console.log' for reading: No such file or directory
tail: no files remaining
Error (criu/log.c:221): Can't create log file /logs/checkpoint/restore.log: No such file or directory
/opt/ibm/wlp/bin/server: line 1377: kill: (215) - No such process
CWWKE0961I: Restoring the checkpoint server process failed. Check the /logs/checkpoint/restore.log log to determine why the checkpoint process was not restored. Launching the server without using the checkpoint image.
JVMJ9GC063E Unable to open file '/logs/verbosegc.log' for writing
tam512 commented 3 weeks ago

This problem was discussed in slack https://ibm-cloud.slack.com/archives/C03MR7EC3NG/p1717621969960729?thread_ts=1717187652.885869&cid=C03MR7EC3NG

tjwatson commented 2 weeks ago

@tam512 do you see any logs persisted with a container image that doesn't use InstantOn in this setup? I see no way logs can be persisted if the mount of the /logs/ folder isn't writable by the user in the running container.

tjwatson commented 2 weeks ago

I can recreate outside of OCP if I mount a directory which is not writable by the user in the container.

When I attempt to start the same application image without the use of InstantOn then I get loads of output from verbose GC (because it cannot write to the verbosegc log and ends up logging to the standard out. I also don't see any messages or trace logs in the mounted logs directory.

For the OCP case to work the logs mount has to be writable by the user in the container. That does not appear to be the case here. I am unsure how you configure the deployment on OCP to have a writable mount here.

tam512 commented 2 weeks ago

I can see the logs persisted without instantOn. I just deployed the app image to OCP and I can see older logs with older time stamps

% oc exec quicksec-jakarta-0  -- ls -l /            
total 8
...........
drwxr-xr-x.   1 root    root       4096 Jun  4 12:53 licenses
drwxrwsr-x.   3 root    1000790000    7 Jun 18 19:41 logs
.............
drwxr-xr-x.   1 root    root         17 Jun  6 13:42 opt
lrwxrwxrwx.   1 root    root         33 Jun  4 12:54 output -> /opt/ibm/wlp/output/defaultServer
..........

% oc exec quicksec-jakarta-0  -- ls -l /logs
total 3088
drwxrws---. 2 1000790000 1000790000     366 Jun  4 15:35 ffdc
-rw-rw----. 1 1000790000 1000790000   65060 Jun  5 18:53 messages_24.06.18_19.41.57.0.log
-rw-r-----. 1 1000790000 1000790000   53270 Jun 18 19:42 messages.log
-rw-rw----. 1 1000790000 1000790000  161940 May 29 21:45 trace_24.05.30_21.50.28.0.log
-rw-rw----. 1 1000790000 1000790000  278480 May 30 21:56 trace.log
-rw-rw----. 1 1000790000 1000790000  294580 Jun 18 19:42 verbosegc.log
-rw-rw----. 1 1000790000 1000790000 2306837 May 24 21:14 verbosegc.log.001

% oc exec quicksec-jakarta-0  -- ls -l /logs/ffdc
total 23197
-rw-rw----. 1 1000790000 1000790000   1205 Jun  4 15:35 exception_summary_24.06.04_15.32.29.0.log
-rw-rw----. 1 1000790000 1000790000   1205 Jun  4 15:38 exception_summary_24.06.04_15.35.49.0.log
-rw-rw----. 1 1000790000 1000790000  30554 May 23 16:44 ffdc_24.05.23_16.44.08.0.log
...........
-rw-rw----. 1 1000790000 1000790000  36550 May 24 02:22 ffdc_24.05.24_02.22.08.0.log
-rw-rw----. 1 1000790000 1000790000  32494 May 24 02:22 ffdc_24.05.24_02.22.09.0.log
........
-rw-rw----. 1 1000790000 1000790000   6907 May 25 11:08 ffdc_24.05.25_11.08.22.0.log
-rw-rw----. 1 1000790000 1000790000 129314 May 28 20:41 ffdc_24.05.28_20.41.27.0.log
.......
-rw-rw----. 1 1000790000 1000790000 129334 May 30 23:43 ffdc_24.05.30_23.43.39.0.log
-rw-rw----. 1 1000790000 1000790000   6906 May 31 00:07 ffdc_24.05.31_00.07.58.0.log
.........
-rw-rw----. 1 1000790000 1000790000  87754 Jun  4 15:15 ffdc_24.06.04_15.15.51.0.log
tjwatson commented 2 weeks ago

I just deployed the app image to OCP and I can see older logs with older time stamps

If you just deployed the app image right now to OCP I would expect new logs with today's timestamp.

tam512 commented 2 weeks ago

sorry I was not clear about the logs. I saw today's log (Jun 18) and older logs

-rw-rw----. 1 1000790000 1000790000 65060 Jun 5 18:53 messages_24.06.18_19.41.57.0.log -rw-r-----. 1 1000790000 1000790000 53270 Jun 18 19:42 messages.log

tjwatson commented 1 week ago

I suspect the issue may be the change of the security context used when deploying the instanton image. Something has to be different there that causes the mounted /logs folder to be read-only.

tam512 commented 1 week ago

Below is the SecurityContextConstraints I used. Do I need to update anything to make it writeable to /logs

# https://docs.openshift.com/container-platform/4.14/authentication/managing-security-context-constraints.html
kind: SecurityContextConstraints
metadata:
  annotations:
    kubernetes.io/description: criu-scc is based on the restricted SCC but removes any restrictions
      that prevent the restore image from running successfully.
  generation: 1
  name: cap-cr-scc
allowHostDirVolumePlugin: true
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: false
allowedCapabilities: null
apiVersion: security.openshift.io/v1
defaultAddCapabilities:
- CHECKPOINT_RESTORE
- SETPCAP
fsGroup:
  type: RunAsAny
groups:
- system:authenticated
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
supplementalGroups:
  type: RunAsAny
users: []
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret
tjwatson commented 1 week ago

Maybe @leochr knows. I'm not sure if you can compare with what the effective one is for normal apps.

leochr commented 1 week ago

In OpenShift restricted-v2 is used as the SecurityContextConstraints (SCC) by default. Comparing the above SCC to it, I couldn't find any difference that could explain the storage permission issue. Suggest to try the above SCC cap-cr-scc with a non-instanton app.

tam512 commented 6 days ago

@leochr I also see the permission errors when using the above SCC with non-instanton app, but the app pod says it's running and the server actually started.

quicksec-jakarta-0 1/1 Running 0 6m42s

........
JVMJ9GC063E Unable to open file '/logs/verbosegc.log' for writing
.............
[6/26/24, 21:12:31:378 UTC] 0000002e org.infinispan.HOTROD                                        I ISPN004021: Infinispan version: Infinispan 'Flying Saucer' 14.0.29.Final
TRAS0036E: The system could not create file /logs/messages.log because of the following exception: java.security.PrivilegedActionException: java.io.IOException: Permission denied
...............
[6/26/24, 21:12:35:595 UTC] 0000002b com.ibm.ws.kernel.feature.internal.FeatureManager            I CWWKF0008I: Feature update completed in 7.904 seconds.
[6/26/24, 21:12:35:595 UTC] 0000002b com.ibm.ws.kernel.feature.internal.FeatureManager            A CWWKF0011I: The defaultServer server is ready to run a smarter planet. The defaultServer server started in 9.782 seconds.

When I try to copy the app logs, it also throws permission errors

% oc cp quicksec-jakarta-0:/logs quicksec-jakarta-0/      
tar: Removing leading `/' from member names
tar: /logs/verbosegc.log.001: Cannot open: Permission denied
tar: /logs/messages.log: Cannot open: Permission denied
tar: /logs/trace.log: Cannot open: Permission denied
tar: /logs/verbosegc.log: Cannot open: Permission denied
tar: /logs/ffdc: Cannot open: Permission denied
tar: /logs/trace_24.05.30_21.50.28.0.log: Cannot open: Permission denied
tar: /logs/messages_24.06.25_20.41.29.0.log: Cannot open: Permission denied
tar: Exiting with failure status due to previous errors

I posted the full log in slack here https://ibm-cloud.slack.com/archives/CMM0D6JHF/p1719437551545339