OpenLiberty / ci.docker

Eclipse Public License 1.0

Fail to restore checkpoint app image on OCP 4.13 when having .spec.resources in OpenLibertyApplication #448

Closed tam512 closed 1 year ago

tam512 commented 1 year ago

On an x86 OCP 4.13 cluster, I deployed a checkpoint app image on Liberty 23.0.0.9, and the restore failed with the following errors in restore.log:

% oc exec ebuy-inston-74575f6bd5-pzbx7 -- cat /logs/checkpoint/restore.log
Warn  (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Error (criu/libnetlink.c:84): Can't send request message: Permission denied
Error (criu/libnetlink.c:84): Can't send request message: Permission denied
Error (criu/libnetlink.c:84): Can't send request message: Permission denied
Error (criu/libnetlink.c:84): Can't send request message: Permission denied
Error (criu/libnetlink.c:84): Can't send request message: Permission denied
Error (criu/libnetlink.c:84): Can't send request message: Permission denied
Warn  (criu/kerndat.c:1103): $XDG_RUNTIME_DIR not set. Cannot find location for kerndat file
Error (criu/pstree.c:379): Current sid 166 intersects with pid (129) in images
tam512 commented 1 year ago

The restore fails when .spec.resources is set in the OpenLibertyApplication:

spec:
..........
  resources: 
    limits:
      cpu: 2000m
      memory: 1Gi
    requests:
      cpu: 256m
      memory: 400Mi
tjwatson commented 1 year ago

It appears part of the issue here is that the PID we use to checkpoint collides with an ID already in use on restore. Currently we ensure the Java PID used to checkpoint is > 100. If we bump that to > 2000, the problem goes away.

I'm going to transfer this issue over to ci.docker where the control is for the PID used to checkpoint.

mbroz2 commented 1 year ago

I would have suggested we use the PID 299 792 458, but it seems that's likely larger than allowed :) Perhaps instead we could use 3108 (3x10^8)?

tjwatson commented 1 year ago

> I would have suggested we use the PID 299 792 458, but it seems that's likely larger than allowed :) Perhaps instead we could use 3108 (3x10^8)?

The way we do this today is not ideal:

https://github.com/OpenLiberty/ci.docker/blob/faefa763efc9b42164b7dd4afaad402ada299432/releases/latest/full/helpers/build/checkpoint.sh#L4-L7

We loop, repeatedly calling an external program, pidplus.sh, to inflate the next available PID before we invoke the checkpoint command that launches the Java instance. Bumping the threshold from 100 to 2000 introduces a noticeable delay (~6 seconds) before checkpoint.sh can proceed. For now I think we should bump it to 1000 while we investigate a more efficient way to elevate the PID used for checkpoint.

PS: don't suggest writing to ns_last_pid, because that will not be available during a container image build, which is where we want to do the checkpoint.
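
For illustration only, the PID-inflation idea described above could be sketched roughly as follows. This is not the actual checkpoint.sh or pidplus.sh code: the function name bump_pid_above is hypothetical, and the sketch assumes that the PID limit of the namespace is larger than the requested floor. It does not write to ns_last_pid; it only consumes PIDs by forking short-lived children.

```shell
#!/bin/bash
# Hypothetical helper, not the actual ci.docker script.
# Spawns short-lived children until a new child receives a PID greater than
# min_pid, then prints that PID.
# Assumption: the PID limit of this namespace is larger than min_pid;
# otherwise this loop would never terminate.
bump_pid_above() {
    local min_pid="$1"
    local last_pid=0
    while [ "$last_pid" -le "$min_pid" ]; do
        # A background no-op forks one child, which consumes one PID.
        true &
        last_pid=$!
        # Reap the child so no stray background jobs are left behind.
        wait "$last_pid"
    done
    echo "$last_pid"
}

# Example (threshold taken from the discussion above):
# bump_pid_above 1000
```

Forking a builtin no-op in the background avoids launching an external program on every iteration, which might reduce the delay mentioned above, although that has not been measured here. The sketch also does not guarantee the PID of the next process, since other processes in the same namespace can take PIDs in between.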

tam512 commented 1 year ago

Verified that the problem is fixed with the following 23.0.0.9 OL and WL images, tested on P10 OCP 4.13:

stg.icr.io/cp/olc/open-liberty:23.0.0.9-kernel-slim-java17-openj9-ubi-ppc64le@sha256:ade829b8de982c2647718d40f51e2734b1e6d57731f185d25f9f280fb2558685
stg.icr.io/cp/olc/open-liberty:23.0.0.9-kernel-slim-java11-openj9-ubi-ppc64le@sha256:917c9065d793e06a79e6953597e1b692b933a2f1c5f17e7c001ce8c19ab501c3
stg.icr.io/cp/olc/open-liberty:23.0.0.9-full-java17-openj9-ubi-ppc64le@sha256:328bb37a438095116e09d66d7d1f3413d4466e25e9837a97e6de8925d840845a
stg.icr.io/cp/olc/open-liberty:23.0.0.9-full-java11-openj9-ubi-ppc64le@sha256:c95c11995cfd104c07bc4d13c4da21014e3c96705eec06cf5562ffb4c83118fb

stg.icr.io/cp/wlc/websphere-liberty:23.0.0.9-kernel-java17-openj9-ubi-ppc64le@sha256:924eda17ea7bb04efd1989115392b4a4904d2d835c5b6e0c50cbf71734be1d27
stg.icr.io/cp/wlc/websphere-liberty:23.0.0.9-kernel-java11-openj9-ubi-ppc64le@sha256:842c86706bc57d5e201340abe47b0a85334cae3e0e56880cbd66723a3eee871a
stg.icr.io/cp/wlc/websphere-liberty:23.0.0.9-full-java17-openj9-ubi-ppc64le@sha256:58495198cd619c756fb17388563666d02bcfdac64c6a712669e34a651aca2423
stg.icr.io/cp/wlc/websphere-liberty:23.0.0.9-full-java11-openj9-ubi-ppc64le@sha256:f98155922bb0c5e91697bc93c6c8b9d34c6e82ceb9ac80633994b2785f3e0423