Closed tam512 closed 1 year ago
With .spec.resources set in the OpenLibertyApplication, restore failed:
spec:
  ..........
  resources:
    limits:
      cpu: 2000m
      memory: 1Gi
    requests:
      cpu: 256m
      memory: 400Mi
It appears part of the issue here is that the PID we use to checkpoint is having a collision on restore. Currently we ensure the Java pid being used to checkpoint is > 100. If we bump that to be > 2000 the problem goes away.
I'm going to transfer this issue over to ci.docker where the control is for the PID used to checkpoint.
I would have suggested we use the PID 299 792 458, but it seems that's larger than is likely allowed :) Perhaps instead we could use 3108 (3x10^8)?
The way we do this today is not ideal: we loop over calling an external program, pidplus.sh, to inflate the available pid before we invoke the checkpoint command that launches the java instance. Bumping that from 100 to 2000 does introduce a noticeable delay (~6 seconds) before checkpoint.sh can proceed. For now I think we should try to bump that up to 1000 while we investigate a more efficient way to elevate the PID used for checkpoint.
PS - don't suggest writing to ns_last_pid because that will not be available during a container image build where we want to do the checkpoint.
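For readers following along, the loop-and-spawn approach described above can be sketched roughly as follows. This is not the actual pidplus.sh from ci.docker; the function names `spawn_and_report_pid` and `inflate_pid` are hypothetical and only illustrate the idea of spawning short-lived processes until the kernel hands out a PID above the threshold.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the real pidplus.sh: names and structure are assumptions.

# Start a short-lived child shell and print the PID the kernel assigned to it.
spawn_and_report_pid() {
  sh -c 'echo $$'
}

# inflate_pid MIN: keep spawning children until one receives a PID greater than MIN.
# Each iteration costs at least one process launch, which is why a higher MIN takes longer.
inflate_pid() {
  min="$1"
  pid=$(spawn_and_report_pid)
  while [ "$pid" -le "$min" ]; do
    pid=$(spawn_and_report_pid)
  done
  # Print the first PID observed above the threshold.
  echo "$pid"
}
```

Calling `inflate_pid 1000` right before launching the checkpoint would correspond to the proposed threshold of 1000. Because each step launches an external process, the cost grows with the threshold, which is consistent with the delay mentioned above.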
Verified that the problem is fixed using the following 23.0.0.9 OL and WL images, tested on P10 OCP 4.13:
stg.icr.io/cp/olc/open-liberty:23.0.0.9-kernel-slim-java17-openj9-ubi-ppc64le@sha256:ade829b8de982c2647718d40f51e2734b1e6d57731f185d25f9f280fb2558685
stg.icr.io/cp/olc/open-liberty:23.0.0.9-kernel-slim-java11-openj9-ubi-ppc64le@sha256:917c9065d793e06a79e6953597e1b692b933a2f1c5f17e7c001ce8c19ab501c3
stg.icr.io/cp/olc/open-liberty:23.0.0.9-full-java17-openj9-ubi-ppc64le@sha256:328bb37a438095116e09d66d7d1f3413d4466e25e9837a97e6de8925d840845a
stg.icr.io/cp/olc/open-liberty:23.0.0.9-full-java11-openj9-ubi-ppc64le@sha256:c95c11995cfd104c07bc4d13c4da21014e3c96705eec06cf5562ffb4c83118fb
stg.icr.io/cp/wlc/websphere-liberty:23.0.0.9-kernel-java17-openj9-ubi-ppc64le@sha256:924eda17ea7bb04efd1989115392b4a4904d2d835c5b6e0c50cbf71734be1d27
stg.icr.io/cp/wlc/websphere-liberty:23.0.0.9-kernel-java11-openj9-ubi-ppc64le@sha256:842c86706bc57d5e201340abe47b0a85334cae3e0e56880cbd66723a3eee871a
stg.icr.io/cp/wlc/websphere-liberty:23.0.0.9-full-java17-openj9-ubi-ppc64le@sha256:58495198cd619c756fb17388563666d02bcfdac64c6a712669e34a651aca2423
stg.icr.io/cp/wlc/websphere-liberty:23.0.0.9-full-java11-openj9-ubi-ppc64le@sha256:f98155922bb0c5e91697bc93c6c8b9d34c6e82ceb9ac80633994b2785f3e0423
On an x86 OCP 4.13 cluster, deploying a checkpoint app image on Liberty 23.0.0.9 failed with the following error in restore.log: