adoptium / aqa-tests

Home of test infrastructure for Adoptium builds
https://adoptium.net/aqavit
Apache License 2.0

PingPerf thread Checkpoint failed on JDK17 linux aarch64 #4632

Closed llxia closed 1 year ago

llxia commented 1 year ago
01:02:21.458  "/internaljdk/java/openjdk/bin/java" -version
01:02:21.458  java version "17.0.8-beta" 2023-07-18
01:02:21.458  IBM Semeru Runtime Certified Edition 17.0.8+6-202306171433 (build 17.0.8-beta+6-202306171433)
01:02:21.458  Eclipse OpenJ9 VM 17.0.8+6-202306171433 (build master-a09876e49, JRE 17 Linux aarch64-64-Bit Compressed References 20230617_423 (JIT enabled, AOT enabled)
01:02:21.458  OpenJ9   - a09876e49
01:02:21.458  OMR      - e09e7b5f1
01:02:21.458  JCL      - c723411f39 based on jdk-17.0.8+6)

On ub22, thread Checkpoint failed: Internal build

00:59:10.821  Launching defaultServer (Open Liberty 23.0.0.7-beta/wlp-1.0.78.cl230620230612-1100) on Eclipse OpenJ9 VM, version 17.0.7+7 (en_US)
00:59:10.821  CWWKE0953W: This version of Open Liberty is an unsupported early release version.
00:59:16.292  [AUDIT   ] CWWKE0001I: The server defaultServer has been launched.
00:59:41.233  [AUDIT   ] CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/defaults/keystore.xml
00:59:41.233  [AUDIT   ] CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/defaults/open-default-port.xml
01:00:06.032  [ERROR   ] CWWKF0001E: A feature definition could not be found for checkpoint-1.0
01:00:14.358  [AUDIT   ] CWWKZ0058I: Monitoring dropins for applications.
01:00:18.272  [WARNING ] CWOWB1009W: Implicit bean archives are disabled.
01:01:08.491  [AUDIT   ] CWWKC0451I: A server checkpoint "afterAppStart" was requested. When the checkpoint completes, the server stops.
01:01:08.491  [AUDIT   ] CWWKZ0022W: Application pingperf has not started in 30.028 seconds.
01:01:08.491  [ERROR   ] CWWKC0453E: The server checkpoint request failed with the following message: CWWKC0457E: An error occurred preparing to take a checkpoint with the following message: CWWKZ0022W: Application pingperf has not started in 29.948 seconds.
01:01:08.491  [AUDIT   ] CWWKE0084I: The server defaultServer is stopping because thread Checkpoint failed, exiting... (0000005a) called the method java.lang.System.exit: 
01:01:08.491    at java.base/java.lang.System.exit(System.java:502)
01:01:08.491    at io.openliberty.checkpoint.internal.CheckpointImpl.lambda$checkpointOrExitOnFailure$1(CheckpointImpl.java:266)
01:01:08.491    at java.base/java.lang.Thread.run(Thread.java:857)
01:01:08.491  
01:01:08.491  [AUDIT   ] CWWKF0012I: The server installed the following features: [cdi-3.0, concurrent-2.0, jndi-1.0, jsonp-2.0, restfulWS-3.0, restfulWSClient-3.0, servlet-5.0].
01:01:08.491  [AUDIT   ] CWWKF0011I: The defaultServer server is ready to run a smarter planet. The defaultServer server started in 131.953 seconds.
01:01:08.491  [AUDIT   ] CWWKE1100I: Waiting for up to 30 seconds for the server to quiesce.
01:01:28.952  [AUDIT   ] CWWKZ0001I: Application pingperf started in 54.948 seconds.
01:01:29.368  [AUDIT   ] CWWKZ0009I: The application pingperf has stopped successfully.
01:01:36.059  -----------------------------------
01:01:36.059  criu_pingPerf_testCreateRestoreImageAndPushToRegistry_0_FAILED
01:01:36.059  -----------------------------------

On rhel9, exec container process `/bin/sh`: Exec format error: internal build

00:23:02.336  [2/2] STEP 11/23: RUN set -eux;     ARCH="$(uname -m)";     case "${ARCH}" in        aarch64|arm64)          DUMB_INIT_URL='https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_aarch64';          DUMB_INIT_SHA256=b7d648f97154a99c539b63c55979cd29f005f88430fb383007fe3458340b795e;          ;;        amd64|x86_64)          DUMB_INIT_URL='https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64';          DUMB_INIT_SHA256=e874b55f3279ca41415d290c512a7ba9d08f98041b28ae7c2acb19a545f1c4df;          ;;        ppc64el|ppc64le)          DUMB_INIT_URL='https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_ppc64le';          DUMB_INIT_SHA256=3d15e80e29f0f4fa1fc686b00613a2220bc37e83a35283d4b4cca1fbd0a5609f;          ;;        s390x)          DUMB_INIT_URL='https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_s390x';          DUMB_INIT_SHA256=47e4601b152fc6dcb1891e66c30ecc62a2939fd7ffd1515a7c30f281cfec53b7;          ;;       *)          echo "Unsupported arch: ${ARCH}";          exit 1;          ;;     esac;     curl -LfsSo /usr/bin/dumb-init ${DUMB_INIT_URL};     echo "${DUMB_INIT_SHA256} */usr/bin/dumb-init" | sha256sum -c -;     chmod +x /usr/bin/dumb-init;
00:23:02.336  exec container process `/bin/sh`: Exec format error
00:23:02.337  Error: building at STEP "RUN set -eux;     ARCH="$(uname -m)";     case "${ARCH}" in        aarch64|arm64)          DUMB_INIT_URL='https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_aarch64';          DUMB_INIT_SHA256=b7d648f97154a99c539b63c55979cd29f005f88430fb383007fe3458340b795e;          ;;        amd64|x86_64)          DUMB_INIT_URL='https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64';          DUMB_INIT_SHA256=e874b55f3279ca41415d290c512a7ba9d08f98041b28ae7c2acb19a545f1c4df;          ;;        ppc64el|ppc64le)          DUMB_INIT_URL='https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_ppc64le';          DUMB_INIT_SHA256=3d15e80e29f0f4fa1fc686b00613a2220bc37e83a35283d4b4cca1fbd0a5609f;          ;;        s390x)          DUMB_INIT_URL='https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_s390x';          DUMB_INIT_SHA256=47e4601b152fc6dcb1891e66c30ecc62a2939fd7ffd1515a7c30f281cfec53b7;          ;;       *)          echo "Unsupported arch: ${ARCH}";          exit 1;          ;;     esac;     curl -LfsSo /usr/bin/dumb-init ${DUMB_INIT_URL};     echo "${DUMB_INIT_SHA256} */usr/bin/dumb-init" | sha256sum -c -;     chmod +x /usr/bin/dumb-init;": while running runtime: exit status 1
00:23:02.337  -----------------------------------
00:23:02.337  criu_pingPerf_testCreateRestoreImageAndPushToRegistry_0_FAILED
00:23:02.337  -----------------------------------

Noticed the same issue on JDK11 and JDK17.

https://github.com/OpenLiberty/ci.docker/compare/ee87dfa7f7c7de01d12786aa71517fa8f4007883...3fbc2789ee736701f729febb747082ff9cbbd170

We use releases/latest/beta/Dockerfile.ubi.openjdk17 from the instanton branch of https://github.com/OpenLiberty/ci.docker.git.

llxia commented 1 year ago

Trying to narrow down the issue in the Open Liberty repo. The test passes with 30caf5d0e1d71cc9efefc3bec250a6c72c084168 (Grinder link). Grinder fails with 38c8145efc63adc3686466ef4a3322c20720330d, with a different error (Grinder link):

00:01:40.541  Successfully tagged localhost/ol-instanton-test-pingperf:latest
00:01:40.541  3ec5e2856fc1c033dc4052d1de5c6384d32a273acabb69671a389b0e07d9bacb
00:01:40.541  create restore image ol-instanton-test-pingperf-restore ...
00:01:54.907  Performing checkpoint --at=afterAppStart
00:01:55.324  
00:03:21.531  CWWKE0954E: The specified (afterappstart) checkpoint phase is empty or unknown.
00:03:21.531  
00:03:22.538  -----------------------------------
00:03:22.538  criu_pingPerf_testCreateRestoreImageAndPushToRegistry_0_FAILED
00:03:22.538  -----------------------------------

FYI @tajila

llxia commented 1 year ago

The PingPerf test is excluded temporarily on zlinux and alinux.

llxia commented 1 year ago

Also noticed exec /bin/sh: exec format error on plinux and zlinux

00:08:13.437  [2/2] STEP 10/23: COPY fixes/ /opt/ol/fixes/
00:08:14.706  --> 7c4b2fa0f46
00:08:14.707  [2/2] STEP 11/23: RUN set -eux;     ARCH="$(uname -m)";     case "${ARCH}" in        aarch64|arm64)          DUMB_INIT_URL='https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_aarch64';          DUMB_INIT_SHA256=b7d648f97154a99c539b63c55979cd29f005f88430fb383007fe3458340b795e;          ;;        amd64|x86_64)          DUMB_INIT_URL='https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64';          DUMB_INIT_SHA256=e874b55f3279ca41415d290c512a7ba9d08f98041b28ae7c2acb19a545f1c4df;          ;;        ppc64el|ppc64le)          DUMB_INIT_URL='https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_ppc64le';          DUMB_INIT_SHA256=3d15e80e29f0f4fa1fc686b00613a2220bc37e83a35283d4b4cca1fbd0a5609f;          ;;        s390x)          DUMB_INIT_URL='https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_s390x';          DUMB_INIT_SHA256=47e4601b152fc6dcb1891e66c30ecc62a2939fd7ffd1515a7c30f281cfec53b7;          ;;       *)          echo "Unsupported arch: ${ARCH}";          exit 1;          ;;     esac;     curl -LfsSo /usr/bin/dumb-init ${DUMB_INIT_URL};     echo "${DUMB_INIT_SHA256} */usr/bin/dumb-init" | sha256sum -c -;     chmod +x /usr/bin/dumb-init;
00:08:15.077  exec /bin/sh: exec format error
00:08:17.707  Error: building at STEP "RUN set -eux;     ARCH="$(uname -m)";     case "${ARCH}" in        aarch64|arm64)          DUMB_INIT_URL='https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_aarch64';          DUMB_INIT_SHA256=b7d648f97154a99c539b63c55979cd29f005f88430fb383007fe3458340b795e;          ;;        amd64|x86_64)          DUMB_INIT_URL='https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64';          DUMB_INIT_SHA256=e874b55f3279ca41415d290c512a7ba9d08f98041b28ae7c2acb19a545f1c4df;          ;;        ppc64el|ppc64le)          DUMB_INIT_URL='https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_ppc64le';          DUMB_INIT_SHA256=3d15e80e29f0f4fa1fc686b00613a2220bc37e83a35283d4b4cca1fbd0a5609f;          ;;        s390x)          DUMB_INIT_URL='https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_s390x';          DUMB_INIT_SHA256=47e4601b152fc6dcb1891e66c30ecc62a2939fd7ffd1515a7c30f281cfec53b7;          ;;       *)          echo "Unsupported arch: ${ARCH}";          exit 1;          ;;     esac;     curl -LfsSo /usr/bin/dumb-init ${DUMB_INIT_URL};     echo "${DUMB_INIT_SHA256} */usr/bin/dumb-init" | sha256sum -c -;     chmod +x /usr/bin/dumb-init;": while running runtime: exit status 1
00:08:17.708  -----------------------------------
00:08:17.708  criu_pingPerf_testCreateRestoreImageAndPushToRegistry_0_FAILED

Grinder

ymanton commented 1 year ago

The exec /bin/sh: exec format error happens because when we build the Liberty image we are using the x86-64 Semeru image as a base, rather than the aarch64/ppc64le/s390x images:

[2023-06-17T15:33:55.302Z] [2/2] STEP 1/23: FROM icr.io/appcafe/ibm-semeru-runtimes:open-17-ea-jdk-ubi-amd64
[2023-06-17T15:33:55.302Z] WARNING: image platform (linux/amd64) does not match the expected platform (linux/arm64)

The Liberty Dockerfile at https://github.com/OpenLiberty/ci.docker/blob/main/releases/latest/beta/Dockerfile.ubi.openjdk17 is referencing an architecture-specific base image unfortunately:

FROM icr.io/appcafe/ibm-semeru-runtimes:open-17-ea-jdk-ubi-amd64

...

whereas the other Dockerfiles reference architecture-agnostic base images, e.g.:

FROM ibm-semeru-runtimes:open-17-jre-focal

...
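A minimal sketch of how a test script could flag this kind of FROM line before building. The function name and the list of suffixes are assumptions made for illustration; nothing here comes from the aqa-tests or ci.docker code.

```python
import re

# Tag suffixes treated as "architecture-specific" in this sketch (an assumption).
ARCH_SUFFIXES = ("amd64", "x86_64", "arm64", "aarch64", "ppc64le", "s390x")

# Matches a Dockerfile FROM line and captures the image reference,
# skipping an optional --platform=... flag.
FROM_RE = re.compile(r"^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)", re.IGNORECASE)


def arch_specific_bases(dockerfile_text):
    """Return (line_number, image) for each FROM whose tag ends in an arch suffix."""
    hits = []
    for number, line in enumerate(dockerfile_text.splitlines(), start=1):
        match = FROM_RE.match(line)
        if not match:
            continue
        image = match.group(1)
        # Only the part after the last ':' is inspected; an image with no
        # tag is checked as a whole, which is harmless for this purpose.
        tag = image.rpartition(":")[2]
        if any(tag.endswith("-" + suffix) for suffix in ARCH_SUFFIXES):
            hits.append((number, image))
    return hits
```

Run against the two FROM lines quoted above, this would report the open-17-ea-jdk-ubi-amd64 image and leave open-17-jre-focal alone.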

ymanton commented 1 year ago

Re: exec /bin/sh: exec format error: the test seems to first build a Semeru image for the current platform and tag it as:

[2023-06-17T15:30:54.952Z] Successfully tagged localhost/local-ibm-semeru-runtimes:latest
[2023-06-17T15:30:55.393Z] 5f81856e35c2f07dde8b6df5734367f3583f5ce531d257cf50ce63011e926575

but when building the Liberty image we pull the amd64 image. I'm guessing we want to build the Liberty image on top of the local Semeru image, so we need to modify the Liberty image Dockerfile FROM line.

ymanton commented 1 year ago

The ub22 thread Checkpoint failed is the same problem, but it looks different because the Liberty image is somehow built successfully there; the x86-64 binaries are able to run on aarch64:

[2023-06-17T15:28:21.602Z] ++ uname -m
[2023-06-17T15:28:21.602Z] + ARCH=x86_64
[2023-06-17T15:28:21.602Z] + case "${ARCH}" in
[2023-06-17T15:28:21.602Z] + DUMB_INIT_URL=https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64
[2023-06-17T15:28:21.602Z] + DUMB_INIT_SHA256=e874b55f3279ca41415d290c512a7ba9d08f98041b28ae7c2acb19a545f1c4df
[2023-06-17T15:28:21.602Z] + curl -LfsSo /usr/bin/dumb-init https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64

I don't know how Docker on aarch64 works and how it can do the above, but once the image is built and we try to start the server it fails.
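A rough sketch of how this mismatch could be caught from the build log itself, by comparing the `+ ARCH=...` trace line from `set -eux` with the host architecture. The function names and the alias table are assumptions for illustration, not part of the test framework.

```python
import re

# Map `uname -m` style names onto the names used in the platform warning
# (linux/amd64, linux/arm64). The table is an assumption for this sketch.
ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
    "ppc64le": "ppc64le",
    "ppc64el": "ppc64le",
    "s390x": "s390x",
}

# Matches the shell trace line emitted for ARCH="$(uname -m)".
ARCH_TRACE_RE = re.compile(r"\+ ARCH=(\S+)")


def build_arch_from_log(log_text):
    """Return the architecture seen inside the image build, or None if not traced."""
    match = ARCH_TRACE_RE.search(log_text)
    return match.group(1) if match else None


def is_foreign_build(log_text, host_machine):
    """True when the build ran under a different architecture than the host."""
    build_arch = build_arch_from_log(log_text)
    if build_arch is None:
        return False
    normalized_build = ARCH_ALIASES.get(build_arch, build_arch)
    normalized_host = ARCH_ALIASES.get(host_machine, host_machine)
    return normalized_build != normalized_host
```

On a real machine `host_machine` could come from `platform.machine()`. For the log above on an aarch64 host, the check would flag the build as foreign.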

ymanton commented 1 year ago

Possibly the aarch64 tests are running on a Mac? Google tells me x86-64 Docker images can run on Macs via some emulation magic, so maybe that explains how the ub22 test got much further?

llxia commented 1 year ago

Details about PingPerf tests:

I think the issue is due to a recent Liberty change, the switch to multi-stage builds: https://github.com/OpenLiberty/ci.docker/commit/7aaf9c52f4aff99cf850d5fd37f83293afc773ed
The Liberty image Dockerfile now has two FROM lines:
https://github.com/OpenLiberty/ci.docker/blob/instanton/releases/latest/beta/Dockerfile.ubi.openjdk17#L1
https://github.com/OpenLiberty/ci.docker/blob/instanton/releases/latest/beta/Dockerfile.ubi.openjdk17#L27
Since we only replace the first FROM (icr.io/appcafe/ibm-semeru-runtimes:open-17-jdk-ubi), the second FROM still pulls icr.io/appcafe/ibm-semeru-runtimes:open-17-ea-jdk-ubi-amd64, which causes the test to fail.
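For illustration only, a sketch of replacing every Semeru FROM line in a multi-stage Dockerfile rather than just the first. This is not the actual test script; the function name and the `marker` default are assumptions, and the local tag used in the usage note is the one from the build log earlier in this thread.

```python
import re

# Splits a FROM line into: the "FROM [--platform=...]" prefix, the image
# reference, and whatever follows (for example " AS stageName").
FROM_RE = re.compile(r"^(\s*FROM\s+(?:--platform=\S+\s+)?)(\S+)(.*)$", re.IGNORECASE)


def replace_semeru_bases(dockerfile_text, new_base, marker="ibm-semeru-runtimes"):
    """Point every FROM whose image contains `marker` at `new_base`.

    Stage aliases after the image are kept. The result always ends with a
    newline, even if the input did not.
    """
    rewritten = []
    for line in dockerfile_text.splitlines():
        match = FROM_RE.match(line)
        if match and marker in match.group(2):
            line = match.group(1) + new_base + match.group(3)
        rewritten.append(line)
    return "\n".join(rewritten) + "\n"
```

Calling `replace_semeru_bases(text, "localhost/local-ibm-semeru-runtimes:latest")` would rewrite both stages, so neither one pulls the amd64 image.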

@ymanton do you know why Liberty uses two Semeru docker images?

tjwatson commented 1 year ago

@ymanton do you know why Liberty uses two Semeru docker images?

At one point we were using the Semeru EA images for the Liberty beta images. Now, with https://github.com/OpenLiberty/ci.docker/pull/412, we want the Liberty UBI beta images to be based on icr.io/appcafe/ibm-semeru-runtimes:open-17-jdk-ubi, but it looks like there is an issue with the second FROM: it also needs to be updated to use icr.io/appcafe/ibm-semeru-runtimes:open-17-jdk-ubi.

llxia commented 1 year ago

Once the Liberty Dockerfile is finalized, we will switch to the release version (not beta): https://github.com/OpenLiberty/ci.docker/blob/6b1c9dc9395ada7006da9fd0ebdb485602be98ea/releases/latest/full/Dockerfile.ubi.openjdk11

tjwatson commented 1 year ago

https://github.com/OpenLiberty/ci.docker/pull/412 has been updated to use the correct FROM (not to use the EA semeru image).

tjwatson commented 1 year ago
01:00:06.032  [ERROR   ] CWWKF0001E: A feature definition could not be found for checkpoint-1.0

Do you still see this error? Previously we auto-configured this feature in our beta images, but we should no longer be doing that. Does the server.xml file you are using configure the checkpoint-1.0 Liberty feature? If so, you can stop doing that once we have the Liberty GA for InstantOn; we removed the need for this feature in Liberty InstantOn GA.

llxia commented 1 year ago

Thanks @tjwatson. The PingPerf test passed with https://github.com/OpenLiberty/ci.docker/pull/412. Grinder Grinder

llxia commented 1 year ago

Hi @tjwatson, I noticed that https://github.com/OpenLiberty/ci.docker/pull/412 has been merged into the main branch. When will it be ported to the instanton branch? Or should we switch to using the main branch: https://github.com/OpenLiberty/ci.docker/blob/main/releases/latest/beta/Dockerfile.ubi.openjdk17?

tjwatson commented 1 year ago

Going forward, use the main branch. The instanton branch will be abandoned (and maybe removed) at some point.

llxia commented 1 year ago

This issue is resolved. Thanks, everyone!