OpenLiberty / open-liberty

Open Liberty is a highly composable, fast to start, dynamic application server runtime environment
https://openliberty.io
Eclipse Public License 2.0
1.15k stars 592 forks source link

app pod crash when running checkpoint app image on Power 10 OCP 4.13 #26226

Open tam512 opened 1 year ago

tam512 commented 1 year ago

Deploy checkpoint app image to Power10 OCP 4.13 cluster, when running the application, it failed and the app log has the following errors

Below is the full log:

% oc logs -f ebuy-olf-j17-0
Found mounted TLS certificates, generating keystore

[9/8/23, 22:56:20:221 UTC] 00000042 com.ibm.ws.kernel.launch.internal.FrameworkManager           A Launching defaultServer (Open Liberty 23.0.0.9/wlp-1.0.81.cl230920230905-0401) on Eclipse OpenJ9 VM, version 17.0.8+7 (en_US)
[9/8/23, 22:56:20:227 UTC] 00000042 com.ibm.ws.config.xml.internal.ConfigRefresher               A CWWKG0016I: Starting server configuration update.
[9/8/23, 22:56:20:228 UTC] 00000042 com.ibm.ws.config.xml.internal.ServerXMLConfiguration        A CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/defaults/keystore.xml
[9/8/23, 22:56:20:228 UTC] 00000042 com.ibm.ws.config.xml.internal.ServerXMLConfiguration        A CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/defaults/open-default-port.xml
[9/8/23, 22:56:20:229 UTC] 00000042 com.ibm.ws.config.xml.internal.ServerXMLConfiguration        A CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/overrides/truststore.xml
[9/8/23, 22:56:20:237 UTC] 00000042 com.ibm.ws.config.xml.internal.ConfigValidator               A CWWKG0102I: Found conflicting settings for defaultSSLConfig instance of ssl configuration.
  Property trustDefaultCerts has conflicting values:
    Value true is set in file:/opt/ol/wlp/usr/servers/defaultServer/server.xml.
    Value ${SEC_TLS_TRUSTDEFAULTCERTS} is set in file:/opt/ol/wlp/usr/servers/defaultServer/configDropins/overrides/truststore.xml.
  Property trustDefaultCerts will be set to ${SEC_TLS_TRUSTDEFAULTCERTS}.

[9/8/23, 22:56:20:259 UTC] 00000042 com.ibm.ws.http.internal.VirtualHostImpl                     A CWWKT0016I: Web application available (default_host): http://ebuy-olf-j17-0.ebuy-olf-j17-headless.ebuy-olf-java17.svc.cluster.local:9080/jwt/
[9/8/23, 22:56:20:260 UTC] 00000042 com.ibm.ws.http.internal.VirtualHostImpl                     A CWWKT0016I: Web application available (default_host): http://ebuy-olf-j17-0.ebuy-olf-j17-headless.ebuy-olf-java17.svc.cluster.local:9080/badapp/
[9/8/23, 22:56:20:261 UTC] 00000042 com.ibm.ws.http.internal.VirtualHostImpl                     A CWWKT0016I: Web application available (default_host): http://ebuy-olf-j17-0.ebuy-olf-j17-headless.ebuy-olf-java17.svc.cluster.local:9080/openapi/ui/
[9/8/23, 22:56:20:261 UTC] 00000042 com.ibm.ws.http.internal.VirtualHostImpl                     A CWWKT0016I: Web application available (default_host): http://ebuy-olf-j17-0.ebuy-olf-j17-headless.ebuy-olf-java17.svc.cluster.local:9080/eBuy-ext/
[9/8/23, 22:56:20:261 UTC] 00000042 com.ibm.ws.http.internal.VirtualHostImpl                     A CWWKT0016I: Web application available (default_host): http://ebuy-olf-j17-0.ebuy-olf-j17-headless.ebuy-olf-java17.svc.cluster.local:9080/health/
[9/8/23, 22:56:20:261 UTC] 00000042 com.ibm.ws.http.internal.VirtualHostImpl                     A CWWKT0016I: Web application available (default_host): http://ebuy-olf-j17-0.ebuy-olf-j17-headless.ebuy-olf-java17.svc.cluster.local:9080/metrics/
[9/8/23, 22:56:20:262 UTC] 00000042 com.ibm.ws.http.internal.VirtualHostImpl                     A CWWKT0016I: Web application available (default_host): http://ebuy-olf-j17-0.ebuy-olf-j17-headless.ebuy-olf-java17.svc.cluster.local:9080/svtMessageApp/
[9/8/23, 22:56:20:262 UTC] 00000042 com.ibm.ws.http.internal.VirtualHostImpl                     A CWWKT0016I: Web application available (default_host): http://ebuy-olf-j17-0.ebuy-olf-j17-headless.ebuy-olf-java17.svc.cluster.local:9080/microwebapp/
[9/8/23, 22:56:20:262 UTC] 00000042 com.ibm.ws.http.internal.VirtualHostImpl                     A CWWKT0016I: Web application available (default_host): http://ebuy-olf-j17-0.ebuy-olf-j17-headless.ebuy-olf-java17.svc.cluster.local:9080/eBuy/
[9/8/23, 22:56:20:262 UTC] 00000046 com.ibm.ws.config.xml.internal.ConfigRefresher               A CWWKG0017I: The server configuration was successfully updated in 0.036 seconds.
[9/8/23, 22:56:20:262 UTC] 00000042 com.ibm.ws.http.internal.VirtualHostImpl                     A CWWKT0016I: Web application available (default_host): http://ebuy-olf-j17-0.ebuy-olf-j17-headless.ebuy-olf-java17.svc.cluster.local:9080/openapi/
[9/8/23, 22:56:20:262 UTC] 00000042 com.ibm.ws.http.internal.VirtualHostImpl                     A CWWKT0016I: Web application available (default_host): http://ebuy-olf-j17-0.ebuy-olf-j17-headless.ebuy-olf-java17.svc.cluster.local:9080/ibm/api/
[9/8/23, 22:56:20:276 UTC] 0000004b com.ibm.ws.security.token.ltpa.internal.LTPAKeyCreateTask    I CWWKS4105I: LTPA configuration is ready after 0.020 seconds.
[9/8/23, 22:56:20:278 UTC] 00000042 io.openliberty.checkpoint.internal.CheckpointImpl            A CWWKC0452I: The Liberty server process resumed operation from a checkpoint in 0.110 seconds.
[9/8/23, 22:56:20:280 UTC] 00000042 com.ibm.ws.app.manager.AppMessageHelper                      A CWWKZ0001I: Application svtMessageApp-2.0.2 started in 0.112 seconds.
[9/8/23, 22:56:20:281 UTC] 00000042 com.ibm.ws.app.manager.AppMessageHelper                      A CWWKZ0001I: Application microwebapp-2.0.1 started in 0.112 seconds.
[9/8/23, 22:56:20:281 UTC] 00000042 com.ibm.ws.app.manager.AppMessageHelper                      A CWWKZ0001I: Application badapp-2.0.1 started in 0.113 seconds.
[9/8/23, 22:56:20:281 UTC] 00000042 com.ibm.ws.app.manager.AppMessageHelper                      A CWWKZ0001I: Application ebuy started in 0.113 seconds.
[9/8/23, 22:56:20:282 UTC] 00000042 com.ibm.ws.tcpchannel.internal.TCPPort                       I CWWKO0219I: TCP Channel defaultHttpEndpoint has been started and is now listening for requests on host *  (IPv6) port 9080.
[9/8/23, 22:56:20:292 UTC] 00000042 com.ibm.ws.kernel.feature.internal.FeatureManager            A CWWKF0012I: The server installed the following features: [appSecurity-4.0, cdi-3.0, concurrent-2.0, distributedMap-1.0, enterpriseBeansLite-4.0, expressionLanguage-4.0, jdbc-4.2, jndi-1.0, json-1.0, jsonb-2.0, jsonp-2.0, jwt-1.0, microProfile-5.0, monitor-1.0, mpConfig-3.0, mpFaultTolerance-4.0, mpHealth-4.0, mpJwt-2.0, mpMetrics-4.0, mpOpenAPI-3.0, mpOpenTracing-3.0, mpRestClient-3.0, pages-3.0, persistence-3.0, persistenceContainer-3.0, restfulWS-3.0, restfulWSClient-3.0, servlet-5.0, ssl-1.0, transportSecurity-1.0, xmlBinding-3.0].
[9/8/23, 22:56:20:293 UTC] 00000042 com.ibm.ws.kernel.feature.internal.FeatureManager            I CWWKF0008I: Feature update completed in 0.125 seconds.
[9/8/23, 22:56:20:293 UTC] 00000042 com.ibm.ws.kernel.feature.internal.FeatureManager            A CWWKF0011I: The defaultServer server is ready to run a smarter planet. The defaultServer server started in 0.125 seconds.
[9/8/23, 22:56:20:393 UTC] 0000003e com.ibm.ws.ssl.config.WSKeyStore                             I Successfully loaded default keystore: /opt/ol/wlp/output/defaultServer/resources/security/key.p12 of type: PKCS12
[9/8/23, 22:56:20:400 UTC] 0000003e com.ibm.ws.security.mp.jwt.impl.MicroProfileJwtServiceImpl   I CWWKS5500I: The MicroProfile JWT configuration [MicroProfileJwtService] was successfully processed.
[9/8/23, 22:56:20:400 UTC] 0000003e com.ibm.ws.security.mp.jwt.impl.MicroProfileJwtConfigImpl    I CWWKS5500I: The MicroProfile JWT configuration [defaultMpJwt] was successfully processed.
[9/8/23, 22:56:20:408 UTC] 00000071 com.ibm.ws.tcpchannel.internal.TCPPort                       I CWWKO0219I: TCP Channel defaultHttpEndpoint-ssl has been started and is now listening for requests on host *  (IPv6) port 9443.
Unhandled exception
Type=Segmentation error vmState=0x0002000f
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007FFFB26709D0 Handler2=00007FFFB2598B00
R0=00007FFFB0BC76FC R1=00007FFF5141B580 R2=00007FFFB0EF5500 R3=00007FFF5141BEA0
R4=0000000000000000 R5=00007FFF5141B948 R6=0000000000000000 R7=0000000000000000
R8=0000000003C20000 R9=00007FFFAC078CB0 R10=00007FFFB0EE7E48 R11=0000000000000000
R12=00007FFFB0D61340 R13=00007FFF514268E0 R14=00007FFFAC0A9CF8 R15=00007FFF51390000
R16=00007FFFB30A4410 R17=00007FFFB30A4420 R18=00007FFEA01422B0 R19=00007FFFAC0A9CF0
R20=00007FFFB252CA30 R21=00007FFEA0142718 R22=0000000000000000 R23=00007FFFB2D51F50
R24=0000000000000100 R25=0000000000000000 R26=0000000000000001 R27=0000000000000000
R28=00007FFF5141BA58 R29=0000000000000000 R30=0000000000000000 R31=00007FFF5141BEA0
NIP=00007FFFB0D61364 MSR=800000000280D033 ORIG_GPR3=00007FFFB0BBC180 CTR=00007FFFB0D61340
LINK=00007FFFB0BBC184 XER=0000000000000000 CCR=0000000042244222 SOFTE=0000000000000001
TRAP=0000000000000300 DAR=0000000000000000 dsisr=0000000040000000 RESULT=0000000000000000
FPR0 00007fffb2d50340 (f: 3000304384.000000, d: 6.953292e-310)
FPR1 4054dd1120000000 (f: 536870912.000000, d: 8.345417e+01)
FPR2 3ff0000000000000 (f: 0.000000, d: 1.000000e+00)
FPR3 3fee666660000000 (f: 1610612736.000000, d: 9.500000e-01)
FPR4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR5 bfe7154748bef6c8 (f: 1220474624.000000, d: -7.213475e-01)
FPR6 3fe62e42fefa39ef (f: 4277811712.000000, d: 6.931472e-01)
FPR7 bfa252f438e10c1e (f: 954272768.000000, d: -3.578914e-02)
FPR8 bfdffffef20a4123 (f: 4060758272.000000, d: -4.999997e-01)
FPR9 bfd00ea348b88334 (f: 1220051712.000000, d: -2.508934e-01)
FPR10 1ff86ff21ff86fef (f: 536375264.000000, d: 1.139135e-154)
FPR11 0000001a002f0235 (f: 3080757.000000, d: 5.517341e-313)
FPR12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR13 0202000000e00202 (f: 14680578.000000, d: 5.375598e-299)
FPR14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR16 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR17 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR18 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR19 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR20 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR21 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR22 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR23 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR24 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR25 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR26 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR27 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR28 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR29 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR30 0000000000000000 (f: 0.000000, d: 0.000000e+00)
FPR31 0000000000000000 (f: 0.000000, d: 0.000000e+00)
Module=/opt/java/openjdk/lib/default/libj9gc29.so
Module_base_address=00007FFFB0B70000
Target=2_90_20230718_506 (Linux 5.14.0-284.25.1.el9_2.ppc64le)
CPU=ppc64le (16 logical CPUs) (0xf9b90000 RAM)
----------- Stack Backtrace -----------
protectedBacktrace+0x24 (0x00007FFFB2592F14 [libj9prt29.so+0x32f14])
omrsig_protect+0x3f4 (0x00007FFFB2599FD4 [libj9prt29.so+0x39fd4])
omrintrospect_backtrace_thread_raw+0xe0 (0x00007FFFB2593530 [libj9prt29.so+0x33530])
protectedIntrospectBacktraceThread+0x24 (0x00007FFFB2592B54 [libj9prt29.so+0x32b54])
omrsig_protect+0x3f4 (0x00007FFFB2599FD4 [libj9prt29.so+0x39fd4])
omrintrospect_backtrace_thread+0x94 (0x00007FFFB2592C64 [libj9prt29.so+0x32c64])
generateDiagnosticFiles+0xd4 (0x00007FFFB26703A4 [libj9vm29.so+0x403a4])
omrsig_protect+0x3f4 (0x00007FFFB2599FD4 [libj9prt29.so+0x39fd4])
structuredSignalHandler+0x194 (0x00007FFFB2670704 [libj9vm29.so+0x40704])
structuredSignalHandlerVM+0x58 (0x00007FFFB2670A28 [libj9vm29.so+0x40a28])
mainSynchSignalHandler+0x294 (0x00007FFFB2598D94 [libj9prt29.so+0x38d94])
 (0x000000000CB70464 [<unknown>+0x0])
_ZN23MM_ScavengerRootScanner11doStackSlotEPP8J9ObjectPvPKv+0xd4 (0x00007FFFB0D61414 [libj9gc29.so+0x1f1414])
_Z17stackSlotIteratorP8J9JavaVMPP8J9ObjectPvP16J9StackWalkStatePKv+0xd4 (0x00007FFFB0BBC184 [libj9gc29.so+0x4c184])
vmThreadStackDoOSlotIterator+0x3c (0x00007FFFB0BC76FC [libj9gc29.so+0x576fc])
jitWalkFrame+0x1d0 (0x00007FFFB1B4D4A0 [libj9jit29.so+0xc4d4a0])
jitWalkStackFrames+0xf40 (0x00007FFFB1B4EC10 [libj9jit29.so+0xc4ec10])
walkStackFrames+0xf4 (0x00007FFFB26C9E44 [libj9vm29.so+0x99e44])
_ZN28GC_VMThreadStackSlotIterator9scanSlotsEP10J9VMThreadS1_PvPFvP8J9JavaVMPP8J9ObjectS2_P16J9StackWalkStatePKvEbb+0x5c (0x00007FFFB0BC788C [libj9gc29.so+0x5788c])
_ZN14MM_RootScanner13scanOneThreadEP18MM_EnvironmentBaseP10J9VMThreadPv+0x144 (0x00007FFFB0BBB004 [libj9gc29.so+0x4b004])
_ZN14MM_RootScanner11scanThreadsEP18MM_EnvironmentBase+0x114 (0x00007FFFB0BBA114 [libj9gc29.so+0x4a114])
_ZN14MM_RootScanner9scanRootsEP18MM_EnvironmentBase+0x80 (0x00007FFFB0BBDEC0 [libj9gc29.so+0x4dec0])
_ZN12MM_Scavenger24workThreadGarbageCollectEP22MM_EnvironmentStandard+0x284 (0x00007FFFB0D59904 [libj9gc29.so+0x1e9904])
_ZN23MM_ParallelScavengeTask3runEP18MM_EnvironmentBase+0x1c (0x00007FFFB0DD250C [libj9gc29.so+0x26250c])
_ZN21MM_ParallelDispatcher16workerEntryPointEP18MM_EnvironmentBase+0x278 (0x00007FFFB0CF30B8 [libj9gc29.so+0x1830b8])
_Z23dispatcher_thread_proc2P14OMRPortLibraryPv+0x160 (0x00007FFFB0CF1760 [libj9gc29.so+0x181760])
omrsig_protect+0x3f4 (0x00007FFFB2599FD4 [libj9prt29.so+0x39fd4])
dispatcher_thread_proc+0x50 (0x00007FFFB0CF1270 [libj9gc29.so+0x181270])
thread_wrapper+0x190 (0x00007FFFB252CBC0 [libj9thr29.so+0xcbc0])
start_thread+0xf8 (0x00007FFFB3069678 [libpthread-2.28.so+0x9678])
clone+0x74 (0x00007FFFB2F58738 [libc-2.28.so+0x138738])
---------------------------------------
JVMDUMP039I Processing dump event "gpf", detail "" at 2023/09/08 22:57:26 - please wait.
JVMDUMP032I JVM requested System dump using '/opt/ol/wlp/output/defaultServer/core.20230908.225726.1030.0001.dmp' in response to an event
JVMPORT030W /proc/sys/kernel/core_pattern setting "|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" specifies that the core dump is to be piped to an external program.  Attempting to rename either core or core.229.  Review the manual for the external program to find where the core dump is written and ensure the program does not truncate it.

JVMPORT049I The core file created by child process with pid = 229 was not found. Review the documentation for the /proc/sys/kernel/core_pattern program "|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" to find where the core file is written and ensure that program does not truncate it.

JVMDUMP012E Error in System dump: /opt/ol/wlp/output/defaultServer/core.20230908.225726.1030.0001.dmp
JVMDUMP027W The requested heapdump has not been produced because another component is holding the VM exclusive lock.
JVMDUMP032I JVM requested Java dump using '/opt/ol/wlp/output/defaultServer/javacore.20230908.225726.1030.0003.txt' in response to an event
JVMDUMP012E Error in Java dump: /opt/ol/wlp/output/defaultServer/javacore.20230908.225726.1030.0003.txt
JVMDUMP032I JVM requested Snap dump using '/opt/ol/wlp/output/defaultServer/Snap.20230908.225726.1030.0004.trc' in response to an event

% oc exec ebuy-olf-j17-0 -- ls -l /opt/ol/wlp/output/defaultServer/ total 0 drwxrwx---. 1 default root 19 Sep 8 21:52 logs drwxrwx---. 1 default root 22 Sep 8 21:52 resources drwxrwx---. 1 default root 100 Sep 8 22:57 workarea %

JasonFengJ9 commented 1 year ago

The native stack trace seems similar to

tajila commented 1 year ago

@tam512 is it reproduceable? if you have the diagnostic files (system core) it will help us investigate the issue.

I sugeest doing something like sudo bash -c 'echo "core" > /proc/sys/kernel/core_pattern' on your machine so the core file can be produced.

tam512 commented 1 year ago

refer to eclipse openj9 issue https://github.com/eclipse-openj9/openj9/issues/18115

To get system core files for this defect, follow https://eclipse.dev/openj9/docs/xdump/#piped-system-dumps to pipe core dumps to the worker node using systemd-coredump

Notes that there are 2 known openshift issues when copying the core files to local mcahine: One is 1358 that prevent copying file with \. Two is error: unexpected EOF when copying directory so need to run use oc cp --retries=-1 .... For example: oc cp --retries=-1 worker0-xxx-debug:/host/var/lib/systemd/coredump/ coredump

ymanton commented 1 year ago

This is a bug in the JVM, specific to Power 10, and not related to InstantOn.

vijaysun-omr commented 1 year ago

It seems a fix for this problem was merged into the Semeru development stream for their next release a few weeks back (there was a failure in a scenario unrelated to InstantOn). The next question would be how Liberty SVT plans to obtain a fixed Semeru build to verify that this works in the environment being tested.