adoptium / aqa-tests

Home of test infrastructure for Adoptium builds
https://adoptium.net/aqavit
Apache License 2.0
130 stars 310 forks source link

AIX: Grinder target cmdLineTester_jvmtitests_5 fails because dependent shared libraries are not found #2264

Open aixtools opened 3 years ago

aixtools commented 3 years ago

Describe the bug This test fails to load - and grinder reports unstable as result.

The java executable (or .jar code) defeat the normal AIX rtld behavior.

JVMJ9TI001E Agent library jvmtitest could not be opened (Could not load module .
System error: No such file or directory)
JVMJ9VM015W Initialization error for library j9jvmti29(-3): JVMJ9VM009E J9VMDllMain failed

Note: JAVA messages - with an AIX rtld message embedded - see below for a pure AIX example of rtld messaging

To Reproduce Steps to reproduce the behavior (or Grinder rerun link): The key command - from a Grinder job - with workspace saved: "/home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/bin/java" -Xgcpolicy:metronome -Xcompressedrefs -Xdump -agentlib:jvmtitest=test:rnwr001 -cp "/home/jenkins/workspace/Grinder/openjdk-tests/TKG/../../jvmtest/functional/cmdLineTests/jvmtitests/jvmtitest.jar" com.ibm.jvmti.tests.util.TestRunner

Expected behavior That the test runs and finishes with status 0 Screenshots If applicable, add screenshots to help explain your problem.

Additional context To assist with understanding the error message above:

  1. JVMJ9TI001E Agent library jvmtitest could not be opened: This bit is from java - what follows (#2 and 3) is the AIX rtld messages - note: jvmitest is a .jar file
  2. (Could not load module .`: specifically - a module named '.' is not loading - this is probably the error - wrong argument in something.
  3. System error: No such file or directory): the error at OS level
  4. JVMJ9VM015W Initialization error for library j9jvmti29(-3): (java message)
  5. JVMJ9VM009E J9VMDllMain failed` - multiple messages: JVM messages - related to the shared library j9jvmti29 (libj9jvmti29.so). I am guessing the java routine J9VMDIMain (in C main() ?) does not load - because it could not load the Agent (.jar)

To further help reading: an example using a simple application (/usr/bin/python)

PYTHON has the following library dependency:

/usr/bin/python needs:
         /opt/freeware/lib/libpython2.7.so
         /opt/freeware/lib/libexpat.a(libexpat.so.1)
         /opt/freeware/lib/libffi.a(libffi.so.6)
         /opt/freeware/lib/libncurses.so
         /opt/freeware/lib/libsqlite3.so
         /opt/freeware/lib/libstdc++.a(libstdc++.so.6)
         /opt/freeware/lib/libgcc_s.a(shr.o)
         /usr/lib/libc.a(shr.o)
         /usr/lib/librtl.a(shr.o)
         /usr/lib/libdl.a(shr.o)
         /usr/lib/libpthreads.a(shr_comm.o)
         /usr/lib/libpthreads.a(shr_xpg5.o)
         /usr/lib/libpthread.a(shr_xpg5.o)
         /unix
         /usr/lib/libcrypt.a(shr.o)

The first library it needs is /opt/freeware/lib/libpython2.7.so. It's dependencies are:

/opt/freeware/lib/libpython2.7.so needs:
         /usr/lib/libdl.a(shr.o)
         /opt/freeware/lib/libexpat.a(libexpat.so.1)
         /opt/freeware/lib/libffi.a(libffi.so.6)
         /opt/freeware/lib/libncurses.so
         /opt/freeware/lib/libsqlite3.so
         /usr/lib/libpthreads.a(shr_comm.o)
         /usr/lib/libpthreads.a(shr_xpg5.o)
         /usr/lib/libc.a(shr.o)
         /usr/lib/librtl.a(shr.o)
         /opt/freeware/lib/libgcc_s.a(shr.o)
         /usr/lib/libpthread.a(shr_xpg5.o)
         /unix
         /usr/lib/libcrypt.a(shr.o)

To get a similar message as this test - rather than rename libpython2.7.so I rename one of it's dependencies - and then try to start python.

# mv          /opt/freeware/lib/libgcc_s.a          /opt/freeware/lib/libgcc_s.a.save
# python
Could not load program python:
        Dependent module libgcc_s.a(shr.o) could not be loaded.
Could not load module libgcc_s.a(shr.o).
System error: No such file or directory

Note: this message is known to IBM - but for a differnt case:

JVMJ9TI001E Agent library am_ibm_16 could not be opened (Could not load module .
System error: No such file or directory)
JVMJ9VM015W Initialization error for library j9jvmti29(-3): JVMJ9VM009E J9VMDllMain failed

My first thoughts is that it can be resolved by xml files - currently not available in the test directories:

3) The last place to check is the server.xml file itself.   Look at the genericJvmarguments line in the file and check the -Xbootclasspath/p:${ITCAMDCHOME}/toolkit/lib/bcm-bootstrap.jar line.
4) Lastly, check the variables.xml file.  In this file, specifically check the  AM_CONFIG_JVM_ARGS variable.  In the value section, check the -Xbootclasspath and the agentlib parms specified here.

Or, perhaps an additional argument - something like -Xbootclasspath/p:/home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/jre/lib/ppc64/compressedrefs/ (I do not know java - so this may include a syntax error)

smlambert commented 3 years ago

Thanks for the report @aixtools, I will look at how the cmdLineTester_jvmtitests_5 test is configured to see what our options are for changing test config.

smlambert commented 3 years ago

Rerunning cmdLineTester_jvmtitests_5 in a Grinder to get a fresh set of results: jdk8 in Grinder/7235 - fails as described above jdk11 in Grinder/7236 - passes

smlambert commented 3 years ago

Interesting, based on Grinders above, this test passed in the jdk11 run on test-osuosl-aix72-ppc64-1, but failed in the jdk8 run on same machine.

aixtools commented 3 years ago

Yes - very interesting. I'll try as well - with save workspace - and run my traces/trcrpt to see what OS behavioral changes I can find.

aixtools commented 3 years ago

Placeholders: https://ci.adoptopenjdk.net/job/Grinder/7290/ (jdk8) and https://ci.adoptopenjdk.net/job/Grinder/7291/ (jdk11)

Marked with 'Keep Build' so the logs don't disappear so quickly.

Note: remember to release when finished. Forever, is a very long time :)

aixtools commented 3 years ago

I looked at your console outputs - at the surface level the commands seem the same - so what does changing jdk version change in terms of what actually get executed? Helps to know what/where to look.

smlambert commented 3 years ago

Okay it actually looks like there is no native testimage package for jdk8 being found and downloaded via API (and therefore the message, Agent library jvmtitest could not be opened (Could not load module . System error: No such file or directory)

jdk11 case:

18:16:30  + ./openjdk-tests/get.sh -s /home/jenkins/workspace/Grinder -t /home/jenkins/workspace/Grinder/openjdk-tests -p ppc64_aix -r nightly -j 11 -i openj9 --openj9_repo https://github.com/eclipse/openj9.git --openj9_branch master --tkg_repo https://github.com/AdoptOpenJDK/TKG.git --tkg_branch master
18:16:30  TESTDIR: /home/jenkins/workspace/Grinder/openjdk-tests
18:16:30  get jdk binary...
18:16:30  _ENCODE_FILE_NEW=UNTAGGED curl -OLJSks  https://api.adoptopenjdk.net/v3/binary/latest/11/ea/aix/ppc64/jdk/openj9/normal/adoptopenjdk
18:16:40  _ENCODE_FILE_NEW=UNTAGGED curl -OLJSks  https://api.adoptopenjdk.net/v3/binary/latest/11/ea/aix/ppc64/testimage/openj9/normal/adoptopenjdk
18:16:41  Uncompressing file: OpenJDK11U-jdk_ppc64_aix_openj9_2021-02-17-04-54.tar.gz ...
18:16:53  Uncompressing file: OpenJDK11U-testimage_ppc64_aix_openj9_2021-02-16-18-06.tar.gz ...
18:16:55  Run /home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/bin/java -version
...

jdk8 case:

17:55:59  + ./openjdk-tests/get.sh -s /home/jenkins/workspace/Grinder -t /home/jenkins/workspace/Grinder/openjdk-tests -p ppc64_aix -r nightly -j 8 -i openj9 --openj9_repo https://github.com/eclipse/openj9.git --openj9_branch master --tkg_repo https://github.com/AdoptOpenJDK/TKG.git --tkg_branch master
17:55:59  TESTDIR: /home/jenkins/workspace/Grinder/openjdk-tests
17:55:59  get jdk binary...
17:55:59  _ENCODE_FILE_NEW=UNTAGGED curl -OLJSks  https://api.adoptopenjdk.net/v3/binary/latest/8/ea/aix/ppc64/jdk/openj9/normal/adoptopenjdk
17:56:04  Uncompressing file: OpenJDK8U-jdk_ppc64_aix_openj9_2021-02-16-06-25.tar.gz ...
17:56:12  Run /home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/bin/java -version
...
smlambert commented 3 years ago

We need to look and see why the native test image for jdk8 AIX is not found, is it being built?

From the build job (example: Build 980), looks like they are built.

Screen Shot 2021-02-26 at 11 38 43 AM

Next thing to check, are they being moved correctly to the releases repo?

smlambert commented 3 years ago

testimages for jdk8 j9 aix are on the website and so are being moved to releases repo presumably

Screen Shot 2021-02-26 at 12 48 01 PM

Our test script will attempt to download the jdk binary, and the testimage and debug packages by default, next check, is this broken for this particular case? Or does the API not serve up the testimage package for some reason?

aixtools commented 3 years ago

Well, this seems to explain why the test fails. A giant step forward.

sophia-guo commented 3 years ago

Looks like originally testimage is not available for jdk8 https://github.com/AdoptOpenJDK/openjdk-tests/blob/master/get.sh#L225 https://github.com/AdoptOpenJDK/openjdk-tests/pull/1661. Need to update to support testimage for jdk8.

smlambert commented 3 years ago

https://github.com/AdoptOpenJDK/openjdk-tests/pull/2320 is now merged (and fixes the originally reported issue). Verifying via this Grinder: https://ci.adoptopenjdk.net/job/Grinder/7422

It may be that there is a new different issue for this test target, but that should get raised separately.

smlambert commented 3 years ago

We now have the the native test images, but we see that the test causes a crash based on output from Grinder/7422, not investigated why or if it has been seen at ci.eclipse.org/openj9 (@pshipton have you seen this before?):

----------- Stack Backtrace -----------
(0x09000000105B247C [libj9vm29.so+0x7d47c])
runCallInMethod+0x224 (0x09000000105A7608 [libj9vm29.so+0x72608])
gpProtectedRunCallInMethod__FPv+0x48 (0x0900000010596E4C [libj9vm29.so+0x61e4c])
signalProtectAndRunGlue+0x28 (0x090000001058E5AC [libj9vm29.so+0x595ac])
omrsig_protect+0x488 (0x090000000743DE4C [libj9prt29.so+0x59e4c])
gpProtectAndRun+0xf8 (0x090000001058E6FC [libj9vm29.so+0x596fc])
gpCheckCallin+0x114 (0x0900000010596DB8 [libj9vm29.so+0x61db8])
callStaticVoidMethod+0x48 (0x09000000105F708C [libj9vm29.so+0xc208c])
(0x0000010000003FD0 [java+0x3fd0])
_pthread_body+0xe8 (0x090000000059DFEC [libpthread.a+0x3fec])
---------------------------------------
pshipton commented 3 years ago

I don't see any matches.

JasonFengJ9 commented 3 years ago

There is an internal workitem (RTC 145045: [aix_ppc-64_cmprssptrs] SE80_GIT J9vmTest_2 crash -Xlp16G Requested page size in GC logs matches with page size in -verbose:sizes output) , which is a regression being investigated but probably introduced a while ago just occurred more recently.

----------- Stack Backtrace -----------
(0x0900000079E2B07C [libj9vm29.so+0x6b07c])
runCallInMethod+0x238 (0x0900000079E21B3C [libj9vm29.so+0x61b3c])
gpProtectedRunCallInMethod__FPv+0x48 (0x0900000079E11F6C [libj9vm29.so+0x51f6c])
signalProtectAndRunGlue+0x28 (0x0900000079E0B9AC [libj9vm29.so+0x4b9ac])
(0x090000007A0A38A8 [libj9prt29.so+0x528a8])
gpProtectAndRun+0xf8 (0x0900000079E0BAFC [libj9vm29.so+0x4bafc])
gpCheckCallin+0x114 (0x0900000079E11ED8 [libj9vm29.so+0x51ed8])
callStaticObjectMethod+0x54 (0x0900000079E62FB8 [libj9vm29.so+0xa2fb8])
LoadMainClass+0x10c (0x0000010000005CB0 [java+0x5cb0])
JavaMain+0x454 (0x0000010000004718 [java+0x4718])
_pthread_body+0xf0 (0x0900000000515D34 [libpthreads.a+0x3d34])

Note: the internal test failure is at J9vmTest_2 while the failure in this issue is cmdLineTester_jvmtitests_5.

JasonFengJ9 commented 3 years ago

Actually I think they might be related

===============================================
Running test cmdLineTester_jvmtitests_5 ...
===============================================
cmdLineTester_jvmtitests_5 Start Time: Wed Mar  3 19:00:44 2021 Epoch Time (ms): 1614819644119
variation: Mode351
JVM_OPTIONS:  -Xgcpolicy:metronome -Xcompressedrefs 

This failure is -Xgcpolicy:metronome while the internal RTC 145045 investigation indicated that it only occurs in this GC policy (along with another corner condition) as well.

pshipton commented 3 years ago

The problem Jason refers to (145045) was introduced Feb 17th by https://github.com/eclipse/openj9/pull/11305 and/or https://github.com/eclipse/omr/pull/5652. It would be interesting to see the result of java -verbose:sizes on the machine where it occurs, as we've only seen the problem on a machine where 16G pages are listed as an available size, like follows:

  -Xlp:objectheap:pagesize=64K   large page size
                  available large page sizes:
                  4K
                  64K
                  16M
                  16G
pshipton commented 3 years ago

145045 also only occurs when -Xlp16G is used, but perhaps this is a different manifestation of the same problem.

@zl-wang fyi