Feedback on test reproduction quirks for the test team

fjeremic commented 5 years ago

While attempting to launch Grinders and reproduce #4526 locally I kept notes of some of the quirks I encountered or issues I stumbled upon. Some of these have been detailed in the various documentation, others are unspecified. I hope this feedback can be used to improve our documentation and/or processes for debugging:

Pain Points:

It is not clear from a failed test what JVM command line options were used
- Example: https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-linux_390-64_cmprssptrs/181/
- As a result it is not clear how to add EXTRA_OPTIONS or JVM_OPTIONS to a Grinder
For a particular test failure it is non-obvious to which test bucket the test belongs to
- Is it functional? Is it systemtest? Is it some Adopt test?
- Other than going to the ~4 different repos and searching for the test name is there a better way to know?
- Need to know this so one can fill in the BUILD_LIST in the Grinder
From a test failure looking at the java -version output it is not clear where to download the JDK
- For example I know the SHA numbers and the build date and number too "20190130_208 (JIT enabled, AOT enabled)" but where do I download this build?
- Can see the curl command from the Grinder so I can find the JDK from there but why shouldn't we be able to extract the build ID from java -version somehow? If someone just pasted me the java -version output I would have no idea how to grab that same build, and that is a problem.
STF tests don't show crash information in the console output
- Have to dig through the artifacts to find it always which is time consuming
Why do we need to export JDK_VERSION when running tests? Should we be able to determine that from $JAVA_BIN/java -version output?
When attempting to reproduce [1] locally the make compile command from instructions from [2] seems to compile all tests (sometimes comping ~6000 java source files) however a Grinder launched for the same test seems only compile and run the one specific test [3]. Why is that? How can I locally do thet same thing as the Grinder? i.e. I only want to compile and run my one test I care about.

[1] https://github.com/eclipse/openj9/issues/4526 [2] https://github.com/eclipse/openj9/wiki/Reproducing-Test-Failures-Locally#run-sanity-system-tests-on-jdk10_x86-64_linux_openj9-sdk [3] https://hyc-runtimes-jenkins.swg-devops.com/view/Test_grinder/job/Grinder/1363/consoleText

Issues:

https://github.com/eclipse/openj9/wiki/Reproducing-Test-Failures-Locally#general-steps
- Cannot run ./get.sh without setting JAVA_BIN first which seems to be step 6 so the instructions need to get updated
  - System tests also seem to export JAVA_HOME to be ../../ from JAVA_BIN, so it means JAVA_BIN has to be the "jre/bin" directory, bot the "bin" directory. This is non-obvious.
- Last step make _sanity.system does not work due to class loader net.adoptopenjdk.stf.runner.StfClassLoader exception being thrown. Exporting both JAVA_BIN and JAVA_HOME does not seem to work on s390 Linux as implied by other people.

GEN stderr Exception in thread "main" java/lang/Error: java.lang.ClassNotFoundException: net.adoptopenjdk.stf.runner.StfClassLoader
GEN stderr      at java/lang/ClassLoader.getSystemClassLoader (ClassLoader.java:781)
GEN stderr      at java/lang/Thread.completeInitialization (Thread.java:166)
GEN stderr      at java/lang/J9VMInternals.completeInitialization (J9VMInternals.java:72)
Generation failed

The issue seems to be that System tests define JAVA_HOME themselves by exporting $JAVA_BIN/../../, however the instructions in [1] specify JAVA_BIN should be /someLocation/bin, which appears to be incorrect as the instructions state to download/unpack the SDK to /someLocation.

System tests seem to expect JAVA_BIN to be /someLocation/jre/bin, not /someLocation/bin. After changing this, rerunning make -f run_configure.mk and make compile things seem to work now.

Instructions in AdoptOpenJDK seem to imply one can run make SharedClassesAPI however this results in errors
- https://github.com/AdoptOpenJDK/openjdk-tests/blob/master/doc/userGuide.md#sub-targets-by-test-name
- It seems one must prefix things with an underscore "_" and add a "_0" suffix, so make _SharedClassesAPI_0?

General Feedback:

Tests seem to run in huge buckets per one Jenkins job as opposed to much smaller buckets per Jenkins job. This makes re-running a test tedious and involves a lot of manual work as opposed to VMFarm-esque clicking a "Re-run on Grinder" button and launching a reproduction batch.
Grinder tests are sequential which is very time consuming when it comes to reproducing issues which are intermittent (1/50 failures take several hours as opposed to a few minutes to reproduce)
- Sometimes a failure occurs in the middle of a grinder, say the 5th job out of 50. Is there a way to "kill" the grinder and just get the data for the failure at that point without running through the other 45 iterations?
Test material seems scattered all over the place and it is not easy to find tests
- Each test repo seems to have it's own set of instructions and quirks for running locally or running in Grinders
- For example the following two (and I'm sure there are more too)
  - https://github.com/eclipse/openj9/tree/master/test#quick-start
  - https://github.com/AdoptOpenJDK/openjdk-tests/blob/master/doc/userGuide.md#local-testing-via-make-targets-on-the-commandline
Getting machine access is non-trivial (impossible?) which makes reproducing issues which only appear to happen on farm machines very difficult

fjeremic commented 5 years ago

@smlambert @llxia FYI. I'm more than happy to help with any of the above.

pshipton commented 5 years ago

For a particular test failure it is non-obvious to which test bucket the test belongs to

It is in the link i.e. https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-linux_390-64_cmprssptrs/181/ extended.system indicates system testing.

but where do I download this build?

You need to look at the parent job(s) of the test failure, find the build job for the platform, and the jvm is an artifact of the job. It is only available for a short time, depending on how many other builds are run, as we have limited space.

llxia commented 5 years ago

I will try to clarify some of the questions.

It is not clear from a failed test what JVM command line options were used Example: https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-linux_390-64_cmprssptrs/181/

TKG does print info at beginning of the tests with JVM_OPTIONS:

===============================================
Running test SharedClassesAPI_0 ...
===============================================
SharedClassesAPI_0 Start Time: Thu Jan 31 03:14:37 2019 Epoch Time (ms): 1548922477290
variation: NoOptions
JVM_OPTIONS:  -Xcompressedrefs

As a result it is not clear how to add EXTRA_OPTIONS or JVM_OPTIONS to a Grinder

It is documented in: https://github.com/AdoptOpenJDK/openjdk-tests/wiki/How-to-Run-a-Grinder-Build-on-Jenkins https://github.com/eclipse/openj9/blob/master/test/docs/OpenJ9TestUserGuide.md

For a particular test failure it is non-obvious to which test bucket the test belongs to Is it functional? Is it systemtest? Is it some Adopt test? Other than going to the ~4 different repos and searching for the test name is there a better way to know? Need to know this so one can fill in the BUILD_LIST in the Grinder

The information is in the job name. For example, Test-extended.functional-JDK11-linux_x86-64_cmprssptrs means running extended functional test using JDK11 on linux_x86-64_cmprssptrs. For system test, you should see system in the job name.

From a test failure looking at the java -version output it is not clear where to download the JDK For example I know the SHA numbers and the build date and number too "20190130_208 (JIT enabled, AOT enabled)" but where do I download this build?

In Openj9 Jenkins, we can get parameters from test build https://ci.eclipse.org/openj9/view/Test/job/Test-extended.functional-JDK11-linux_x86-64_cmprssptrs/169/parameters/

It shows UPSTREAM_JOB_NAME and UPSTREAM_JOB_NUMBER. We should be able to find the build in Jenkins Build tab. Once we find the exact JDK build, we can Copy Link Address of the archived JDK.

Or another way to do this https://github.com/eclipse/openj9/issues/3697#issuecomment-439166452

Can see the curl command from the Grinder so I can find the JDK from there but why shouldn't we be able to extract the build ID from java -version somehow? If someone just pasted me the java -version output I would have no idea how to grab that same build, and that is a problem.

The grinder can take JDK from any public url (i.e., AdoptOpenJDK, Artifactory, etc). We may not have enough infromation to determine OpenJ9 build ID.

STF tests don't show crash information in the console output Have to dig through the artifacts to find it always which is time consuming

@Mesbah-Alam we may need to update STF to handle this.

Why do we need to export JDK_VERSION when running tests? Should we be able to determine that from $JAVA_BIN/java -version output?

We are working on this https://github.com/eclipse/openj9/issues/442. The idea is we do not need to provide JDK_VERSION, JDK_IMPL and SPEC. All the information can be auto-detected when JAVA_BIN is provided.

When attempting to reproduce [1] locally the make compile command from instructions from [2] seems to compile all tests (sometimes comping ~6000 java source files) however a Grinder launched for the same test seems only compile and run the one specific test [3]. Why is that? How can I locally do thet same thing as the Grinder? i.e. I only want to compile and run my one test I care about.

We can use BUILD_LIST to narrow down to the folder that we care about. This is documented in FAQ Maybe we should add a link to FAQ in https://github.com/eclipse/openj9/wiki/Reproducing-Test-Failures-Locally Note: this feature only works for subdirs in functional atm. Support for systemtest is on the way.

Tests seem to run in huge buckets per one Jenkins job as opposed to much smaller buckets per Jenkins job. This makes re-running a test tedious and involves a lot of manual work as opposed to VMFarm-esque clicking a "Re-run on Grinder" button and launching a reproduction batch.

Test job does not have all parameters defined in config, so rebuild maynot work. One thing that is on our to-do list is to auto generate test jobs, so that we can avoid this issue.

Grinder tests are sequential which is very time consuming when it comes to reproducing issues which are intermittent (1/50 failures take several hours as opposed to a few minutes to reproduce) Sometimes a failure occurs in the middle of a grinder, say the 5th job out of 50. Is there a way to "kill" the grinder and just get the data for the failure at that point without running through the other 45 iterations?

Issue is created https://github.com/AdoptOpenJDK/openjdk-tests/issues/836 Once parallel is enabled, iteration 50 means start 50 separate jobs. We can kill any of them in the middle of the grinder.

Getting machine access is non-trivial (impossible?) which makes reproducing issues which only appear to happen on farm machines very difficult

Unfortunately, test team does not have control of the machines access. fyi @jdekonin

fjeremic commented 5 years ago

It is in the link i.e. ci.eclipse.org/openj9/job/Test-extended.system-JDK8-linux_390-64_cmprssptrs/181 extended.system indicates system testing.

Right, this is also obvious from the test name. So where do I find this test? Is "extended.system" == "systemtest"? That part is confusing, at least to me.

You need to look at the parent job(s) of the test failure, find the build job for the platform, and the jvm is an artifact of the job. It is only available for a short time, depending on how many other builds are run, as we have limited space.

Using your example: https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-linux_390-64_cmprssptrs/181/

I navigate to "build number 850", then to "build number 383", then I seem to be at the top level for this nightly build: https://ci.eclipse.org/openj9/job/Pipeline-Build-Test-All/383/

I fail to see how to navigate to the build artifact you describe. Can you describe the steps from here?

fjeremic commented 5 years ago

It is documented in: AdoptOpenJDK/openjdk-tests/wiki/How-to-Run-a-Grinder-Build-on-Jenkins /test/docs/OpenJ9TestUserGuide.md@master

There are quirks. For example it is non-obvious how to input the following command:

-Xjit:{java/lang/SomeClass.foo()I}(tracefull,log=foo.trace)

Through experimentation and help from others it seems you have to double quote the full command and escape the quotes, so the actual thing you have to input is:

\"-Xjit:{java/lang/SomeClass.foo()I}(tracefull,log=foo.trace)\"

In Openj9 Jenkins, we can get parameters from test build ci.eclipse.org/openj9/view/Test/job/Test-extended.functional-JDK11-linux_x86-64_cmprssptrs/169/parameters

It shows UPSTREAM_JOB_NAME and UPSTREAM_JOB_NUMBER. We should be able to find the build in Jenkins Build tab. Once we find the exact JDK build, we can Copy Link Address of the archived JDK.

Or another way to do this #3697 (comment)

Neither of these seem to work for the test failure example at hand from #4526:

I open up the test link from #4526:
- https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-linux_390-64_cmprssptrs/181/
Navigate to parameters on the left
- https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-linux_390-64_cmprssptrs/181/parameters/
See that it came from build 1178
Navigate to build 1178 using your link
- https://ci.eclipse.org/openj9/view/Build/job/Build-JDK8-linux_390-64_cmprssptrs/1178/
- I don't see any artifacts or build links here

Navigating to the build artifact from a test failure would be good to know, however my original question was if there is a way to navigate to the build artifact only using the java -version output which I can always find inside of a test failure console log.

We can use BUILD_LIST to narrow down to the folder that we care about. This is documented in FAQ Maybe we should add a link to FAQ in eclipse/openj9/wiki/Reproducing-Test-Failures-Locally Note: this feature only works for subdirs in functional atm. Support for systemtest is on the way.

Ah I see, I think I encountered the systemtest limitation here then.

Thanks for all the answers!

llxia commented 5 years ago

I fail to see how to navigate to the build artifact you describe. Can you describe the steps from here?

You do not need to get to build number 383. The information is in console output of build number 850

Hopefully, this comment lists the steps clearly https://github.com/eclipse/openj9/issues/3697#issuecomment-439166452

llxia commented 5 years ago

JDK build 1178 passed but not having JDK archived. I do see tar command in the console. https://ci.eclipse.org/openj9/view/Build/job/Build-JDK8-linux_390-64_cmprssptrs/1178/console

And the next nightly build have JDK archived https://ci.eclipse.org/openj9/view/Build/job/Build-JDK8-linux_390-64_cmprssptrs/1184/

@AdamBrousseau Is there a limitaton on how long the artifacts are kept?

fjeremic commented 5 years ago

You do not need to get to build number 383. The information is in console output of build number 850

Hopefully, this comment lists the steps clearly #3697 (comment)

Right but there is no archive link anywhere.

JDK build 1178 passed but not having JDK archived. I do see tar comment in the console. ci.eclipse.org/openj9/view/Build/job/Build-JDK8-linux_390-64_cmprssptrs/1178/console

Yeah it says:

23:51:16 ARTIFACTORY server is not set saving artifacts on jenkins.

Not sure why it worked on the very next build. Does that mean I can't get a hold of the exact binary JDK package used from that build? (without having to rebuild the entire JDK using the SHAs)

AdamBrousseau commented 5 years ago

We only have space to keep 10 artifacts per build at the moment

pshipton commented 5 years ago

We should be able to pass in the build number and have it appear in the -version output, fairly certain there is a configure parameter to allow this. In this part below I think we could change the +0 to be the build number. (build 11.0.2-internal+0-adhoc.jenkins.Build-JDK11-linuxx86-64cmprssptrs)

smlambert commented 4 years ago

A lot of these items have now been addressed through several major updates and features added, including and not limited to:

variation (from playlist) and JVM_OPTIONS used are printed at the start of each test run Example console output: 15:27:52 variation: NoOptions 15:27:52 JVM_OPTIONS: -Xcompressedrefs
Re-run link for easier prepopulation of Grinder parameters
AUTODETECT, so if you use customized/upstream SDK_RESOURCE, you no longer need to tell TKG what JDK_VERSION/JDK_IMPL it is
removed the "make -f run_configure.mk" step, to simplify test runs even further
better doc to ensure developers know utilize BUILD_LIST to control which directories to be compiled
new logical target called _testList (to allow a custom list of test targets to be passed to TKG, therefore to a Grinder/test job/workflow)
rename all directories in openjdk-tests repo to match the test group names (system == system, external == external, etc)
simplification of using get.sh (can now simply clone openjdk-tests, export TEST_JDK_HOME=/whereEverYouPutYourJDK and then run get.sh with no arguments)
centralization of test doc can be tracked via https://github.com/AdoptOpenJDK/openjdk-tests/issues/1558
smart parallelization work can be tracked via https://github.com/AdoptOpenJDK/openjdk-tests/issues/1563

Most other items have been addressed in comments above. The suggested enhancement to STF output should be raised against the STF repo, though I do not believe it will get any priority (no resources to spare), and that STF output is already too verbose (would want to reduce noise, before adding new 'content' to the output stream).

Given all of that, I believe we can/should close this issue, @fjeremic ?

fjeremic commented 4 years ago

Agreed. Many thanks to the test team who invested resource into fixing most of these issues. I certainly have observed the improvements and am very grateful for the investment in this area. Thank you!

eclipse-openj9 / openj9

Feedback on test reproduction quirks for the test team #4548