adoptium / openj9-systemtest

Long running J9 tests

Test-extended.system-JDK8-win_x86 SharedClassesAPI_0 sharedcc_LOCAL SERVICE #62

Closed pshipton closed 5 years ago

pshipton commented 5 years ago

https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-win_x86/108 https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-win_x86-64_cmprssptrs/111

The test seems to have found a shared cache which is unrelated to the test. Perhaps the test should set a cache directory so it does not find unrelated cache files.

SCC stderr INFO: cache name is sharedcc_LOCAL SERVICE
...
SCC stderr Dec 21, 2018 12:23:08 AM net.openj9.sc.api.SharedClassesCacheChecker delete
SCC stderr INFO: Attempting to delete cache: DefaultLocationGroupAccessJavaWL3 and return value from delete call was: 0
SCC stderr Dec 21, 2018 12:23:08 AM net.openj9.sc.api.SharedClassesCacheChecker delete
SCC stderr INFO: Return value means destroyed all caches
SCC stderr Dec 21, 2018 12:23:08 AM net.openj9.sc.api.SharedClassesCacheChecker delete
SCC stderr INFO: Attempting to delete cache: DefaultLocationGroupAccessJavaWL4 and return value from delete call was: 0
SCC stderr Dec 21, 2018 12:23:08 AM net.openj9.sc.api.SharedClassesCacheChecker delete
SCC stderr INFO: Return value means destroyed all caches
SCC stderr Dec 21, 2018 12:23:08 AM net.openj9.sc.api.SharedClassesCacheChecker delete
SCC stderr INFO: Attempting to delete cache: DefaultLocationGroupAccessJavaWL1 and return value from delete call was: 0
SCC stderr Dec 21, 2018 12:23:08 AM net.openj9.sc.api.SharedClassesCacheChecker delete
SCC stderr INFO: Return value means destroyed all caches
SCC ************
SCC DELETION FAILED
SCC ************
SCC stderr Dec 21, 2018 12:23:08 AM net.openj9.sc.api.SharedClassesCacheChecker delete
SCC stderr INFO: Attempting to delete cache: DefaultLocationGroupAccessJavaWL2 and return value from delete call was: 0
SCC stderr Dec 21, 2018 12:23:08 AM net.openj9.sc.api.SharedClassesCacheChecker delete
SCC stderr INFO: Return value means destroyed all caches
SCC stderr Dec 21, 2018 12:23:08 AM net.openj9.sc.api.SharedClassesCacheChecker delete
SCC stderr INFO: Attempting to delete cache: sharedcc_LOCAL SERVICE and return value from delete call was: -2
SCC stderr Dec 21, 2018 12:23:08 AM net.openj9.sc.api.SharedClassesCacheChecker delete
SCC stderr INFO: Return value means DESTROY_FAILED_CURRENT_GEN_CACHE
STF 00:23:08.169 - **FAILED** Process SCC ended with exit code (1) and not the expected exit code/s (0)
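
As a rough illustration of the suggestion above to give the test its own cache directory, the sketch below builds an `-Xshareclasses` option that sets both `name=` and `cacheDir=`. The class `SharedCacheOptions`, its method, and the cache and directory names are hypothetical and not taken from the openj9-systemtest code; the only things assumed from OpenJ9 itself are the `name=` and `cacheDir=` suboptions.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for illustration; not part of openj9-systemtest. */
final class SharedCacheOptions {
    private SharedCacheOptions() {}

    /**
     * Builds an -Xshareclasses option that pins the cache name and the cache
     * directory, so the JVM under test does not use the default cache location.
     */
    static String forTestCache(String cacheName, Path cacheDir) {
        if (cacheName == null || cacheName.isEmpty()) {
            throw new IllegalArgumentException("cacheName must not be empty");
        }
        return "-Xshareclasses:name=" + cacheName + ",cacheDir=" + cacheDir;
    }

    public static void main(String[] args) {
        // Hypothetical test-owned directory and cache name.
        Path dir = Paths.get("results", "cacheDir");
        System.out.println(forTestCache("SharedClassesAPI_demo", dir));
    }
}
```

With every cache created and listed under a test-owned directory, a cache such as sharedcc_LOCAL SERVICE in the default location would not be visible to the test at all.
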
pshipton commented 5 years ago

@Mesbah-Alam @smlambert

pshipton commented 5 years ago

I also note https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-osx_x86-64_cmprssptrs/8 found a cache sharedcc_jenkins which is unrelated to the test, and destroyed it.

STF 01:16:10.256 - +------ Step 3 - Destroy Persistent Shared Classes Caches
STF 01:16:10.256 - | Destroy all persistent caches
STF 01:16:10.256 - |
STF 01:16:10.257 - Running command: /Users/jenkins/workspace/Test-extended.system-JDK8-osx_x86-64_cmprssptrs/openjdkbinary/j2sdk-image/jre/bin/../../bin/java -Xshareclasses:destroyAll
STF 01:16:10.257 - Redirecting stderr to /Users/jenkins/workspace/Test-extended.system-JDK8-osx_x86-64_cmprssptrs/openjdk-tests/TestConfig/scripts/testKitGen/../../../TestConfig/test_output_15453722074971/SharedClassesAPI_0/20181221-011606-SharedClassesAPI/results/3.SCC.stderr
STF 01:16:10.257 - Redirecting stdout to /Users/jenkins/workspace/Test-extended.system-JDK8-osx_x86-64_cmprssptrs/openjdk-tests/TestConfig/scripts/testKitGen/../../../TestConfig/test_output_15453722074971/SharedClassesAPI_0/20181221-011606-SharedClassesAPI/results/3.SCC.stdout
STF 01:16:10.265 - Monitoring processes: SCC
SCC stderr 
SCC stderr Attempting to destroy all caches in cacheDir /Users/jenkins/javasharedresources/
SCC stderr 
SCC stderr JVMSHRC806I Compressed references persistent shared cache "sharedcc_jenkins" has been destroyed. Use option -Xnocompressedrefs if you want to destroy a non-compressed references cache.

JasonFengJ9 commented 5 years ago

A few similar occurrences at https://ci.eclipse.org/openj9/job/Test-extended.system-JDK11-osx_x86-64_cmprssptrs/36/tapResults/.

STF 04:24:08.325 - Monitoring processes: WL1 WL2 WL3 WL4
STF 04:24:14.490 - **FAILED** Process WL3 ended with exit code (1) and not the expected exit code/s (0)

SharedClassesWorkloadTest_Softmx_IncreaseDecrease_0/20190102-042841-SharedClassesWorkloadTest_Softmx_IncreaseDecrease/results/4.jvm1.stderr

Failed to find class java/lang/Object in shared cache for class-loader id 0. 
Stored class java/lang/Object in shared cache for class-loader id 0 with URL /Users/jenkins/workspace/Test-extended.system-JDK11-osx_x86-64_cmprssptrs/openjdkbinary/j2sdk-image/lib/modules (index 0).
Failed to find class java/lang/J9VMInternals in shared cache for class-loader id 0.
Stored class java/lang/J9VMInternals in shared cache for class-loader id 0 with URL /Users/jenkins/workspace/Test-extended.system-JDK11-osx_x86-64_cmprssptrs/openjdkbinary/j2sdk-image/lib/modules (index 0). 
Failed to find class com/ibm/oti/vm/VM in shared cache for class-loader id 0.

Mesbah-Alam commented 5 years ago

The test does define a folder exclusive to the test, in which it creates some caches. However, some of the use cases that the test implements appear not to use that designated folder and use the default cache location instead: https://github.com/eclipse/openj9-systemtest/blob/956e2ac3f18e6c37c93d32b8fab79bc54d2594c3/openj9.test.sharedClasses.jvmti/src/test.sharedClasses.jvmti/net/openj9/stf/SharedClassesAPI.java#L60

So I suspect that the test finding caches unrelated to it is by design.

Mesbah-Alam commented 5 years ago

found a cache sharedcc_jenkins which is unrelated to the test, and destroyed it.

The test does destroy all persistent and non-persistent caches from the default location, i.e., not the folder specific to the test:
https://github.com/eclipse/openj9-systemtest/blob/956e2ac3f18e6c37c93d32b8fab79bc54d2594c3/openj9.test.sharedClasses.jvmti/src/test.sharedClasses.jvmti/net/openj9/stf/SharedClassesAPI.java#L118


Hi Simon, since I was not involved in the original development of this test, do you recall why the step to destroy all caches was added to the setup stage of this test?

This results in the deletion of caches that are completely unrelated to the test (e.g. sharedcc_jenkins) and that may be important to the Jenkins slave machine on which the test is running. We need to find a more targeted clean-up method for this test. @lumpfish

pshipton commented 5 years ago

@Mesbah-Alam is there an outlook for fixing this? The SharedClassesAPI_0 test continues to fail on Windows, and likely osx, every night.

lumpfish commented 5 years ago

I recall that the shared classes tests had issues if caches had been left around from previous tests, but I don't know specifics. I think one issue was that if they were left lying around in unique test specific directories they simply accumulated over time with no means of clearing them, so it was not noticed until the test machine started to run out of resources.

lumpfish commented 5 years ago

Are the caches which the test is unable to destroy there for a reason? Has the default Java behaviour changed so that shared classes is now enabled for any 'general' Java process? If so, then arbitrarily cleaning them up won't be tenable any more. If we are only concerned with the test aborting because the delete fails, then one option would be to make the delete failure non-fatal.

Mesbah-Alam commented 5 years ago

@lumpfish - tests can definitely clean up the shared classes caches in the test-specific directories. The problem arises when they try to delete shared classes caches from elsewhere, which includes caches that the tests fail to delete, e.g. SCC stderr INFO: Attempting to delete cache: sharedcc_LOCAL SERVICE and return value from delete call was: -2

Can we restrict the tests to clean up only test-specific caches from the test-specific directory, and maybe delete only the caches that they create outside of it (e.g. by providing the cache name in the delete command)?
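
A minimal sketch of that idea, assuming the clean-up step runs a separate `java` command: instead of `-Xshareclasses:destroyAll`, the command names one cache (and optionally its directory) and uses the `destroy` suboption. The class `NamedCacheDestroy`, the executable path, and the cache name are hypothetical; this only builds the command line and does not run it.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of openj9-systemtest. */
final class NamedCacheDestroy {
    private NamedCacheDestroy() {}

    /**
     * Builds a command that destroys a single named cache.
     * Pass cacheDir as null to target the default cache location.
     */
    static List<String> command(String javaExe, String cacheName, String cacheDir) {
        StringBuilder option = new StringBuilder("-Xshareclasses:name=").append(cacheName);
        if (cacheDir != null) {
            option.append(",cacheDir=").append(cacheDir);
        }
        option.append(",destroy");
        return Arrays.asList(javaExe, option.toString());
    }

    public static void main(String[] args) {
        // Hypothetical cache name; the command is printed, not executed.
        System.out.println(String.join(" ", command("java", "testCacheA", null)));
    }
}
```

Because the cache name is explicit, a cache such as sharedcc_jenkins is never a deletion target.
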

pshipton commented 5 years ago

I guess since we just disabled shared classes by default, we won't see this problem any more until it gets re-enabled for the next release.

pshipton commented 5 years ago

Seems the caches continue to persist on the machines, although shared classes by default is disabled now. The test is still failing. https://ci.eclipse.org/openj9/job/Test-extended.system-JDK11-win_x86-64_cmprssptrs/129

pshipton commented 5 years ago

If we can't fix the tests soon, we may need to clean up the machines @AdamBrousseau @jdekonin

smlambert commented 5 years ago

I don't think it should be left as an either/or scenario; both should happen: 1) work to fix the tests, 2) regular/automated machine cleanup.

AdamBrousseau commented 5 years ago

What are the files/folders I need to clean up? I will add them to the cleanup job. ~/javasharedresources/ ?

pshipton commented 5 years ago

Sounds right. Please check the machine(s) for a shared cache file containing sharedcc_LOCAL SERVICE in the name.

Mesbah-Alam commented 5 years ago

My understanding of "fixing the tests" from the discussion above is: update the test logic so that a failure in cache clean-up does not fail the test, i.e., as @lumpfish mentioned above, "make the delete failure non-fatal". I am working on this update.

pshipton commented 5 years ago

Technically the test should fail if it can't delete the caches it created. Or it should only attempt to delete the caches it created, and continue to fail if it doesn't work.
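
One way to picture the second option, as a sketch only: the test records the names of the caches it creates, attempts to delete exactly those, and treats any non-zero return value as a failure (the logs above show 0 for a successful delete and -2 for DESTROY_FAILED_CURRENT_GEN_CACHE). The class `OwnedCacheCleanup` and the `deleter` callback are hypothetical stand-ins for whatever delete call the test really makes.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.ToIntFunction;

/** Hypothetical helper for illustration; not part of openj9-systemtest. */
final class OwnedCacheCleanup {
    private final Set<String> created = new LinkedHashSet<>();

    /** Remember a cache that this test created. */
    void recordCreated(String cacheName) {
        created.add(cacheName);
    }

    /**
     * Attempts to delete only the recorded caches.
     * The deleter stands in for the real delete call; 0 is treated as success.
     * Returns the names of caches whose deletion failed.
     */
    Set<String> deleteCreated(ToIntFunction<String> deleter) {
        Set<String> failed = new LinkedHashSet<>();
        for (String name : created) {
            if (deleter.applyAsInt(name) != 0) {
                failed.add(name);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        OwnedCacheCleanup cleanup = new OwnedCacheCleanup();
        cleanup.recordCreated("testCacheA"); // hypothetical cache name
        Set<String> failed = cleanup.deleteCreated(name -> 0); // fake deleter: always succeeds
        if (!failed.isEmpty()) {
            throw new IllegalStateException("Could not delete: " + failed);
        }
        System.out.println("All test-created caches deleted");
    }
}
```

Caches the test never recorded, such as sharedcc_LOCAL SERVICE, are never passed to the deleter, so they can neither be destroyed nor cause a failure.
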

Mesbah-Alam commented 5 years ago

Technically the test should fail if it can't delete the caches it created. Or it should only attempt to delete the caches it created, and continue to fail if it doesn't work.

pshipton commented 5 years ago

This is 1 of 2 remaining failures (non osx) in the nightly builds for the 0.12 release.

@Mesbah-Alam what is the outlook for fixing?

@jdekonin @AdamBrousseau is it possible to clean the sharedcc_LOCAL SERVICE shared cache from the machines? Or is this related to some other running process? I was assuming it is related to having shared classes enabled by default, but shared classes is no longer enabled by default in the latest builds.

Mesbah-Alam commented 5 years ago

I am currently testing the PR that updates all the tests to only destroy the test specific cache (instead of all).

@pshipton

Mesbah-Alam commented 5 years ago

DESTROY_FAILED_CURRENT_GEN_CACHE seems to be a test issue: https://ci.eclipse.org/openj9/view/Test/job/Test-extended.system-JDK8-win_x86/137/tapResults/

The SharedClassesCacheChecker receives it when it tries to delete the cache that it owns itself: DefaultLocationGroupAccessJavaNoIterator. https://github.com/eclipse/openj9-systemtest/issues/78 has been opened to fix this.
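
For illustration only, and not necessarily how issue 78 resolves it: one possible approach is for the checker to leave its own cache out of the list of deletion candidates. The class `DeletionCandidates` and the way the checker's own cache name is passed in are assumptions; the cache names in `main` are taken from the logs in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of openj9-systemtest. */
final class DeletionCandidates {
    private DeletionCandidates() {}

    /** Returns the given cache names minus the cache the calling process is itself using. */
    static List<String> excludingOwn(List<String> cacheNames, String ownCacheName) {
        List<String> result = new ArrayList<>();
        for (String name : cacheNames) {
            if (!name.equals(ownCacheName)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList(
                "DefaultLocationGroupAccessJavaWL1",
                "DefaultLocationGroupAccessJavaNoIterator");
        // Assumes the checker knows the name of the cache it is attached to.
        System.out.println(excludingOwn(all, "DefaultLocationGroupAccessJavaNoIterator"));
    }
}
```
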

Mesbah-Alam commented 5 years ago

This has been fixed via https://github.com/eclipse/openj9-systemtest/issues/78

The test has been running fine: https://ci.eclipse.org/openj9/view/Test/job/Test-extended.system-JDK8-win_x86/198/tapResults/

@pshipton - could you please close this issue at this point?