adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

test-azure-win2012r2-x64-2 / test-azure-win2016-x64-1: openj9 SharedClasses.xxx tests fail (Memory issue?) #1963

Open lumpfish opened 3 years ago

lumpfish commented 3 years ago

The following openj9 shared classes test targets may fail when they land on test-azure-win2012r2-x64-2 or test-azure-win2016-x64-1:

SharedClassesAPI
SharedClasses.SCM01.MultiCL
SharedClasses.SCM01.MultiThread
SharedClasses.SCM01.MultiThreadMultiCL
SharedClasses.SCM23.MultiCL
SharedClasses.SCM23.MultiThread
SharedClasses.SCM23.MultiThreadMultiCL

The symptoms are various out-of-memory errors, e.g.:

11:52:21  MT4 stderr JVMDUMP032I JVM requested Snap dump using 'C:\Users\jenkins\workspace\Grinder\openjdk-tests\TKG\output_1613993216963\SharedClasses.SCM23.MultiThread_1\20210222-114232-SharedClasses\results\Snap.20210222.114552.7872.0004.trc' in response to an event
11:52:21  MT4 stderr JVMDUMP010I Snap dump written to C:\Users\jenkins\workspace\Grinder\openjdk-tests\TKG\output_1613993216963\SharedClasses.SCM23.MultiThread_1\20210222-114232-SharedClasses\results\Snap.20210222.114552.7872.0004.trc
11:52:21  MT4 stderr JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".
11:52:21  MT4 stderr Exception in thread "main" java.lang.OutOfMemoryError: Failed to create a thread: retVal -1073741830, errno 22
11:52:21  MT4 stderr    at java.lang.Thread.startImpl(Native Method)
11:52:21  MT4 stderr    at java.lang.Thread.start(Thread.java:993)
11:52:21  MT4 stderr    at net.openj9.test.sc.LoaderSlaveMultiThread.run(LoaderSlaveMultiThread.java:130)
11:52:21  MT4 stderr    at net.openj9.test.sc.LoaderSlaveMultiThread.main(LoaderSlaveMultiThread.java:59)
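
A side note on the `retVal -1073741830` in the trace above: arithmetically the value is -(0x40000000 + 6). The sketch below only does that split. Reading the high bit as an "OS errno was set" flag and the low part as a thread-library error code is an assumption about the JVM internals, not something confirmed in this issue, and `split_retval` is a hypothetical helper name.

```python
def split_retval(ret_val: int, flag_bit: int = 0x40000000):
    """Split a negative retVal into (flag_set, low_code).

    Assumption: the value is the negation of a flag bit OR-ed with a
    small error code. This is an interpretation, not a documented fact
    taken from this issue.
    """
    magnitude = -ret_val                    # 1073741830 for the trace above
    flag_set = bool(magnitude & flag_bit)   # is the assumed flag bit present?
    low_code = magnitude & ~flag_bit        # what remains once the flag is cleared
    return flag_set, low_code


if __name__ == "__main__":
    flag_set, low_code = split_retval(-1073741830)
    print(hex(1073741830), flag_set, low_code)
```

If the assumption holds, the low code together with `errno 22` would be the part worth looking up in the JVM's thread library sources.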

Their Jenkins pages show that the machines have 4GB of RAM:
https://ci.adoptopenjdk.net/computer/test-azure-win2012r2-x64-2/ - Failed
https://ci.adoptopenjdk.net/computer/test-azure-win2016-x64-1/ - Failed

The pages for two other machines also show them as having 4GB of memory, but the tests pass on those machines:
https://ci.adoptopenjdk.net/computer/test-azure-win2012r2-x64-1/ - Passed
https://ci.adoptopenjdk.net/computer/test-azure-win2012r2-x64-3/ - Passed

sxa commented 3 years ago

Seems likely related to the memory on those machines. Next steps should probably be to verify the swap file settings, whether they can be increased with any effect, and if not we should look to increase the RAM on those systems to 6GB first, then 8GB if that doesn't work.

lumpfish commented 3 years ago

This link will run all the above targets: https://ci.adoptopenjdk.net/job/Grinder/parambuild/?JDK_VERSION=11&JDK_IMPL=openj9&JDK_VENDOR=adoptopenjdk&BUILD_LIST=system&PLATFORM=x86-64_windows_xl&TARGET=testList%20TESTLIST=SharedClassesAPI,SharedClasses.SCM01.MultiCL,SharedClasses.SCM01.MultiThread,SharedClasses.SCM01.MultiThreadMultiCL,SharedClasses.SCM23.MultiCL,SharedClasses.SCM23.MultiThread,SharedClasses.SCM23.MultiThreadMultiCL
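
For anyone wanting to re-run a subset of the targets, here is a minimal sketch of how a link like the one above could be assembled. The parameter names and values are copied from that link; the helper name `build_grinder_url` and the `jdk_version` argument are hypothetical, and the sketch assumes the Grinder job keeps accepting these parameters.

```python
from urllib.parse import quote, urlencode

GRINDER_URL = "https://ci.adoptopenjdk.net/job/Grinder/parambuild/"

# Targets listed in the issue description.
TARGETS = [
    "SharedClassesAPI",
    "SharedClasses.SCM01.MultiCL",
    "SharedClasses.SCM01.MultiThread",
    "SharedClasses.SCM01.MultiThreadMultiCL",
    "SharedClasses.SCM23.MultiCL",
    "SharedClasses.SCM23.MultiThread",
    "SharedClasses.SCM23.MultiThreadMultiCL",
]


def build_grinder_url(targets, jdk_version="11"):
    """Build a Grinder parambuild link for the given test targets."""
    params = {
        "JDK_VERSION": jdk_version,
        "JDK_IMPL": "openj9",
        "JDK_VENDOR": "adoptopenjdk",
        "BUILD_LIST": "system",
        "PLATFORM": "x86-64_windows_xl",
        # The original link passes the list as "testList TESTLIST=a,b,c".
        "TARGET": "testList TESTLIST=" + ",".join(targets),
    }
    # quote (not quote_plus) encodes the space as %20; "=" and "," are
    # kept literal so the result matches the form of the original link.
    return GRINDER_URL + "?" + urlencode(params, quote_via=quote, safe="=,")


if __name__ == "__main__":
    print(build_grinder_url(TARGETS))
```

Passing a shorter list, e.g. `build_grinder_url(TARGETS[:1])`, gives a link that runs only SharedClassesAPI.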

karianna commented 3 years ago

> Seems likely related to the memory on those machines. Next steps should probably be to verify the swap file settings, whether they can be increased with any effect, and if not we should look to increase the RAM on those systems to 6GB first, then 8GB if that doesn't work.

Could also be filehandles.

sxa commented 3 years ago

> Could also be filehandles.

What determines available file handles on a per-machine basis? Is that in any way a default set on RAM size or something else?

sxa commented 3 years ago

(I've disabled the win2016 system by removing ci.role.test until this can be debugged/diagnosed)

karianna commented 3 years ago

> > Could also be filehandles.
>
> What determines available file handles on a per-machine basis? Is that in any way a default set on RAM size or something else?

On Windows? I've actually got no idea.

sxa commented 3 years ago

Testing here with swap space increased on test-azure-win2016-x64-1 (assuming it goes live without a reboot). If that doesn't work I'll increase the RAM to 6GB.

sxa commented 3 years ago

Hmmm 2012r2-2 has 16GB of RAM. Running a Grinder on there too to verify

sxa commented 3 years ago

So the Grinder on the win2016 box failed but not with an obvious memory issue - @lumpfish can you check the log of that one to see if it's the same issue you've seen?

The win2012r2 run did give an OutOfMemoryError - I have made sure there is up to 12GB of swap and am re-running in this Grinder.

sxa commented 3 years ago

The Win2012 machine showed an OutOfMemory error during one of the tests (a different one in each run) in 7231 and 7237. I'm going to restart it, run the same test again while trying to watch the usage live on the machine, and then see how easy it is to increase to 6GB. [EDIT: no I won't, as Azure doesn't have 6GB options, so it'll have to be 8GB, which is almost twice the cost unfortunately ... Maybe I'll just shut down the 2012 one and bump the 2016 up to the 8GB B2ms spec.]

lumpfish commented 3 years ago

> So the Grinder on the win2016 box failed but not with an obvious memory issue - @lumpfish can you check the log of that one to see if it's the same issue you've seen?

That test is similar in that it runs multiple jvms in parallel which share a shared class cache.

The stderr from the failing process (found by downloading the system_test_output.tar.gz file from the failing job (https://ci.adoptopenjdk.net/job/Grinder/7230/) ) contains:

JVMSHRC162E The wait for the creation mutex while opening shared memory has timed out
JVMSHRC662I Error recovery: destroyed semaphore set associated with shared class cache.
JVMSHRC840E Failed to start up the shared cache.
JVMJ9VM015W Initialization error for library j9shr29(11): JVMJ9VM009E J9VMDllMain failed
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

I've not seen (or noticed) that before.
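
Since the runs so far show two different failure signatures (the OutOfMemoryError and this shared cache start-up failure), a small triage sketch may help when going through downloaded stderr files. The message IDs and exception names are taken from the logs in this issue; the function names, the category labels and the message-ID pattern are assumptions made for illustration.

```python
import re

# Assumed shape of the message IDs seen in this issue, e.g. JVMSHRC162E,
# JVMDUMP032I, JVMJ9VM015W: "JVM", a 4-character component, 3 digits and
# a severity letter.
MESSAGE_ID = re.compile(r"\bJVM[A-Z0-9]{4}\d{3}[EWI]\b")


def message_ids(stderr_text):
    """Return the JVM message IDs in order of appearance."""
    return MESSAGE_ID.findall(stderr_text)


def classify_failure(stderr_text):
    """Roughly bucket a failing run by the signatures seen in this issue."""
    if ("java.lang.OutOfMemoryError" in stderr_text
            or "java/lang/OutOfMemoryError" in stderr_text):
        return "out-of-memory"
    if "JVMSHRC162E" in stderr_text or "JVMSHRC840E" in stderr_text:
        return "shared-cache-startup"
    return "unknown"


if __name__ == "__main__":
    sample = (
        "JVMSHRC162E The wait for the creation mutex while opening shared memory has timed out\n"
        "JVMSHRC840E Failed to start up the shared cache.\n"
    )
    print(classify_failure(sample), message_ids(sample))
```

A run that lands in "shared-cache-startup" says nothing either way about memory, so it should not count for or against the RAM theory.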

sxa commented 3 years ago

Hmmm https://ci.adoptopenjdk.net/job/Grinder/7260/ ran through without any failure on azure-win2012r2-2 after an earlier reboot.

Although, trying again, this has popped up:

image

Upgrade time then! (FYI @smlambert, it looks like Windows tests can't complete on a 4GB Windows system.)

sxa commented 3 years ago

I've shut the Windows2012 machine down (it's also more expensive than the new ones I've set up, so shutting it down isn't a bad idea). I'm re-running a Grinder on the 2016 machine (7268) since the previous one passed, and I'll look at bumping it up to 8GB if it fails (it will still be cheaper than the Win2012 one). [EDIT: 7268 passed - running again on the 4GB Win2016 box at 7277 and 7278.]

Side note: I'm also running a grinder on one of the larger 2012 boxes at 7269 - mostly because I'm curious as to whether there are any performance differences on that one (But I suspect on the system test suites it won't make much difference)

sxa commented 3 years ago

7277 failed a test but did not throw a visible OutOfMemory error, so it is inconclusive.

lumpfish commented 3 years ago

7277 failed with the same mutex wait error:

JVMSHRC162E The wait for the creation mutex while opening shared memory has timed out
JVMSHRC662I Error recovery: destroyed semaphore set associated with shared class cache.
JVMSHRC840E Failed to start up the shared cache.
JVMJ9VM015W Initialization error for library j9shr29(11): JVMJ9VM009E J9VMDllMain failed
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

sxa commented 3 years ago

Despite the above tests being inconclusive due to the failure on shared class setup, I'm going to go ahead with the upgrade.

Converted test-azure-win2016-x64-1 from B2s (left) to B2ms (right). Back online with ci.role.test label and queued up two Grinders 7288 and 7299 - hopefully that will resolve the OutOfMemoryErrors if not the class cache issue.

image

sxa commented 3 years ago

I'm going to deprovision https://ci.adoptopenjdk.net/computer/test-azure-win2012r2-x64-2/ (test-2012r2-2 on the Azure portal) - we can recreate it if required in the future, but it's unfit for purpose in its current state and cannot easily be converted to a cost-effective larger system.

sxa commented 3 years ago

7288 failed but https://ci.adoptopenjdk.net/job/Grinder/7301/ succeeded - @lumpfish can you take a look at 7288 and let me know if you're concerned about the failure (in terms of whether it could still be a machine-specific one-off)?

lumpfish commented 3 years ago

7288 (https://ci.adoptopenjdk.net/job/Grinder/7288/console) looks like it failed with a Jenkins connect issue?

sxa commented 3 years ago

Updated links to re-run:

sxa commented 1 year ago

Re-runs:

sophia-guo commented 6 months ago

We don't run impl=openj9 tests in Adoptium, so can win2016 be enabled?