adoptium / openj9-systemtest

Long running J9 tests
Other
5 stars 38 forks source link

#74 causing asserts on Windows, and failures on osx #75

Open pshipton opened 5 years ago

pshipton commented 5 years ago

I assume these are the result of #74 and #71

https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-win_x86-64_cmprssptrs/141/ https://ci.eclipse.org/openj9/job/Test-extended.system-JDK11-win_x86-64_cmprssptrs/137/

https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-osx_x86-64_cmprssptrs/36/ https://ci.eclipse.org/openj9/job/Test-extended.system-JDK11-osx_x86-64_cmprssptrs/53/

pshipton commented 5 years ago

@Mesbah-Alam

pshipton commented 5 years ago

Windows

JVMSHRC777W The softmx limit for shared cache usage is smaller than the minimum feasible value 3980759048 bytes.
08:16:49.391 0x1bd9a00   j9shr.1012   *   ** ASSERTION FAILED ** at CompositeCache.cpp:6276: (((0 != _theca) && hasWriteMutex(currentThread) && (getTotalSize() >= softMaxBytes) && (softMaxBytes >= getUsedBytes())))

osx JVMSHRC756W Failed to set group access permission on the shared cache file as requested by the 'groupAccess' sub-option.

JVMSHRC659E An error has occurred while opening shared memory
JVMSHRC336E Port layer error code = -393970
JVMSHRC337E Platform error message: shmget : Cannot allocate memory
JVMSHRC029E Not enough memory left on the system
JVMSHRC663I Error recovery: destroyed semaphore set with id=65622 associated with shared class cache.
pshipton commented 5 years ago

Failed to set group access permission sounds like a machine problem. Not sure if this was occurring in the passing runs since there is no results file.

Mesbah-Alam commented 5 years ago

Related: https://github.com/eclipse/openj9/issues/4375

pshipton commented 5 years ago

We are getting the assert again https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-win_x86-64_cmprssptrs/146/

JVMSHRC777W The softmx limit for shared cache usage is smaller than the minimum feasible value 3980759048 bytes.
10:57:18.897 0xf39a00   j9shr.1012   *   ** ASSERTION FAILED ** at CompositeCache.cpp:6276: (((0 != _theca) && hasWriteMutex(currentThread) && (getTotalSize() >= softMaxBytes) && (softMaxBytes >= getUsedBytes())))

@hangshao0 please help figure this out.

pshipton commented 5 years ago

0.12 release build https://ci.eclipse.org/openj9/job/Test-extended.system-JDK11-win_x86-64_cmprssptrs/144

hangshao0 commented 5 years ago

23:43:37.432892070 0x0000000000ee9a00 j9prt.58 Exit <j9shmem_stat exit - contents of controlFile C:\Users\jenkins\workspace\Test-extended.system-JDK8-win_x86-64_cmprssptrs\openjdk-tests\TestConfig\test_output_15477678633841\SharedClassesAPI_0\20190117-174310-SharedClassesAPI\results\caches\javasharedresources\C290M4F1A64_DefinedLocationGroupAccessJavaWL4_G35 is corrupted

23:43:37.432899289 0x0000000000ee9a00 j9shr.2233 Exception * SH_OSCachesysv::initializeHeader: The softmx bytes passed in is too big, it is reset to the cache file size 0

The cache file size is 0 and has been corrupted.

Mesbah-Alam commented 5 years ago

There were updates made to the test by this PR yesterday: https://github.com/eclipse/openj9-systemtest/pull/79.

A 5x Grinder was run at OpenJ9 for it and it passed: https://ci.eclipse.org/openj9/view/Test/job/Test-Grinder/245/consoleFull

The grinder ran on https://ci.eclipse.org/openj9/computer/win2012r2-x86-1/

The release build, in which the test failed, ran on the same machine too : https://ci.eclipse.org/openj9/computer/win2012r2-x86-1/

hangshao0 commented 5 years ago
GEN 00:16:02.085 -     5  execute  Run java          DefaultLocationGroupAccessJava: Start workload for DefaultLocationGroupAccessJavaWL1
GEN 00:16:02.085 -     6  execute  Run java          DefaultLocationGroupAccessJava: Start workload for DefaultLocationGroupAccessJavaWL2
GEN 00:16:02.085 -     7  execute  Run java          DefaultLocationGroupAccessJava: Start workload for DefaultLocationGroupAccessJavaWL3
GEN 00:16:02.085 -     8  execute  Run java          DefaultLocationGroupAccessJava: Start workload for DefaultLocationGroupAccessJavaWL4
...
...
GEN 00:16:02.085 -    27  execute  Run java          DefinedLocationGroupAccessJava: Start workload for DefinedLocationGroupAccessJavaWL1
GEN 00:16:02.085 -    28  execute  Run java          DefinedLocationGroupAccessJava: Start workload for DefinedLocationGroupAccessJavaWL2
GEN 00:16:02.085 -    29  execute  Run java          DefinedLocationGroupAccessJava: Start workload for DefinedLocationGroupAccessJavaWL3
GEN 00:16:02.085 -    30  execute  Run java          DefinedLocationGroupAccessJava: Start workload for DefinedLocationGroupAccessJavaWL4

Step 7 & 8 always pass and step 29 & 30 always fail. The only difference I notice is that a cacheDir parameter is passed to step 29 & 30.

23:43:37.432873001  0x0000000000ee9a00 j9shr.510           Exit       <shmemOpenWrapper exiting, return code = 110, specLeng = 314572800
23:43:37.432879661  0x0000000000ee9a00 j9prt.56            Entry      >j9shmem_stat entered, name = C290M4F1A64_DefinedLocationGroupAccessJavaWL4_G35
23:43:37.432892070  0x0000000000ee9a00 j9prt.58            Exit       <j9shmem_stat exit - contents of controlFile C:\Users\jenkins\workspace\Test-extended.system-JDK8-win_x86-64_cmprssptrs\openjdk-tests\TestConfig\test_output_15477678633841\SharedClassesAPI_0\20190117-174310-SharedClassesAPI\results\caches\javasharedresources\\C290M4F1A64_DefinedLocationGroupAccessJavaWL4_G35 is corrupted

j9shr.510 shows the cache file is created successfully (return code 110 means J9PORT_INFO_SHMEM_CREATED). But the trace point j9prt.58 - Trc_PRT_shmem_j9shmem_stat_Exit2 shows up right after that. It is triggered only when CreateFileW() returns NULL or INVALID_HANDLE_VALUE.

memHandle = CreateFileW(unicodePath, GENERIC_READ, (FILE_SHARE_READ | FILE_SHARE_WRITE), NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

    if (unicodeBuffer != unicodePath) {
        omrmem_free_memory(unicodePath);
    }

    if ((NULL == memHandle) || (INVALID_HANDLE_VALUE==memHandle)) {
        Trc_PRT_shmem_j9shmem_stat_Exit2(sharedMemoryFullPath);        <---------- triggered
        omrmem_free_memory(sharedMemoryFullPath);
        return -1;
    }

So it means the cache file is created then immediately deleted.
I guess it might be related to the cacheDir being used in step 29 & 30. (C:\Users\jenkins\workspace\Test-extended.system-JDK8-win_x86-64_cmprssptrs\openjdk-tests\TestConfig\test_output_15477678633841\SharedClassesAPI_0\20190117-174310-SharedClassesAPI\results\caches). The default cacheDir in Step 7 & 8 works fine.

hangshao0 commented 5 years ago

I've created a PR to include the error code in the trace point: https://github.com/eclipse/openj9/pull/4463

hangshao0 commented 5 years ago

https://github.com/eclipse/openj9-systemtest/issues/75#issuecomment-457631704

Also @Mesbah-Alam launched a 5x grinder (Windows JDK8) in internal Jenkins yesterday, all passed.

pshipton commented 5 years ago

Perhaps the failure is machine specific? Grinders only run on a single machine.

pshipton commented 5 years ago

Although Mesba says in the comment that the same machine was used.

pshipton commented 5 years ago

If it fails in the nightly builds but not in a grinder, maybe one of the other tests is leaving a process running.

Mesbah-Alam commented 5 years ago

@hangshao0 - you mentioned about the default limit, in osx, for the maximum number of shared classes cache, that is allowed to be created - 3. Just curious, is the limit same on Windows too?

hangshao0 commented 5 years ago

Just curious, is the limit same on Windows too?

No.

The length of path "C:\Users\jenkins\workspace\Test-extended.system-JDK8-win_x86-64_cmprssptrs\openjdk-tests\TestConfig\test_output_15477678633841\SharedClassesAPI_0\20190117-174310-SharedClassesAPI\results\caches\javasharedresources\\C290M4F1A64_DefinedLocationGroupAccessJavaWL4_G35" is 262, which is greater than MAX_PATH(=260) on Windows. The cacheDir/cacheName is too long. The test needs to be updated to use a shorter cacheDir/cacheName. @Mesbah-Alam

The JVM should do a better job reporting this error.

Mesbah-Alam commented 5 years ago

@hangshao0 - I will have a pr with changed path for non-default cache location.

hangshao0 commented 5 years ago

CacheDir used in grinder is shorter than the one used in builds. So it passed in grinder but failed in builds.

CacheDir in grinder:
C:\Users\jenkins\workspace\Test-Grinder\openjdk-tests\TestConfig\test_output_15483611858909\SharedClassesAPI_0\20190124-141950-SharedClassesAPI\results\caches
CacheDir in builds:
C:\Users\jenkins\workspace\Test-extended.system-JDK8-win_x86-64_cmprssptrs\openjdk-tests\TestConfig\test_output_15477678633841\SharedClassesAPI_0\20190117-174310-SharedClassesAPI\results\caches
Mesbah-Alam commented 5 years ago

Path has been shortened in https://github.com/eclipse/openj9-systemtest/pull/80. They should now be less than MAX_PATH(=260).

hangshao0 commented 5 years ago

This test is now passing on Windows and OSX

Windows JDK8:

22:20:49 STF 21:20:51.524 - Overall result: PASSED
22:20:49 
22:20:49 SharedClassesAPI_0_PASSED

from https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-win_x86-64_cmprssptrs/152/consoleFull

Window JDK11: https://ci.eclipse.org/openj9/job/Test-extended.system-JDK11-win_x86-64_cmprssptrs/152/

OSX JDK8: https://ci.eclipse.org/openj9/job/Test-extended.system-JDK8-osx_x86-64_cmprssptrs/51/

OSX JDK11:

03:01:00 STF 00:00:59.275 - Overall result: PASSED
03:01:00 
03:01:00 SharedClassesAPI_0_PASSED

from https://ci.eclipse.org/openj9/job/Test-extended.system-JDK11-osx_x86-64_cmprssptrs/68/consoleFull

Mesbah-Alam commented 5 years ago

The JVM should do a better job reporting this error.

@hangshao0 - do we have an issue opened for this?

hangshao0 commented 5 years ago

do we have an issue opened for this?

I will create a pull request to fix this shortly.