eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 721 forks source link

Core files generated on AIX running DAA modes #9382

Open pshipton opened 4 years ago

pshipton commented 4 years ago

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly/58/ aix71-p8-5

OpenJ9: db73f46
OMR: cfd6690
OpenJDK11: 1553321 - openj9-openjdk-jdk11

Not sure if the above is the first occurrence. No longer have build results for the previous build, not sure if it passed the tests, but the shas are:

OpenJ9: 7d740b4
OMR: e1ac81b
OpenJDK11: 1553321 - openj9-openjdk-jdk11
pshipton commented 4 years ago

Not sure if there is anything to be done. @jdekonin @AdamBrousseau

jdekonin commented 4 years ago

There is currently 50G free in /home. Evidence would seem to point to the test requiring lots of space, or there is a lot of failures.

...
03:01:36  There is 49378 Mb free
...
03:42:39  There is 12183 Mb free
...
04:22:51  There is 1053 Mb free
pshipton commented 4 years ago

Is 50MB typical (edit: meant 50G) for the AIX machines? The only failure was related to insufficient disk space.

jdekonin commented 4 years ago

50G not MB and that is just for the /home partition. jenkins runs jobs out of /home/jenkins/workspace/ This is exactly the same as what the older systems were using.

pshipton commented 4 years ago

ok, I'll close and see if this happens again.

pshipton commented 4 years ago

Similar problem on aix71-p8-10 https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly/64/

jdekonin commented 4 years ago

That job starts at 04:43:59 There is 48564 Mb free and ends with 05:24:59 There is 2598 Mb free. Why is the test or JVM requiring 48G+ to run?

pshipton commented 4 years ago

¯\_(ツ)_/¯

pshipton commented 4 years ago

Usually it works. It's possible things were left on machines from the unscheduled jenkins restart that occurred last night.

pshipton commented 4 years ago

Today it's aix71-p8-1 https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly/70

pshipton commented 4 years ago

aix71-p8-8 https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly/76

I'll run a grinder and look at the machine to see what's consuming the disk space.

pshipton commented 4 years ago

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_daaLoadTest/1 aix71-p8-8

pshipton commented 4 years ago

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_daaLoadTest/2 aix71-p8-1 /home is 100% full. /home/jenkins/workspace/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_daaLoadTest/openjdk-tests/TKG/test_output_15888359996312: 3.7G ./DaaLoadTest_all_special_15 3.8G ./DaaLoadTest_all_special_23 4.1G ./DaaLoadTest_all_special_26 3.8G ./DaaLoadTest_daa1_special_15 3.6G ./DaaLoadTest_daa1_special_23 3.8G ./DaaLoadTest_daa1_special_26 3.6G ./DaaLoadTest_daa1_special_27 6.5G ./DaaLoadTest_daa2_special_12 3.8G ./DaaLoadTest_daa2_special_15 3.8G ./DaaLoadTest_daa2_special_23 3.8G ./DaaLoadTest_daa2_special_26 4.4G ./DaaLoadTest_daa2_special_27

-rw-r--r-- 1 jenkins staff 3930845818 May  7 04:24 ./DaaLoadTest_all_special_15/20200507-041905-DaaLoadTest/results/core.21364858.07082350
-rw-r--r-- 1 jenkins staff 4002987930 May  7 04:45 ./DaaLoadTest_all_special_23/20200507-044458-DaaLoadTest/results/core.16056334.07084534
-rw-r--r-- 1 jenkins staff 4360081408 May  7 04:48 ./DaaLoadTest_all_special_26/20200507-044546-DaaLoadTest/results/core.17694778.07084822
-rw-r--r-- 1 jenkins staff 4063358290 May  7 03:24 ./DaaLoadTest_daa1_special_15/20200507-032312-DaaLoadTest/results/core.11993262.07072352
-rw-r--r-- 1 jenkins staff 3833249562 May  7 03:36 ./DaaLoadTest_daa1_special_23/20200507-033557-DaaLoadTest/results/core.18284746.07073631
-rw-r--r-- 1 jenkins staff 4043958922 May  7 03:37 ./DaaLoadTest_daa1_special_26/20200507-033642-DaaLoadTest/results/core.13172852.07073720
-rw-r--r-- 1 jenkins staff 3838370738 May  7 03:46 ./DaaLoadTest_daa1_special_27/20200507-033734-DaaLoadTest/results/core.21364774.07074641
-rw-r--r-- 1 jenkins staff 6901813018 May  7 03:47 ./DaaLoadTest_daa2_special_12/20200507-034654-DaaLoadTest/results/core.15007914.07074727
-rw-r--r-- 1 jenkins staff 4069192458 May  7 03:50 ./DaaLoadTest_daa2_special_15/20200507-034854-DaaLoadTest/results/core.17694850.07075002
-rw-r--r-- 1 jenkins staff 4011247218 May  7 04:01 ./DaaLoadTest_daa2_special_23/20200507-035936-DaaLoadTest/results/core.11993208.07080119
-rw-r--r-- 1 jenkins staff 4050709570 May  7 04:02 ./DaaLoadTest_daa2_special_26/20200507-040131-DaaLoadTest/results/core.13172736.07080247
-rw-r--r-- 1 jenkins staff 4715440730 May  7 04:03 ./DaaLoadTest_daa2_special_27/20200507-040300-DaaLoadTest/results/core.18284566.07080334

The 1.DLT.stderr files are empty. The 1.DLT.stdout files are terminated abruptly.

04:23:48.702 - Completed 4.4%. Number of tests started=32 (+0)
04:23:48.802 - Completed 4.4%. Number of tests started=32 (+0)
04:23:48.903 - Completed 4.4%. Number of tests started=32 (+0)

Strangely the tests are passing.

[2020-05-07T08:23:53.774Z] DLT 04:23:48.903 - Completed 4.4%. Number of tests started=32 (+0)
[2020-05-07T08:24:02.032Z] STF 04:24:01.429 - Monitoring Report Summary:
[2020-05-07T08:24:02.032Z] STF 04:24:01.429 -   o Process DLT ended with the expected exit code (0)
[2020-05-07T08:24:02.032Z] STF 04:24:01.430 - EXECUTE stage completed
[2020-05-07T08:24:02.032Z] STF 04:24:01.931 - 
[2020-05-07T08:24:02.032Z] STF 04:24:01.931 - ====================   T E A R D O W N   ====================
[2020-05-07T08:24:02.032Z] STF 04:24:01.931 - Running teardown: perl /home/jenkins/workspace/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_daaLoadTest/openjdk-tests/TKG/../TKG/test_output_15888359996312/DaaLoadTest_all_special_15/20200507-041905-DaaLoadTest/tearDown.pl
[2020-05-07T08:24:02.798Z] STF 04:24:02.042 - TEARDOWN stage completed
[2020-05-07T08:24:02.798Z] STF 04:24:02.051 - 
[2020-05-07T08:24:02.798Z] STF 04:24:02.051 - =====================   R E S U L T S   =====================
[2020-05-07T08:24:02.798Z] STF 04:24:02.051 - Stage results:
[2020-05-07T08:24:02.798Z] STF 04:24:02.051 -   setUp:     pass
[2020-05-07T08:24:02.798Z] STF 04:24:02.051 -   execute:   pass
[2020-05-07T08:24:02.798Z] STF 04:24:02.051 -   teardown:  pass
[2020-05-07T08:24:02.798Z] STF 04:24:02.051 - 
[2020-05-07T08:24:02.798Z] STF 04:24:02.051 - Overall result: PASSED
[2020-05-07T08:24:02.798Z] 
[2020-05-07T08:24:02.798Z] DaaLoadTest_all_special_15_PASSED

Tried to get a core, but the files were cleaned up before I could.

12: variation: Mode351 JVM_OPTIONS: -Xgcpolicy:metronome -Xcompressedrefs

15: variation: Mode553 JVM_OPTIONS: -XX:+UseCompressedOops -Xgcpolicy:balanced -Xjit:count=0

23: variation: Mode688 JVM_OPTIONS: -Xcompressedrefs -Xjit:count=0 -Xgcpolicy:gencon -Xaggressive -Xconcurrentlevel0

26: variation: Mode610-OSRG JVM_OPTIONS: -Xcompressedrefs -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation -Xgcpolicy:gencon

27: variation: Mode610-OSRG JVM_OPTIONS: -Xcompressedrefs -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation -Xgcpolicy:gencon

Started a new build in order to look at / get a core. https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_daaLoadTest/3/

AdamBrousseau commented 4 years ago

@llxia Do the new parallel builds cleanWs after each test bucket. By the looks of Pete's comment they don't.

pshipton commented 4 years ago

The results were cleaned when the job ended. I was able to look around before the job finished.

AdamBrousseau commented 4 years ago

Ah ok my mistake. I thought at first those were all different branches of the parallel builds.

pshipton commented 4 years ago

The rebuild ran on aix71-p8-4 and passed. Started another to run on aix71-p8-1 https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_daaLoadTest/4

pshipton commented 4 years ago

The rebuild on aix71-p8-1 isn't yet completed, but it's run a number of the tests that had cores before, but there are no cores in the rebuild.

pshipton commented 4 years ago

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Personal_daaLoadTest/1 aix71-p8-6

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_daaLoadTest/8 aix71-p8-9

pshipton commented 4 years ago

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_daaLoadTest/12 aix71-p8-9

pshipton commented 4 years ago

Finally managed to catch this in action.

Segmentation fault in Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int) at 0x90000000d47c054 ($t5)
0x90000000d47c054 (findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)+0x54) f821ff41           stdu   r1,-192(r1)
(dbx) where

const.findCycle is repeated many many times, removed most of them.

Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
TR_RegionStructure::checkForInternalCycles()() at 0x90000000d47bf44
TR_RegionStructure::cleanupAfterEdgeRemoval(TR::CFGNode*)() at 0x90000000d47bad4
TR_RegionStructure::removeEdge(TR_Structure*,TR_Structure*)() at 0x90000000d7491d0
Structure.std::logic_error::what() const.TR_BlockStructure::removeEdge(TR_Structure*,TR_Structure*)() at 0x90000000c9a3bac
OMR::CFG::removeEdge(TR::CFGEdge*)() at 0x90000000cbbc9a0
TR_OSRGuardInsertion::removeHCRGuards(TR_BitVector&,TR_HCRGuardAnalysis*)() at 0x90000000d722824
TR_OSRGuardInsertion::perform()() at 0x90000000d706044
OMR::Optimizer::performOptimization(const OptimizationStrategy*,int,int,int)() at 0x90000000cc4d7c4
OMR::Optimizer::optimize()() at 0x90000000cc4b2b4
OMR::Compilation::compile()() at 0x90000000cc46968
CompilationThread.TR::CompilationInfoPerThreadBase::compile(J9VMThread*,TR::Compilation*,TR_ResolvedMethod*,TR_J9VMBase&,TR_OptimizationPlan*,const TR::SegmentAllocator&)() at 0x90000000cc43520
TR::CompilationInfoPerThreadBase::wrappedCompile(J9PortLibrary*,void*)() at 0x90000000cc2ca8c
omrsig_protect(??, ??, ??, ??, ??, ??, ??), line 425 in "omrsignal.c"
CompilationThread.TR::CompilationInfoPerThreadBase::compile(J9VMThread*,TR_MethodToBeCompiled*,J9::J9SegmentProvider&)() at 0x90000000cc11334
TR::CompilationInfoPerThread::processEntry(TR_MethodToBeCompiled&,J9::J9SegmentProvider&)() at 0x90000000cf8978c
TR::CompilationInfoPerThread::processEntries()() at 0x90000000cf8bcc8
protectedCompilationThreadProc(J9PortLibrary*,TR::CompilationInfoPerThread*)() at 0x90000000cf8af94
omrsig_protect(??, ??, ??, ??, ??, ??, ??), line 425 in "omrsignal.c"
compilationThreadProc(void*)() at 0x90000000cf8a770
thread_wrapper(arg = (nil)), line 1714 in "omrthread.c"
pshipton commented 4 years ago

@gita-omr @AlenBadel I assume this a dup of https://github.com/eclipse/omr/issues/5220 ?

pshipton commented 4 years ago

@Mesbah-Alam can you please look at why the tests are getting a seg fault, generating core, but still passing. https://github.com/eclipse/openj9/issues/9382#issuecomment-625283280

AlenBadel commented 4 years ago

The stack looks similar, but what we've been seeing is a stack overflow rather than a segmentation fault. Could it be an invalid TR_StructureSubGraphNode within the graph?

pshipton commented 4 years ago

Tried to grab a core file just in case, but the test timed out and removed the directory before I was finished. Can try again the next time I see this, if necessary.

gita-omr commented 4 years ago

@IBMJimmyk fyi.

pshipton commented 4 years ago

Didn't re-occur in the builds last night.

IBMJimmyk commented 4 years ago

From the previous comment

Segmentation fault in Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int) at 0x90000000d47c054 ($t5)
0x90000000d47c054 (findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)+0x54) f821ff41           stdu   r1,-192(r1)

It crashed at stdu r1,-192(r1) and that's updating the stack pointer. Combine that with the long stack of calls to findCycle and I think this is still the same stack overflow problem.

If possible you could try looking at what value was in $r1. If it's complete garbage, that's a bad sign and the problem might be something else. But I suspect it will just be right beside a very large stack that is holding the stack frames from all the findCycle calls.

pshipton commented 4 years ago

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_daaLoadTest/17

Captured a jextact'ed core. I put it into /team/triage/openj9-9382, but if you want it somewhere else, like Box, let me know.

pshipton commented 4 years ago
(dbx) registers
  $r0:0x0000000000000000  $stkp:0x0000010020980080   $toc:0x08001000a01d7a08  
  $r3:0x00000000000000f6    $r4:0x0000010020a84190    $r5:0x0000010020a84150  
  $r6:0x0000010020a84170    $r7:0x0000000000000000    $r8:0x0000000000000000  
  $r9:0x0000010020a84200   $r10:0x00000000000000f7   $r11:0x0000000000000800  
 $r12:0x0000000000000000   $r13:0x0000010020a98800   $r14:0x0000000000004e48  
 $r15:0x0a00000030000000   $r16:0x0000000000000000   $r17:0x0000000000000011  
 $r18:0x0a00000030011a80   $r19:0x0a00000030011a80   $r20:0x0000000000000000  
 $r21:0x0000000000000001   $r22:0x0000000000000001   $r23:0xfffffffffffff800  
 $r24:0x0000000000003df4   $r25:0x0000000000000000   $r26:0x0000000000000000  
 $r27:0x0000010020a84150   $r28:0x0000000000000001   $r29:0x0000010020a84190  
 $r30:0x0000010020a84170   $r31:0x0a00000040ffca90  
 $iar:0x090000000ae92054   $msr:0xa00000000000d032    $cr:0x42204244  
$link:0x090000000ae924ec   $ctr:0x0000000000000000   $xer:0x00000008  

          Condition status = 0:g 1:e 2:e 4:g 5:e 6:g 7:g 
pshipton commented 4 years ago

The OMR change https://github.com/eclipse/omr/pull/5221 is in the jdk11 build last night, and this problem wasn't seen. I'll go ahead and close this, will reopen if it's seen again.