pshipton opened 4 years ago
Not sure if there is anything to be done. @jdekonin @AdamBrousseau
There is currently 50G free in /home. Evidence would seem to point to the test requiring lots of space, or there are a lot of failures.
...
03:01:36 There is 49378 Mb free
...
03:42:39 There is 12183 Mb free
...
04:22:51 There is 1053 Mb free
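(For reference, those figures can be reproduced by hand on an AIX node; the commands below are illustrative, with the workspace path taken from the comment that follows.)
df -m /home   # free MB in /home, the same number the harness reports
du -sm /home/jenkins/workspace/* | sort -rn | head   # biggest consumers under the Jenkins workspace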
Is 50MB typical (edit: meant 50G) for the AIX machines? The only failure was related to insufficient disk space.
50G, not MB, and that is just for the /home partition. Jenkins runs jobs out of /home/jenkins/workspace/. This is exactly the same as what the older systems were using.
ok, I'll close and see if this happens again.
Similar problem on aix71-p8-10 https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly/64/
That job starts at
04:43:59 There is 48564 Mb free
and ends with
05:24:59 There is 2598 Mb free
Why is the test or JVM requiring 48G+ to run?
¯\_(ツ)_/¯
Usually it works. It's possible things were left on machines from the unscheduled Jenkins restart that occurred last night.
Today it's aix71-p8-1 https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly/70
aix71-p8-8 https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly/76
I'll run a grinder and look at the machine to see what's consuming the disk space.
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_daaLoadTest/2 aix71-p8-1
/home is 100% full. In /home/jenkins/workspace/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_daaLoadTest/openjdk-tests/TKG/test_output_15888359996312:
3.7G ./DaaLoadTest_all_special_15
3.8G ./DaaLoadTest_all_special_23
4.1G ./DaaLoadTest_all_special_26
3.8G ./DaaLoadTest_daa1_special_15
3.6G ./DaaLoadTest_daa1_special_23
3.8G ./DaaLoadTest_daa1_special_26
3.6G ./DaaLoadTest_daa1_special_27
6.5G ./DaaLoadTest_daa2_special_12
3.8G ./DaaLoadTest_daa2_special_15
3.8G ./DaaLoadTest_daa2_special_23
3.8G ./DaaLoadTest_daa2_special_26
4.4G ./DaaLoadTest_daa2_special_27
-rw-r--r-- 1 jenkins staff 3930845818 May 7 04:24 ./DaaLoadTest_all_special_15/20200507-041905-DaaLoadTest/results/core.21364858.07082350
-rw-r--r-- 1 jenkins staff 4002987930 May 7 04:45 ./DaaLoadTest_all_special_23/20200507-044458-DaaLoadTest/results/core.16056334.07084534
-rw-r--r-- 1 jenkins staff 4360081408 May 7 04:48 ./DaaLoadTest_all_special_26/20200507-044546-DaaLoadTest/results/core.17694778.07084822
-rw-r--r-- 1 jenkins staff 4063358290 May 7 03:24 ./DaaLoadTest_daa1_special_15/20200507-032312-DaaLoadTest/results/core.11993262.07072352
-rw-r--r-- 1 jenkins staff 3833249562 May 7 03:36 ./DaaLoadTest_daa1_special_23/20200507-033557-DaaLoadTest/results/core.18284746.07073631
-rw-r--r-- 1 jenkins staff 4043958922 May 7 03:37 ./DaaLoadTest_daa1_special_26/20200507-033642-DaaLoadTest/results/core.13172852.07073720
-rw-r--r-- 1 jenkins staff 3838370738 May 7 03:46 ./DaaLoadTest_daa1_special_27/20200507-033734-DaaLoadTest/results/core.21364774.07074641
-rw-r--r-- 1 jenkins staff 6901813018 May 7 03:47 ./DaaLoadTest_daa2_special_12/20200507-034654-DaaLoadTest/results/core.15007914.07074727
-rw-r--r-- 1 jenkins staff 4069192458 May 7 03:50 ./DaaLoadTest_daa2_special_15/20200507-034854-DaaLoadTest/results/core.17694850.07075002
-rw-r--r-- 1 jenkins staff 4011247218 May 7 04:01 ./DaaLoadTest_daa2_special_23/20200507-035936-DaaLoadTest/results/core.11993208.07080119
-rw-r--r-- 1 jenkins staff 4050709570 May 7 04:02 ./DaaLoadTest_daa2_special_26/20200507-040131-DaaLoadTest/results/core.13172736.07080247
-rw-r--r-- 1 jenkins staff 4715440730 May 7 04:03 ./DaaLoadTest_daa2_special_27/20200507-040300-DaaLoadTest/results/core.18284566.07080334
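Rough arithmetic on the listing above: a dozen cores at roughly 4G apiece is ~48G, which accounts for the free space that vanished. A quick illustrative way to total them from the test_output directory:
find . -name 'core.*' -exec ls -l {} \; | awk '{s+=$5} END {printf "%.1f GB in core files\n", s/1024/1024/1024}'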
The 1.DLT.stderr files are empty. The 1.DLT.stdout files are terminated abruptly.
04:23:48.702 - Completed 4.4%. Number of tests started=32 (+0)
04:23:48.802 - Completed 4.4%. Number of tests started=32 (+0)
04:23:48.903 - Completed 4.4%. Number of tests started=32 (+0)
Strangely, the tests are passing.
[2020-05-07T08:23:53.774Z] DLT 04:23:48.903 - Completed 4.4%. Number of tests started=32 (+0)
[2020-05-07T08:24:02.032Z] STF 04:24:01.429 - Monitoring Report Summary:
[2020-05-07T08:24:02.032Z] STF 04:24:01.429 - o Process DLT ended with the expected exit code (0)
[2020-05-07T08:24:02.032Z] STF 04:24:01.430 - EXECUTE stage completed
[2020-05-07T08:24:02.032Z] STF 04:24:01.931 -
[2020-05-07T08:24:02.032Z] STF 04:24:01.931 - ==================== T E A R D O W N ====================
[2020-05-07T08:24:02.032Z] STF 04:24:01.931 - Running teardown: perl /home/jenkins/workspace/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_daaLoadTest/openjdk-tests/TKG/../TKG/test_output_15888359996312/DaaLoadTest_all_special_15/20200507-041905-DaaLoadTest/tearDown.pl
[2020-05-07T08:24:02.798Z] STF 04:24:02.042 - TEARDOWN stage completed
[2020-05-07T08:24:02.798Z] STF 04:24:02.051 -
[2020-05-07T08:24:02.798Z] STF 04:24:02.051 - ===================== R E S U L T S =====================
[2020-05-07T08:24:02.798Z] STF 04:24:02.051 - Stage results:
[2020-05-07T08:24:02.798Z] STF 04:24:02.051 - setUp: pass
[2020-05-07T08:24:02.798Z] STF 04:24:02.051 - execute: pass
[2020-05-07T08:24:02.798Z] STF 04:24:02.051 - teardown: pass
[2020-05-07T08:24:02.798Z] STF 04:24:02.051 -
[2020-05-07T08:24:02.798Z] STF 04:24:02.051 - Overall result: PASSED
[2020-05-07T08:24:02.798Z]
[2020-05-07T08:24:02.798Z] DaaLoadTest_all_special_15_PASSED
Tried to get a core, but the files were cleaned up before I could.
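(For next time: a crude watcher run in another shell on the machine could squirrel cores away before teardown deletes them. The paths and interval here are made up.)
mkdir -p /home/jenkins/saved-cores
while true; do
  find /home/jenkins/workspace -name 'core.*' -type f 2>/dev/null | while read f; do
    # keep a copy before tearDown.pl removes the results directory
    [ -e "/home/jenkins/saved-cores/$(basename "$f")" ] || cp "$f" /home/jenkins/saved-cores/
  done
  sleep 30
done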
12: variation: Mode351 JVM_OPTIONS: -Xgcpolicy:metronome -Xcompressedrefs
15: variation: Mode553 JVM_OPTIONS: -XX:+UseCompressedOops -Xgcpolicy:balanced -Xjit:count=0
23: variation: Mode688 JVM_OPTIONS: -Xcompressedrefs -Xjit:count=0 -Xgcpolicy:gencon -Xaggressive -Xconcurrentlevel0
26: variation: Mode610-OSRG JVM_OPTIONS: -Xcompressedrefs -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation -Xgcpolicy:gencon
27: variation: Mode610-OSRG JVM_OPTIONS: -Xcompressedrefs -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation -Xgcpolicy:gencon
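Most of the failing variations force early JIT compilation (count=0 or count=1). A hypothetical standalone rerun of variation 27's options would look like the following; the jar and main class are placeholders, not the real STF invocation:
java -Xcompressedrefs -Xjit:enableOSR,enableOSROnGuardFailure,count=1,disableAsyncCompilation -Xgcpolicy:gencon -cp daa-tests.jar SomeDaaTest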
Started a new build in order to look at / get a core. https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_daaLoadTest/3/
@llxia Do the new parallel builds cleanWs after each test bucket? By the looks of Pete's comment, they don't.
The results were cleaned when the job ended. I was able to look around before the job finished.
Ah OK, my mistake. I thought at first those were all different branches of the parallel builds.
The rebuild ran on aix71-p8-4 and passed. Started another to run on aix71-p8-1 https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_daaLoadTest/4
The rebuild on aix71-p8-1 hasn't completed yet, but it has run a number of the tests that produced cores before, and there are no cores in the rebuild.
Finally managed to catch this in action.
Segmentation fault in Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int) at 0x90000000d47c054 ($t5)
0x90000000d47c054 (findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)+0x54) f821ff41 stdu r1,-192(r1)
(dbx) where
The findCycle frame is repeated many, many times; most of the repetitions have been removed.
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)() at 0x90000000d47c4e8
TR_RegionStructure::checkForInternalCycles()() at 0x90000000d47bf44
TR_RegionStructure::cleanupAfterEdgeRemoval(TR::CFGNode*)() at 0x90000000d47bad4
TR_RegionStructure::removeEdge(TR_Structure*,TR_Structure*)() at 0x90000000d7491d0
Structure.std::logic_error::what() const.TR_BlockStructure::removeEdge(TR_Structure*,TR_Structure*)() at 0x90000000c9a3bac
OMR::CFG::removeEdge(TR::CFGEdge*)() at 0x90000000cbbc9a0
TR_OSRGuardInsertion::removeHCRGuards(TR_BitVector&,TR_HCRGuardAnalysis*)() at 0x90000000d722824
TR_OSRGuardInsertion::perform()() at 0x90000000d706044
OMR::Optimizer::performOptimization(const OptimizationStrategy*,int,int,int)() at 0x90000000cc4d7c4
OMR::Optimizer::optimize()() at 0x90000000cc4b2b4
OMR::Compilation::compile()() at 0x90000000cc46968
CompilationThread.TR::CompilationInfoPerThreadBase::compile(J9VMThread*,TR::Compilation*,TR_ResolvedMethod*,TR_J9VMBase&,TR_OptimizationPlan*,const TR::SegmentAllocator&)() at 0x90000000cc43520
TR::CompilationInfoPerThreadBase::wrappedCompile(J9PortLibrary*,void*)() at 0x90000000cc2ca8c
omrsig_protect(??, ??, ??, ??, ??, ??, ??), line 425 in "omrsignal.c"
CompilationThread.TR::CompilationInfoPerThreadBase::compile(J9VMThread*,TR_MethodToBeCompiled*,J9::J9SegmentProvider&)() at 0x90000000cc11334
TR::CompilationInfoPerThread::processEntry(TR_MethodToBeCompiled&,J9::J9SegmentProvider&)() at 0x90000000cf8978c
TR::CompilationInfoPerThread::processEntries()() at 0x90000000cf8bcc8
protectedCompilationThreadProc(J9PortLibrary*,TR::CompilationInfoPerThread*)() at 0x90000000cf8af94
omrsig_protect(??, ??, ??, ??, ??, ??, ??), line 425 in "omrsignal.c"
compilationThreadProc(void*)() at 0x90000000cf8a770
thread_wrapper(arg = (nil)), line 1714 in "omrthread.c"
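(For reference, a dbx session against the java binary and a saved core would produce the same kind of transcript; paths here are illustrative, the core name is from the listing above.)
$ dbx /path/to/jdk/bin/java core.21364858.07082350
(dbx) where
(dbx) registers
(dbx) quit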
@gita-omr @AlenBadel I assume this is a dup of https://github.com/eclipse/omr/issues/5220?
@Mesbah-Alam can you please look at why the tests are getting a seg fault and generating a core, but still passing? https://github.com/eclipse/openj9/issues/9382#issuecomment-625283280
The stack looks similar, but what we've been seeing is a stack overflow rather than a segmentation fault. Could it be an invalid TR_StructureSubGraphNode within the graph?
Tried to grab a core file just in case, but the test timed out and removed the directory before I was finished. Can try again the next time I see this, if necessary.
@IBMJimmyk fyi.
Didn't recur in the builds last night.
From the previous comment:
Segmentation fault in Structure.std::logic_error::what() const.findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int) at 0x90000000d47c054 ($t5)
0x90000000d47c054 (findCycle(TR_StructureSubGraphNode*,TR_BitVector&,TR_BitVector&,TR_BitVector&,int)+0x54) f821ff41 stdu r1,-192(r1)
It crashed at stdu r1,-192(r1), and that's updating the stack pointer. Combine that with the long stack of calls to findCycle, and I think this is still the same stack overflow problem.
If possible you could try looking at what value was in $r1. If it's complete garbage, that's a bad sign and the problem might be something else. But I suspect it will just be right beside a very large stack that is holding the stack frames from all the findCycle calls.
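(Illustrative dbx steps for that check the next time a core is live; $r1 through $r31 can be printed directly at the dbx prompt.)
(dbx) print $r1   # stack pointer at the fault
(dbx) where       # how deep the repeated findCycle frames above it go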
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_daaLoadTest/17
Captured a jextract'ed core. I put it into /team/triage/openj9-9382, but if you want it somewhere else, like Box, let me know.
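(For anyone picking up the artifact: jextract is the tool shipped with the OpenJ9 SDK that bundles a core with the matching libraries for offline analysis. The invocation is roughly as follows; the path and core name are illustrative.)
$ /path/to/jdk/bin/jextract core.12345.67890
# produces core.12345.67890.zip alongside the core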
(dbx) registers
$r0:0x0000000000000000 $stkp:0x0000010020980080 $toc:0x08001000a01d7a08
$r3:0x00000000000000f6 $r4:0x0000010020a84190 $r5:0x0000010020a84150
$r6:0x0000010020a84170 $r7:0x0000000000000000 $r8:0x0000000000000000
$r9:0x0000010020a84200 $r10:0x00000000000000f7 $r11:0x0000000000000800
$r12:0x0000000000000000 $r13:0x0000010020a98800 $r14:0x0000000000004e48
$r15:0x0a00000030000000 $r16:0x0000000000000000 $r17:0x0000000000000011
$r18:0x0a00000030011a80 $r19:0x0a00000030011a80 $r20:0x0000000000000000
$r21:0x0000000000000001 $r22:0x0000000000000001 $r23:0xfffffffffffff800
$r24:0x0000000000003df4 $r25:0x0000000000000000 $r26:0x0000000000000000
$r27:0x0000010020a84150 $r28:0x0000000000000001 $r29:0x0000010020a84190
$r30:0x0000010020a84170 $r31:0x0a00000040ffca90
$iar:0x090000000ae92054 $msr:0xa00000000000d032 $cr:0x42204244
$link:0x090000000ae924ec $ctr:0x0000000000000000 $xer:0x00000008
Condition status = 0:g 1:e 2:e 4:g 5:e 6:g 7:g
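Re the earlier suggestion to look at $r1: in dbx's registers output r1 appears as $stkp (print $r1 at the prompt shows the same value), so the value in question is 0x0000010020980080. That looks like an ordinary address in the range where AIX allocates pthread stacks rather than garbage, which would fit the stack-overflow theory.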
The OMR change https://github.com/eclipse/omr/pull/5221 is in last night's jdk11 build, and this problem wasn't seen. I'll go ahead and close this, and will reopen if it's seen again.
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly/58/ aix71-p8-5
Not sure if the above is the first occurrence. We no longer have build results for the previous build, and I'm not sure if it passed the tests, but the SHAs are: