Open M-Davies opened 4 years ago
Tagging as jit
22:51:36 [2020-03-25 22:51:33,041] Agent[1]: stderr: Type=Segmentation error vmState=0x000561ad
22:51:36 [2020-03-25 22:51:33,042] Agent[1]: stderr: Method_being_compiled=sun/nio/cs/ext/GB18030$Decoder.decodeBufferLoop(Ljava/nio/ByteBuffer;Ljava/nio/CharBuffer;)Ljava/nio/charset/CoderResult;
22:51:36 [2020-03-25 22:51:33,042] Agent[1]: stderr: Target=2_90_20200325_111 (Linux 4.4.0-170-generic)
fyi @fjeremic
Launching a grinder to find reproducibility rate: https://ci.adoptopenjdk.net/job/Grinder/2585/
Grinder is showing 1/10 failure rate. Unfortunately debug symbols are not present in the JVM which ran this build (#8874 is tracking the fix), so I had to work around to get the symbols. I downloaded the same nightly build [1] and loaded the core file from the failure job.
gdb
fails to load all the shared libraries because the path on the grinder machine is different. We can fix this by supplying the correct path via set solib-search-path
to add the directory where all the missing shared libraries are supposed to be in the build we downloaded from [1] on our local system. gdb
is then able to load the symbols from the shared library. Still however the bt
command does not work for whatever reason. We do have an escape hatch though as the symbol addresses are printed in the original standard output of the failure, so we can create a quick gdb
script:
define printBackTrace
info symbol 0x000003FF9E239332
info symbol 0x000003FF9E23A022
info symbol 0x000003FF9E238986
info symbol 0x000003FF9E23A022
info symbol 0x000003FF9E238986
info symbol 0x000003FF9E23A022
info symbol 0x000003FF9E23F576
info symbol 0x000003FF9E47E982
info symbol 0x000003FF9E4FD41E
info symbol 0x000003FF9E4FD604
info symbol 0x000003FF9E4FCBA6
info symbol 0x000003FF9E4FD23A
info symbol 0x000003FF9E3C3E34
info symbol 0x000003FF9E3C65D8
info symbol 0x000003FF9E3C5894
info symbol 0x000003FF9E3C6D80
info symbol 0x000003FF9E1769EC
info symbol 0x000003FF9DDF07FA
info symbol 0x000003FF9DDF172A
info symbol 0x000003FF9F622420
info symbol 0x000003FF9DDF3430
info symbol 0x000003FF9DDF3A76
info symbol 0x000003FF9DDEF448
info symbol 0x000003FF9DDEF710
info symbol 0x000003FF9DDEF7F8
info symbol 0x000003FF9F622420
info symbol 0x000003FF9DDEFD1A
info symbol 0x000003FF9F7058F0
info symbol 0x000003FFA0687934
info symbol 0x000003FFA046DCE2
end
Running this script yields:
>>> printBackTrace
TR_ForwardDFSetAnalysis<TR_BitVector*>::analyzeNodeIfPredecessorsAnalyzed(TR_RegionStructure*, TR_BitVector&) + 3570 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
TR_ForwardDFSetAnalysis<TR_BitVector*>::analyzeRegionStructure(TR_RegionStructure*, bool) + 858 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
TR_ForwardDFSetAnalysis<TR_BitVector*>::analyzeNodeIfPredecessorsAnalyzed(TR_RegionStructure*, TR_BitVector&) + 1094 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
TR_ForwardDFSetAnalysis<TR_BitVector*>::analyzeRegionStructure(TR_RegionStructure*, bool) + 858 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
TR_ForwardDFSetAnalysis<TR_BitVector*>::analyzeNodeIfPredecessorsAnalyzed(TR_RegionStructure*, TR_BitVector&) + 1094 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
TR_ForwardDFSetAnalysis<TR_BitVector*>::analyzeRegionStructure(TR_RegionStructure*, bool) + 858 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
TR_BasicDFSetAnalysis<TR_BitVector*>::performAnalysis(TR_Structure*, bool) + 158 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
TR_ReachingDefinitions::perform() + 178 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
TR_UseDefInfo::_runReachingDefinitions(TR_ReachingDefinitions&, TR_UseDefInfo::AuxiliaryData&) + 94 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
TR_UseDefInfo::performAnalysis(TR_UseDefInfo::AuxiliaryData&) + 164 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
TR_UseDefInfo::prepareUseDefInfo(bool, bool, bool, bool) + 3142 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
TR_UseDefInfo::TR_UseDefInfo(TR::Compilation*, TR::CFG*, TR::Optimizer*, bool, bool, bool, bool, bool, bool, bool) + 810 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
OMR::Optimizer::createUseDefInfo(TR::Compilation*, bool, bool, bool, bool, bool, bool) + 220 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
OMR::Optimizer::performOptimization(OptimizationStrategy const*, int, int, int) + 6936 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
OMR::Optimizer::performOptimization(OptimizationStrategy const*, int, int, int) + 3540 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
OMR::Optimizer::optimize() + 496 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
OMR::Compilation::compile() + 3236 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
TR::CompilationInfoPerThreadBase::compile(J9VMThread*, TR::Compilation*, TR_ResolvedMethod*, TR_J9VMBase&, TR_OptimizationPlan*, TR::SegmentAllocator const&) + 1282 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
TR::CompilationInfoPerThreadBase::wrappedCompile(J9PortLibrary*, void*) + 1026 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
omrsig_protect + 872 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9prt29.so
TR::CompilationInfoPerThreadBase::compile(J9VMThread*, TR_MethodToBeCompiled*, J9::J9SegmentProvider&) + 1088 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
TR::CompilationInfoPerThread::processEntry(TR_MethodToBeCompiled&, J9::J9SegmentProvider&) + 590 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.soTR::CompilationInfoPerThread::processEntries() + 1088 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
TR::CompilationInfoPerThread::run() + 56 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
protectedCompilationThreadProc(J9PortLibrary*, TR::CompilationInfoPerThread*) + 160 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
omrsig_protect + 872 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9prt29.so
compilationThreadProc(void*) + 578 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9jit29.so
thread_wrapper + 280 in section .text of /sandbox1/fjeremic/temp/jdk8u252-b07/jre/lib/s390x/default/libj9thr29.so
pthread_create@@GLIBC_2.2 + 3044 in section .text of /lib/s390x-linux-gnu/libpthread.so.0
putsgent + 258 in section .text of /lib/s390x-linux-gnu/libc.so.6
So the issue happens in the optimizer while creating usedef info. @PushkarBettadpur could you please help the optimizer team investigate this one? Let's help the experts collect trace logs with necessary tracing information so we can determine what went wrong. As a first step please reach out to @andrewcraik and/or @cathyzhyi to figure out what tracing options we may need, then let's help collect the logging.
The dataflow engine has been very stable - I would suggest running traceFull paranoidOptCheck and see if we can make sure the trees are valid before the opt.
Took a stab at this recently and unfortunately what I'm finding is that once I start adding any tracing options, the failure isn't reproducible anymore. Perhaps traceFull
might be too heavy in this case?
The following is a grinder from a nightly build a few days ago which shows the failure is still present: https://ci.adoptopenjdk.net/job/Grinder/2917
This problem is effectively impossible to reproduce with logging as the benchmark runs only for several seconds and is very sensitive to any tracing. Even with a specific inlining plan I was not able to get a reproduction of the issue massaging the command line.
However #9120 to the rescue! Using my latest set of prototype changes to improvements to the jitdump I am able to gather the necessary logs with traceFull,paranoidOptCheck
. Unfortunately paranoid opt check does not seem to catch any issues here. I'm attaching the jitdump with tracing usedef generation as well as BVA:
jitdump.20200427.140209.43974.0004.zip
The crash happens here:
│0x3ff90d3eb5e <_ZN23TR_ForwardDFSetAnalysisIP12TR_BitVectorE33analyzeNodeIfPredecessorsAnalyzedEP18TR_RegionStructureRS0_+3526> brasl %r14,0x3ff90ceb2c0 <_ZN12TR_BitVector5printEPN2TR│
│0x3ff90d3eb64 <_ZN23TR_ForwardDFSetAnalysisIP12TR_BitVectorE33analyzeNodeIfPredecessorsAnalyzedEP18TR_RegionStructureRS0_+3532> lg %r2,184(%r15) │
│0x3ff90d3eb6a <_ZN23TR_ForwardDFSetAnalysisIP12TR_BitVectorE33analyzeNodeIfPredecessorsAnalyzedEP18TR_RegionStructureRS0_+3538> lg %r1,200(%r15) │
│0x3ff90d3eb70 <_ZN23TR_ForwardDFSetAnalysisIP12TR_BitVectorE33analyzeNodeIfPredecessorsAnalyzedEP18TR_RegionStructureRS0_+3544> lgf %r4,96(%r1) │
│0x3ff90d3eb76 <_ZN23TR_ForwardDFSetAnalysisIP12TR_BitVectorE33analyzeNodeIfPredecessorsAnalyzedEP18TR_RegionStructureRS0_+3550> lg %r3,72(%r2) │
│0x3ff90d3eb7c <_ZN23TR_ForwardDFSetAnalysisIP12TR_BitVectorE33analyzeNodeIfPredecessorsAnalyzedEP18TR_RegionStructureRS0_+3556> brasl %r14,0x3ff90d37d70 <_ZN21TR_BasicDFSetAnalysisIP1│
│0x3ff90d3eb82 <_ZN23TR_ForwardDFSetAnalysisIP12TR_BitVectorE33analyzeNodeIfPredecessorsAnalyzedEP18TR_RegionStructureRS0_+3562> l %r4,24(%r11) │
│0x3ff90d3eb86 <_ZN23TR_ForwardDFSetAnalysisIP12TR_BitVectorE33analyzeNodeIfPredecessorsAnalyzedEP18TR_RegionStructureRS0_+3566> lgr %r10,%r2 │
>│0x3ff90d3eb8a <_ZN23TR_ForwardDFSetAnalysisIP12TR_BitVectorE33analyzeNodeIfPredecessorsAnalyzedEP18TR_RegionStructureRS0_+3570> l %r2,20(%r2) │
│0x3ff90d3eb8e <_ZN23TR_ForwardDFSetAnalysisIP12TR_BitVectorE33analyzeNodeIfPredecessorsAnalyzedEP18TR_RegionStructureRS0_+3574> c %r4,24(%r10) │
│0x3ff90d3eb92 <_ZN23TR_ForwardDFSetAnalysisIP12TR_BitVectorE33analyzeNodeIfPredecessorsAnalyzedEP18TR_RegionStructureRS0_+3578> je 0x3ff90d3f0ae <_ZN23TR_ForwardDFSetAnalysisIP12TR│
│0x3ff90d3eb96 <_ZN23TR_ForwardDFSetAnalysisIP12TR_BitVectorE33analyzeNodeIfPredecessorsAnalyzedEP18TR_RegionStructureRS0_+3582> l %r3,16(%r11) │
│0x3ff90d3eb9a <_ZN23TR_ForwardDFSetAnalysisIP12TR_BitVectorE33analyzeNodeIfPredecessorsAnalyzedEP18TR_RegionStructureRS0_+3586> c %r3,16(%r10) │
│0x3ff90d3eb9e <_ZN23TR_ForwardDFSetAnalysisIP12TR_BitVectorE33analyzeNodeIfPredecessorsAnalyzedEP18TR_RegionStructureRS0_+3590> jh 0x3ff90d3f2ee <_ZN23TR_ForwardDFSetAnalysisIP12TR│
│0x3ff90d3eba2 <_ZN23TR_ForwardDFSetAnalysisIP12TR_BitVectorE33analyzeNodeIfPredecessorsAnalyzedEP18TR_RegionStructureRS0_+3594> ltr %r4,%r4 │
Which seems to correspond to the following source code location: https://github.com/eclipse/omr/blob/cfd669060771dd827724daa1622b625ab5aa85aa/compiler/optimizer/BitVectorAnalysis.cpp#L1319
The call (brasl
) immediately preceeding the crash is a call to getContainer
which gets the TR_BitVector
object which we use for the comparison. The equality operator of the TR_BitVector
class can be found here:
https://github.com/eclipse/omr/blob/master/compiler/infra/BitVector.hpp#L594-L611
Which gets inlined into the crashing function. We can see this because the first thing we do in the equality operator is compare the value of _lastChunkWithNonZero
, which happens to be at offset 24 in this JVM:
https://github.com/eclipse/omr/blob/cfd669060771dd827724daa1622b625ab5aa85aa/compiler/infra/BitVector.hpp#L847-L855
That is, (8 + 8 + 4 + 4). And this aligns with the instruction sequence which loads 24 bytes off from the object returned by getContainer
. It appears the the bitvector returned by getContainer
was NULL
for some reason.
@andrewcraik I think we'll need an expert in this area to take a look at the log. Transferring the issue over to you. I can collect additional tracing if needed via the jitdump mechanism. @vijaysun-omr FYI for jitdump work paying dividends already.
Unfortunately paranoid opt check does not seem to catch any issues here
Paraphrasing from the following comment https://github.com/eclipse/openj9/issues/8758#issuecomment-599708966, to get the legacy verifier to report errors, you have to be running a debug build of the JIT, define the TR_DEBUG="checkTypes=1"
environment variable, and be generating a log file.
Failure link
JDK8/J9
test category, openjdk
OS/architecture, s390x_xl (https://ci.adoptopenjdk.net/job/Test_openjdk8_j9_sanity.openjdk_s390x_linux_xl/72/)
Optional info
intermittent failure (yes)
New test
Failure output (captured from console output)
Dump Files https://ibm.box.com/shared/static/bq89d6wwq34eg7j85gji425fc4r1ixxd.gz