pshipton closed this 4 years ago
@fjeremic
First seen in the 4pm OMR acceptance yesterday, which might help narrow down the list of changes. https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.functional_s390x_linux_xl_OMR_testList_0/6 https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.functional_s390x_linux_xl_OMR_testList_1/6
@r30shah @harryyu1994 this is a crash in RA:
bin/java -Xjit:vmstate=0x0005ff06 -version
vmState [0x5ff06]: {J9VMSTATE_JIT_CODEGEN} {RegisterAssigning}
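As a rough illustration of how a value like 0x5ff06 breaks down, the sketch below splits it into fields. The layout used here is an assumption (upper 16 bits for the component, then a sub-state byte, then a phase byte); the output above only confirms that the whole value decodes to J9VMSTATE_JIT_CODEGEN / RegisterAssigning. `split_vmstate` is a hypothetical helper, not an OpenJ9 API.

```python
def split_vmstate(vmstate: int) -> tuple:
    """Split a vmState word into (major, sub_state, phase).

    Assumed layout, not confirmed by the thread:
      bits 16..31 -> major component
      bits  8..15 -> sub-state
      bits  0..7  -> phase
    """
    major = (vmstate >> 16) & 0xFFFF
    sub_state = (vmstate >> 8) & 0xFF
    phase = vmstate & 0xFF
    return major, sub_state, phase


major, sub_state, phase = split_vmstate(0x0005FF06)
print(f"major=0x{major:x} sub_state=0x{sub_state:x} phase=0x{phase:x}")
# major=0x5 sub_state=0xff phase=0x6
```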
Only three changes in the above diff are Z-related and could affect this area. My change affects argument mapping, which happens right before binary encoding. That leaves the other two changes as the possible culprits. Given that this is a blocker on Z, we'll need to investigate quickly.
@r30shah could you take point on this one and figure out which change is responsible?
The method always seems to be the same across all the jobs I've looked at:
MLT stderr Method_being_compiled=java/util/TimSort.mergeHi(IIII)V
@fjeremic one crash in the functional tests is different. In particular see https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.functional_s390x_linux_xl_Nightly_testList_0/7
Type=Segmentation error vmState=0x0005ff06
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=000003FFB1A3AA30 Handler2=000003FFB1820E00 InaccessibleAddress=0000000000000000
gpr0=0000000000000001 gpr1=000003FF1A67E000 gpr2=000003FF1A6838B0 gpr3=000003FF1A682B30
gpr4=0000000000000000 gpr5=000003FF1A6848D0 gpr6=0000000000000000 gpr7=0480000000000000
gpr8=0000000000000000 gpr9=000003FF1A6848D0 gpr10=000003FF1A6848D0 gpr11=000003FF1A682B30
gpr12=000003FFB289B000 gpr13=000003FF1AA53F40 gpr14=000003FFABC1ADDA gpr15=000003FF1A677C80
psw=000003FFABC0E24C mask=0705000180000000 fpc=00080000 bea=000003FFABC1ADD4
fpr0 401d37124cea4cdf (f: 1290423552.000000, d: 7.303781e+00)
fpr1 3fcd0e0968f0518e (f: 1760579968.000000, d: 2.269909e-01)
fpr2 bfd37124cea4cded (f: 3466907136.000000, d: -3.037807e-01)
fpr3 3fd99a0a735bcfeb (f: 1935396864.000000, d: 4.000269e-01)
fpr4 3f7e3ed2e1f5ce32 (f: 3790982656.000000, d: 7.384132e-03)
fpr5 3fcc7260a3768777 (f: 2742454016.000000, d: 2.222405e-01)
fpr6 4018000000000000 (f: 0.000000, d: 6.000000e+00)
fpr7 3fe5559de13a0d76 (f: 3778678016.000000, d: 6.667013e-01)
fpr8 0000000000100000 (f: 1048576.000000, d: 5.180654e-318)
fpr9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
fpr10 000003ff1a57e000 (f: 441966592.000000, d: 2.171020e-311)
fpr11 000003ffdb27f148 (f: 3676827904.000000, d: 2.172618e-311)
fpr12 000003fffc7f9864 (f: 4236220416.000000, d: 2.172895e-311)
fpr13 000003ffe477b0d8 (f: 3833049344.000000, d: 2.172695e-311)
fpr14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
fpr15 000003ffc787edf8 (f: 3347574272.000000, d: 2.172456e-311)
Module=/home/jenkins/workspace/Test_openjdk8_j9_sanity.functional_s390x_linux_xl_Nightly_testList_0/openjdkbinary/j2sdk-image/jre/lib/s390x/default/libj9jit29.so
Module_base_address=000003FFAB280000
Method_being_compiled=j9vm/test/arrayCopy/ArrayCopyTest.testXtraLargeReferenceArrayCopyBackward(I)V
Target=2_90_20200601_189 (Linux 4.4.0-170-generic)
CPU=s390x (4 logical CPUs) (0x1f723a000 RAM)
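The signal fields in the dump above are printed in hex, so they are easy to misread. A minimal sketch of decoding them follows; `parse_fields` is a hypothetical helper, and mapping the number to a name relies on the platform's signal numbering (11 is SIGSEGV on Linux).

```python
import signal


def parse_fields(line: str) -> dict:
    """Parse space-separated KEY=HEX tokens into a dict of integers."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not KEY=VALUE
            fields[key] = int(value, 16)
    return fields


line = ("J9Generic_Signal_Number=00000018 Signal_Number=0000000b "
        "Error_Value=00000000 Signal_Code=00000001")
fields = parse_fields(line)
signal_name = signal.Signals(fields["Signal_Number"]).name
print(fields["Signal_Number"], signal_name)
```

Together with InaccessibleAddress=0000000000000000, this reads as a segmentation fault on a null address.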
@r30shah I looked into a jitdump from one of the failures [1] and here is what I see of the instructions dumped from the failing state:
[ 0x3ff0afe5a10] LGR GPR14,GPR12
[ 0x3ff0afe5930] LG GPR12, Auto[<#SPILL8_980 0x3ff0afd25f0>] ?+0(GPR5) #/* spilled for instanceof */
[ 0x3ff0afe57d0] LG GPR4, Auto[<#SPILL8_981 0x3ff0afd2ad0>] ?+0(GPR5) #/* spilled for instanceof */
[ 0x3ff0afe5670] LR GPR3,GPR7
[ 0x3ff0afe5590] LGR GPR7,GPR2
[ 0x3ff0afe54b0] LR GPR2,GPR9
[ 0x3ff0afe53d0] LG GPR9, Auto[<#SPILL8_975 0x3ff0afcdb60>] ?+0(GPR5) #/* spilled for instanceof */
[ 0x3ff0afe5270] LG GPR1, Auto[<#SPILL8_982 0x3ff0afd2d10>] ?+0(GPR5) #/* spilled for instanceof */
[ 0x3ff0afe5110] L GPR0, Auto[<#SPILL8_976 0x3ff0afcdda0>] ?+0(GPR5) #/* spilled for instanceof */
[ 0x3ff0ae41800] LGR GPR6,GPR7
[ 0x3ff0ae41ac0] Label [ 0x3ff0ae41a60]: # (Start of internal control flow)
[ 0x3ff0ae41c10] CGIJ GPR6,Label [ 0x3ff0ae41bb0],1,BH(mask=0x8),
[ 0x3ff0ae41d10] OILL GPR10,0x1
[ 0x3ff0ae41e00] Label [ 0x3ff0ae41bb0]:
[ 0x3ff0ae42020] LLGF GPR7,#561 0(GPR12)
[ 0x3ff0ae421c0] STPQ GPR10,#562 0(GPR7,GPR12)
[ 0x3ff0ae422a0] AGHI GPR7,0x10
[ 0x3ff0ae423f0] CIJ GPR7,Label [ 0x3ff0ae42390],80,BNH(mask=0x6),
[ 0x3ff0ae424f0] LGHI GPR7,0x10
[ 0x3ff0ae425e0] Label [ 0x3ff0ae42390]:
[ 0x3ff0ae42790] ST GPR7,#563 0(GPR12)
[ 0x3ff0ae42e90] ASSOCREGS
[ 0x3ff0ae428d0] Label [ 0x3ff0ae42870]: # (End of internal control flow)
POST:
{NoReg:GPR_ 0x3ff0ae3b7e0:R} {NoReg:GPR_ 0x3ff0ae31c00:R} {NoReg:GPR_ 0x3ff0ae3b7e0:GPR_ 0x3ff0ae31c00:R} {GPR6:GPR_ 0x3ff0ae3b390:R} {GPR12:GPR_ 0x3ff0ae3c5f0:R} {GPR7:GPR_ 0x3ff0ae41ef0:R}
[ 0x3ff0afe4f30] LGR GPR7,GPR10
[ 0x3ff0afe4e50] LGR GPR10,GPR4
[ 0x3ff0afe4d70] LGR GPR4,GPR11
[ 0x3ff0afe4c90] LR GPR11,GPR0
[ 0x3ff0ae42f60] BRC NOP(0xf), Label [ 0x3ff0ae3b4f0]
</instructions>
This looks related to the instanceof changes in #9517. It makes sense that these failures are only on the 64-bit non-compressedrefs JVM, which is the only one generating STPQ instructions. I suspect an issue with register pairs, which will need a closer look.
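For context, STPQ stores a quadword from an even-odd pair of general registers, with the first operand naming the even register, so `STPQ GPR10,...` in the dump implies the pair GPR10:GPR11. The sketch below only illustrates that architectural constraint; `is_even_odd_pair` is a hypothetical helper, not the register assigner's logic, and it makes no claim about the actual bug.

```python
def is_even_odd_pair(high_reg: int, low_reg: int) -> bool:
    """Return True if (high_reg, low_reg) is a valid even-odd GPR pair.

    The high register must be even (0, 2, ..., 14) and the low
    register must be the next consecutive register.
    """
    return 0 <= high_reg <= 14 and high_reg % 2 == 0 and low_reg == high_reg + 1


print(is_even_odd_pair(10, 11))  # True: the pair implied by STPQ GPR10
print(is_even_odd_pair(7, 12))   # False: odd high register, not consecutive
```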
@fjeremic Thanks a lot for narrowing it down to this. I still have access to the machine used for the open build, so I am trying to reproduce the issue there and see what is wrong. I think I am doing something odd or incorrect with the register pairs used for STPQ instructions.
I looked into why jitdump recompilations were not reproducing the issue, and it is likely because of #9137. I see OSR trees getting generated on the recompilation but not on the initial compilation. I have a hacky test fix for this, which I will try in order to see whether we can get a proper jitdump recompilation crash.
@fjeremic I have reproduced this manually. The instruction it is failing on is the outlined label for instanceOf. I am looking into the core dump to see what is wrong.
I have figured out the issue and verified the fix with the test locally. I am now testing the fix against a broader set of test cases and will open a PR once that finishes.
This one is also different. https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_extended.system_s390x_linux_xl_Nightly/180/ SC_Softmx_Increase_0
Type=Segmentation error vmState=0x0005ff06
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=000000a8 Signal_Code=00000001
Handler1=000003FF7DA3AAF8 Handler2=000003FF7D820E00 InaccessibleAddress=0000000000000000
gpr0=0000000000000001 gpr1=000003FE86EBB000 gpr2=000003FE86EC0C90 gpr3=000003FE86EBFF10
gpr4=0000000000000000 gpr5=000003FE86EC1CB0 gpr6=0000000000000000 gpr7=0480000000000000
gpr8=0000000000000000 gpr9=000003FE86EC1CB0 gpr10=000003FE86EC1CB0 gpr11=000003FE86EBFF10
gpr12=000003FF7E899000 gpr13=000003FE87AB41A0 gpr14=000003FF7CF9ADDA gpr15=000003FEDCAF7C78
psw=000003FF7CF8E24C mask=0705000180000000 fpc=0088fe00 bea=000003FF7CF9ADD4
fpr0 3e1249254cea4cdf (f: 1290423552.000000, d: 1.064369e-09)
fpr1 3fbada67d508f378 (f: 3574133504.000000, d: 1.048951e-01)
fpr2 3e4ccccdcea4cded (f: 3466907136.000000, d: 1.341105e-08)
fpr3 3fd99a0a735bcfeb (f: 1935396864.000000, d: 4.000269e-01)
fpr4 3f7e3ed2e1f5ce32 (f: 3790982656.000000, d: 7.384132e-03)
fpr5 3fcc7260a3768777 (f: 2742454016.000000, d: 2.222405e-01)
fpr6 3fcd0e0968f0518e (f: 1760579968.000000, d: 2.269909e-01)
fpr7 3f8688bb62360e8c (f: 1647709824.000000, d: 1.100298e-02)
fpr8 000000001feae1a0 (f: 535486880.000000, d: 2.645657e-315)
fpr9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
fpr10 000000001fead990 (f: 535484800.000000, d: 2.645647e-315)
fpr11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
fpr12 000000001feae1a8 (f: 535486880.000000, d: 2.645657e-315)
fpr13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
fpr14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
fpr15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
Module=/home/jenkins/workspace/Test_openjdk11_j9_extended.system_s390x_linux_xl_Nightly/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so
Module_base_address=000003FF7C600000
Method_being_compiled=java/lang/invoke/StringConcatFactory$Recipe.<init>(Ljava/lang/String;[Ljava/lang/Object;)V
Target=2_90_20200602_186 (Linux 3.10.0-1062.18.1.el7.s390x)
CPU=s390x (4 logical CPUs) (0x1ec1c5000 RAM)
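One quick way to compare this dump with the earlier ArrayCopyTest one is to subtract Module_base_address from the psw value in each. The sketch below does that with the values copied from the two dumps; `module_offset` is a hypothetical helper, and treating psw as the failing instruction address inside libj9jit29.so is an assumption.

```python
def module_offset(address_hex: str, module_base_hex: str) -> int:
    """Return the offset of an address from its module's load base."""
    address = int(address_hex, 16)
    base = int(module_base_hex, 16)
    if address < base:
        raise ValueError("address lies below the module base")
    return address - base


# (psw, Module_base_address) pairs copied from the two crash dumps.
crashes = {
    "ArrayCopyTest (JDK8)": ("000003FFABC0E24C", "000003FFAB280000"),
    "StringConcatFactory$Recipe (JDK11)": ("000003FF7CF8E24C", "000003FF7C600000"),
}

offsets = {name: module_offset(psw, base) for name, (psw, base) in crashes.items()}
for name, offset in offsets.items():
    print(f"{name}: libj9jit29.so+0x{offset:x}")
```

Both dumps yield the same offset, 0x98e24c. The two builds differ (JDK8 and JDK11), so this is only a hint that the crashes share a code path, not proof.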
https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_special.system_s390x_linux_xl_Nightly_otherLoadTest/26
ClassLoadingTest_special_6 variation: Mode113 JVM_OPTIONS: -Xgcpolicy:gencon -Xjit:count=0,optlevel=warm,gcOnResolve,rtResolve -Xmn512k -Xnocompressedrefs
ClassLoadingTest_special_8 variation: Mode122 JVM_OPTIONS: -Xgcpolicy:optavgpause -Xjit:count=0,optlevel=warm,gcOnResolve,rtResolve -Xnocompressedrefs
https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_special.system_s390x_linux_xl_Nightly_mauveLoadTest/26
MauveMultiThreadLoadTest_special_6 MauveMultiThreadLoadTest_special_8
https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_special.system_s390x_linux_xl_Nightly_mathLoadTest/26/
MathLoadTest_all_special_6 MathLoadTest_all_special_8
https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_special.system_s390x_linux_xl_Nightly_daaLoadTest/26/
DaaLoadTest_daa1_special_6 DaaLoadTest_daa1_special_8 DaaLoadTest_daa2_special_6 DaaLoadTest_daa2_special_8 DaaLoadTest_daa3_special_6 DaaLoadTest_daa3_special_8 DaaLoadTest_all_special_6 DaaLoadTest_all_special_8
JDK8 builds as well https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_special.system_s390x_linux_xl_Nightly/383/
https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.functional_s390x_linux_xl_Nightly_testList_0/7 https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_sanity.functional_s390x_linux_xl_Nightly_testList_1/7