eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 721 forks source link

Mac crash on osx1011-x86-2 in array copy test #9782

Open pshipton opened 4 years ago

pshipton commented 4 years ago

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_sanity.functional_x86-64_mac_uma_Nightly_testList_0/5 TestArrayCopy_3 osx1011-x86-2

TestArrayCopy_3 Start Time: Wed Jun  3 02:20:17 2020 Epoch Time (ms): 1591165217431
variation: Mode619
JVM_OPTIONS:  -Xcompressedrefs -Xgcpolicy:optavgpause -Xjit 
[IncludeExcludeTestAnnotationTransformer] [INFO] exclude file is /Users/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_mac_uma_Nightly_testList_0/openjdk-tests/TKG/../TestConfig/resources/excludes/latest_exclude_11.txt
...
... TestNG 6.14.2 by Cédric Beust (cedric@beust.com)
...

Unhandled exception
Type=Segmentation error vmState=0x00000000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=000000000B69E5B0 Handler2=000000000AC98480 InaccessibleAddress=00000000000003A8
RDI=00000000B007F918 RSI=000000000D5FC218 RAX=00000000000003A8 RBX=00000000300B9014
RCX=0000000030787720 RDX=0000000000000001 R8=000000000D5FC218 R9=00000000C00DB2F2
R10=00000000B007F920 R11=00000000B007F910 R12=0000000000000000 R13=000000000D534E60
R14=00007F943B009698 R15=000000000D534B00
RIP=000000000B62FC67 GS=0000 FS=0000 RSP=00000000B007F670
RFlags=0000000000010646 CS=002B RBP=00000000B007F8F0 ERR=000003A800000004
TRAPNO=000000040000000E CPU=03A8000000040000 FAULTVADDR=00000000000003A8
XMM0 0000000000000003 (f: 3.000000, d: 1.482197e-323)
XMM1 00000000b007f2d8 (f: 2953310976.000000, d: 1.459129e-314)
XMM2 000e1781000e1781 (f: 923521.000000, d: 1.959708e-308)
XMM3 747365742f67726f (f: 795308672.000000, d: 8.887822e+252)
XMM4 4570696b532f676e (f: 1395615616.000000, d: 3.174503e+26)
XMM5 00000000df000000 (f: 3741319168.000000, d: 1.848457e-314)
XMM6 3fdd83e7258a2f3e (f: 629813056.000000, d: 4.611757e-01)
XMM7 4023687a9f1af2b1 (f: 2669343488.000000, d: 9.704061e+00)
XMM8 0000000000000003 (f: 3.000000, d: 1.482197e-323)
XMM9 63008d6a2f9976d9 (f: 798586560.000000, d: 7.808518e+168)
XMM10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
XMM11 ca62c1d6ca62c1d6 (f: 3395469824.000000, d: -2.193092e+50)
XMM12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
XMM13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
XMM14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
XMM15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
Module=/Users/jenkins/workspace/Test_openjdk11_j9_sanity.functional_x86-64_mac_uma_Nightly_testList_0/openjdkbinary/j2sdk-image/lib/compressedrefs/libj9vm29.dylib
Module_base_address=000000000B600000 Symbol=_ZN32VM_BytecodeInterpreterCompressed3runEP10J9VMThread
Symbol_address=000000000B6102C0
Target=2_90_20200602_6 (Mac OS X 10.11.6)
CPU=amd64 (4 logical CPUs) (0x200000000 RAM)
----------- Stack Backtrace -----------
---------------------------------------
pshipton commented 4 years ago

@DanHeidinga fyi

DanHeidinga commented 4 years ago

@gacholio can you take a look at this?

gacholio commented 4 years ago

Crash is here:

    frame #15: 0x000000000b62fc68 libj9vm29.dylib`VM_BytecodeInterpreterCompressed::run(J9VMThread*) [inlined] VM_BytecodeInterpreterCompressed::invokevirtualLogic(this=0x00000000b007f908, fromBytecode=true) at BytecodeInterpreter.hpp:6795:18 [opt]
    frame #16: 0x000000000b62fc0c libj9vm29.dylib`VM_BytecodeInterpreterCompressed::run(J9VMThread*) [inlined] VM_BytecodeInterpreterCompressed::invokevirtual(unsigned long*&, unsigned char*&) at BytecodeInterpreter.hpp:6807 [opt]
    frame #17: 0x000000000b62fc0c libj9vm29.dylib`VM_BytecodeInterpreterCompressed::run(this=0x00000000b007f908, vmThread=<unavailable>) at BytecodeInterpreter.hpp:10263 [opt]

Assuming we just recently entered the interpreter, the reason is:

    0x118: UDATA returnValue = 0x0000000000000011 (17)

#define J9_BCLOOP_RETURN_FROM_JIT 0x11

So it seems likely a bad object was returned from a JIT call.

gacholio commented 4 years ago

The stack is not useful (vmThread roots have not been updated). Looking through the registers, R9 maps to a bytecode PC:


!findmethodfrompc 0x00000000C00DB2F2
 Searching for PC=0x00000000C00DB2F2 in VM=0x00007F943C021620...
!j9method 0x000000000D534DC0 java/lang/Object.toString()Ljava/lang/String;
Bytecode PC offset = 0x000000000000000E
> !bytecodes 0x000000000D534DC0
  Name: toString
  Signature: ()Ljava/lang/String;
  Access Flags (50001): public 
  Max Stack: 2
  Argument Count: 1
  Temp Count: 0

    0 newdup 3 java/lang/StringBuilder
    3 dup 
    4 invokespecial 4 java/lang/StringBuilder.<init>()V
    7 aload0 
    8 invokespecial 5 java/lang/Object.getClass()Ljava/lang/Class;
   11 invokevirtual 6 java/lang/Class.getName()Ljava/lang/String;
   14 invokevirtual 7 java/lang/StringBuilder.append(Ljava/lang/String;)Ljava/lang/StringBuilder;
   17 bipush 64
   19 invokevirtual 8 java/lang/StringBuilder.append(C)Ljava/lang/StringBuilder;
   22 aload0 
   23 invokevirtual 9 java/lang/Object.hashCode()I
   26 invokestatic 10 java/lang/Integer.toHexString(I)Ljava/lang/String;
   29 invokevirtual 7 java/lang/StringBuilder.append(Ljava/lang/String;)Ljava/lang/StringBuilder;
   32 invokevirtual 11 java/lang/StringBuilder.toString()Ljava/lang/String;
   35 return1 

  Debug Info:
    Line Number Table (1):
      Line:   164 PC:     0

    Variables (1):
      Slot: 0
      Visibility Start: 0
      Visibility End: 36
      Visibility Length: 36
      Name: this
      Signature: Ljava/lang/Object;
      Generic Signature: None
gacholio commented 4 years ago

My best guess at the stack (from the registers and hand-examining the memory) is:

> !stackslots 0xd51d700,0x0D5FC218,0x0D5FC240,0x00000000C00DB2F2,0x000000000D534DC0
<d51d700> *** BEGIN STACK WALK, flags = 00400001 walkThread = 0x000000000D51D700 ***
<d51d700>   ITERATE_O_SLOTS
<d51d700>   RECORD_BYTECODE_PC_OFFSET
<d51d700> Initial values: walkSP = 0x000000000D5FC218, PC = 0x00000000C00DB2F2, literals = 0x000000000D534DC0, A0 = 0x000000000D5FC240, j2iFrame = 0x0000000000000000, ELS = 0x00000000B007FAD8, decomp = 0x0000000000000000
<d51d700> Bytecode frame: bp = 0x000000000D5FC238, sp = 0x000000000D5FC218, pc = 0x00000000C00DB2F2, cp = 0x000000000D534E60, arg0EA = 0x000000000D5FC240, flags = 0x0000000000000000
<d51d700>   Method: java/lang/Object.toString()Ljava/lang/String; !j9method 0x000000000D534DC0
<d51d700>   Bytecode index = 14
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC240 for 0x0000000000000001 slots
<d51d700>       O-Slot: a0[0x000000000D5FC240] = 0x0000000030685868
<d51d700>   Pending stack starting at 0x000000000D5FC220 for UDATA(0x0000000000000002) slots
<d51d700>       O-Slot: p0[0x000000000D5FC220] = 0x0000000030787720
<d51d700>       O-Slot: p1[0x000000000D5FC218] = 0x00000000300B9050
<d51d700> Bytecode frame: bp = 0x000000000D5FC258, sp = 0x000000000D5FC248, pc = 0x000000002624FDD3, cp = 0x000000000D87DB90, arg0EA = 0x000000000D5FC2B8, flags = 0x0000000000000000
<d51d700>   Method: org/testng/internal/TestResult.init(Lorg/testng/IClass;Ljava/lang/Object;Lorg/testng/ITestNGMethod;Ljava/lang/Throwable;JJLorg/testng/ITestContext;)V !j9method 0x000000000D87E2E8
<d51d700>   Bytecode index = 143
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC2B8 for 0x000000000000000C slots
<d51d700>       O-Slot: a0[0x000000000D5FC2B8] = 0x0000000030787690
<d51d700>       I-Slot: a1[0x000000000D5FC2B0] = 0x00000000301833D8
<d51d700>       I-Slot: a2[0x000000000D5FC2A8] = 0x0000000030685868
<d51d700>       I-Slot: a3[0x000000000D5FC2A0] = 0x00000000306CD228
<d51d700>       I-Slot: a4[0x000000000D5FC298] = 0x0000000030787580
<d51d700>       I-Slot: a5[0x000000000D5FC290] = 0x000000000D87E2C8
<d51d700>       I-Slot: a6[0x000000000D5FC288] = 0x0000017278D7A005
<d51d700>       I-Slot: a7[0x000000000D5FC280] = 0x00000000307876F8
<d51d700>       I-Slot: a8[0x000000000D5FC278] = 0x0000017278D7F65D
<d51d700>       I-Slot: a9[0x000000000D5FC270] = 0x000000003016EED0
<d51d700>       I-Slot: t10[0x000000000D5FC268] = 0x00000000307876F8
<d51d700>       I-Slot: t11[0x000000000D5FC260] = 0x00000000307876F8
<d51d700> Bytecode frame: bp = 0x000000000D5FC2D0, sp = 0x000000000D5FC2C0, pc = 0x000000002624FD0E, cp = 0x000000000D87DB90, arg0EA = 0x000000000D5FC320, flags = 0x0000000000000000
<d51d700>   Method: org/testng/internal/TestResult.<init>(Lorg/testng/IClass;Ljava/lang/Object;Lorg/testng/ITestNGMethod;Ljava/lang/Throwable;JJLorg/testng/ITestContext;)V !j9method 0x000000000D87E2C8
<d51d700>   Bytecode index = 70
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC320 for 0x000000000000000A slots
<d51d700>       I-Slot: a0[0x000000000D5FC320] = 0x0000000030787690
<d51d700>       I-Slot: a1[0x000000000D5FC318] = 0x00000000301833D8
<d51d700>       I-Slot: a2[0x000000000D5FC310] = 0x0000000030685868
<d51d700>       I-Slot: a3[0x000000000D5FC308] = 0x00000000306CD228
<d51d700>       I-Slot: a4[0x000000000D5FC300] = 0x0000000030787580
<d51d700>       I-Slot: a5[0x000000000D5FC2F8] = 0x00000000301833D8
<d51d700>       I-Slot: a6[0x000000000D5FC2F0] = 0x0000017278D7A005
<d51d700>       I-Slot: a7[0x000000000D5FC2E8] = 0x000000000D84E010
<d51d700>       I-Slot: a8[0x000000000D5FC2E0] = 0x0000017278D7F65D
<d51d700>       I-Slot: a9[0x000000000D5FC2D8] = 0x000000003016EED0
<d51d700> Bytecode frame: bp = 0x000000000D5FC340, sp = 0x000000000D5FC328, pc = 0x000000002621648C, cp = 0x000000000D84C370, arg0EA = 0x000000000D5FC458, flags = 0x0000000000000000
<d51d700>   Method: org/testng/internal/Invoker.invokeTestMethods(Lorg/testng/ITestNGMethod;Lorg/testng/xml/XmlSuite;Ljava/util/Map;Lorg/testng/internal/ConfigurationGroupMethods;Ljava/lang/Object;Lorg/testng/ITestContext;)Ljava/util/List; !j9method 0x000000000D84E010
<d51d700>   Bytecode index = 1,124
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC458 for 0x0000000000000023 slots
<d51d700>       O-Slot: a0[0x000000000D5FC458] = 0x0000000030173178
<d51d700>       O-Slot: a1[0x000000000D5FC450] = 0x00000000306CD228
<d51d700>       O-Slot: a2[0x000000000D5FC448] = 0x0000000030135AA0
<d51d700>       I-Slot: a3[0x000000000D5FC440] = 0x0000000030800CE0
<d51d700>       O-Slot: a4[0x000000000D5FC438] = 0x000000003076B280
<d51d700>       O-Slot: a5[0x000000000D5FC430] = 0x0000000030685868
<d51d700>       O-Slot: a6[0x000000000D5FC428] = 0x000000003016EED0
<d51d700>       I-Slot: t7[0x000000000D5FC420] = 0x0000000000000000
<d51d700>       O-Slot: t8[0x000000000D5FC418] = 0x00000000306D2648
<d51d700>       I-Slot: t9[0x000000000D5FC410] = 0x000000000D87CFC8
<d51d700>       I-Slot: t10[0x000000000D5FC408] = 0x0000000000000000
<d51d700>       I-Slot: t11[0x000000000D5FC400] = 0x0000000000000000
<d51d700>       I-Slot: t12[0x000000000D5FC3F8] = 0x0000000000000000
<d51d700>       O-Slot: t13[0x000000000D5FC3F0] = 0x00000000306D26A0
<d51d700>       O-Slot: t14[0x000000000D5FC3E8] = 0x00000000301833D8
<d51d700>       O-Slot: t15[0x000000000D5FC3E0] = 0x00000000306D2728
<d51d700>       O-Slot: t16[0x000000000D5FC3D8] = 0x00000000306D2740
<d51d700>       O-Slot: t17[0x000000000D5FC3D0] = 0x00000000306D2780
<d51d700>       O-Slot: t18[0x000000000D5FC3C8] = 0x00000000306D27A8
<d51d700>       I-Slot: t19[0x000000000D5FC3C0] = 0x000000000D8715F8
<d51d700>       I-Slot: t20[0x000000000D5FC3B8] = 0x0000017278D7A005
<d51d700>       I-Slot: t21[0x000000000D5FC3B0] = 0x00000000306D27B8
<d51d700>       I-Slot: t22[0x000000000D5FC3A8] = 0x00000000306D27E0
<d51d700>       I-Slot: t23[0x000000000D5FC3A0] = 0x00000000306D2A20
<d51d700>       I-Slot: t24[0x000000000D5FC398] = 0x00000000306D29F8
<d51d700>       I-Slot: t25[0x000000000D5FC390] = 0x0000000000000000
<d51d700>       I-Slot: t26[0x000000000D5FC388] = 0x0000000030787580
<d51d700>       I-Slot: t27[0x000000000D5FC380] = 0x00000000306D2EF0
<d51d700>       I-Slot: t28[0x000000000D5FC378] = 0x00000000306D2F00
<d51d700>       I-Slot: t29[0x000000000D5FC370] = 0x00000000FFFFFFFF
<d51d700>       I-Slot: t30[0x000000000D5FC368] = 0x00000000307CCC08
<d51d700>       I-Slot: t31[0x000000000D5FC360] = 0x00000000307CCC08
<d51d700>       I-Slot: t32[0x000000000D5FC358] = 0x0000000030787580
<d51d700>       I-Slot: t33[0x000000000D5FC350] = 0x0000000000000000
<d51d700>       I-Slot: t34[0x000000000D5FC348] = 0x000000000D8864B8
<d51d700>   Pending stack starting at 0x000000000D5FC328 for UDATA(0x0000000000000001) slots
<d51d700>       O-Slot: p0[0x000000000D5FC328] = 0x0000000030787690
<d51d700> Bytecode frame: bp = 0x000000000D5FC470, sp = 0x000000000D5FC460, pc = 0x000000002624DB71, cp = 0x000000000D87C5A0, arg0EA = 0x000000000D5FC498, flags = 0x0000000000000000
<d51d700>   Method: org/testng/internal/TestMethodWorker.invokeTestMethods(Lorg/testng/ITestNGMethod;Ljava/lang/Object;Lorg/testng/ITestContext;)V !j9method 0x000000000D87CFA8
<d51d700>   Bytecode index = 21
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC498 for 0x0000000000000005 slots
<d51d700>       O-Slot: a0[0x000000000D5FC498] = 0x00000000308083F8
<d51d700>       I-Slot: a1[0x000000000D5FC490] = 0x00000000306CD228
<d51d700>       I-Slot: a2[0x000000000D5FC488] = 0x0000000030685868
<d51d700>       I-Slot: a3[0x000000000D5FC480] = 0x000000003016EED0
<d51d700>       I-Slot: t4[0x000000000D5FC478] = 0x000000002624DAB8
<d51d700> Bytecode frame: bp = 0x000000000D5FC4B0, sp = 0x000000000D5FC4A0, pc = 0x000000002624DABF, cp = 0x000000000D87C5A0, arg0EA = 0x000000000D5FC4E0, flags = 0x0000000000000000
<d51d700>   Method: org/testng/internal/TestMethodWorker.run()V !j9method 0x000000000D87CF88
<d51d700>   Bytecode index = 63
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC4E0 for 0x0000000000000006 slots
<d51d700>       O-Slot: a0[0x000000000D5FC4E0] = 0x00000000308083F8
<d51d700>       O-Slot: t1[0x000000000D5FC4D8] = 0x00000000306D25D0
<d51d700>       O-Slot: t2[0x000000000D5FC4D0] = 0x0000000030800BC8
<d51d700>       I-Slot: t3[0x000000000D5FC4C8] = 0x00000000306CD228
<d51d700>       O-Slot: t4[0x000000000D5FC4C0] = 0x00000000301833D8
<d51d700>       I-Slot: t5[0x000000000D5FC4B8] = 0x000000000D845DD8
<d51d700> Bytecode frame: bp = 0x000000000D5FC4F8, sp = 0x000000000D5FC4E8, pc = 0x00000000262087F2, cp = 0x000000000D844320, arg0EA = 0x000000000D5FC540, flags = 0x0000000000000000
<d51d700>   Method: org/testng/TestRunner.privateRun(Lorg/testng/xml/XmlTest;)V !j9method 0x000000000D845DD8
<d51d700>   Bytecode index = 302
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC540 for 0x0000000000000009 slots
<d51d700>       O-Slot: a0[0x000000000D5FC540] = 0x000000003016EED0
<d51d700>       I-Slot: a1[0x000000000D5FC538] = 0x00000000300B2098
<d51d700>       I-Slot: t2[0x000000000D5FC530] = 0x0000000000000000
<d51d700>       I-Slot: t3[0x000000000D5FC528] = 0x0000000000000001
<d51d700>       O-Slot: t4[0x000000000D5FC520] = 0x00000000307DDD08
<d51d700>       O-Slot: t5[0x000000000D5FC518] = 0x00000000307E7D60
<d51d700>       I-Slot: t6[0x000000000D5FC510] = 0x00000000308008E0
<d51d700>       O-Slot: t7[0x000000000D5FC508] = 0x0000000030808568
<d51d700>       I-Slot: t8[0x000000000D5FC500] = 0x00000000308083F8
<d51d700> Bytecode frame: bp = 0x000000000D5FC558, sp = 0x000000000D5FC548, pc = 0x0000000026208561, cp = 0x000000000D844320, arg0EA = 0x000000000D5FC570, flags = 0x0000000000000000
<d51d700>   Method: org/testng/TestRunner.run()V !j9method 0x000000000D845D78
<d51d700>   Bytecode index = 25
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC570 for 0x0000000000000003 slots
<d51d700>       O-Slot: a0[0x000000000D5FC570] = 0x000000003016EED0
<d51d700>       I-Slot: t1[0x000000000D5FC568] = 0x00000000300B2098
<d51d700>       I-Slot: t2[0x000000000D5FC560] = 0x00000000261FBF81
<d51d700> Bytecode frame: bp = 0x000000000D5FC588, sp = 0x000000000D5FC578, pc = 0x00000000261FBF81, cp = 0x000000000D83F2B0, arg0EA = 0x000000000D5FC5A0, flags = 0x0000000000000000
<d51d700>   Method: org/testng/SuiteRunner.runTest(Lorg/testng/TestRunner;)V !j9method 0x000000000D840410
<d51d700>   Bytecode index = 1
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC5A0 for 0x0000000000000003 slots
<d51d700>       O-Slot: a0[0x000000000D5FC5A0] = 0x000000003016C4E0
<d51d700>       O-Slot: a1[0x000000000D5FC598] = 0x000000003016EED0
<d51d700>       I-Slot: t2[0x000000000D5FC590] = 0x000000000D5FC5D2
<d51d700> Bytecode frame: bp = 0x000000000D5FC5B8, sp = 0x000000000D5FC5A8, pc = 0x00000000261FBF3F, cp = 0x000000000D83F2B0, arg0EA = 0x000000000D5FC5D0, flags = 0x0000000000000000
<d51d700>   Method: org/testng/SuiteRunner.runSequentially()V !j9method 0x000000000D8403F0
<d51d700>   Bytecode index = 31
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC5D0 for 0x0000000000000003 slots
<d51d700>       O-Slot: a0[0x000000000D5FC5D0] = 0x000000003016C4E0
<d51d700>       O-Slot: t1[0x000000000D5FC5C8] = 0x00000000307BF120
<d51d700>       I-Slot: t2[0x000000000D5FC5C0] = 0x000000003016EED0
<d51d700> Bytecode frame: bp = 0x000000000D5FC5E8, sp = 0x000000000D5FC5D8, pc = 0x00000000261FBD98, cp = 0x000000000D83F2B0, arg0EA = 0x000000000D5FC638, flags = 0x0000000000000000
<d51d700>   Method: org/testng/SuiteRunner.privateRun()V !j9method 0x000000000D840370
<d51d700>   Bytecode index = 288
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC638 for 0x000000000000000A slots
<d51d700>       O-Slot: a0[0x000000000D5FC638] = 0x000000003016C4E0
<d51d700>       I-Slot: t1[0x000000000D5FC630] = 0x00000000307BEFB0
<d51d700>       O-Slot: t2[0x000000000D5FC628] = 0x00000000307BEFE8
<d51d700>       O-Slot: t3[0x000000000D5FC620] = 0x0000000030173178
<d51d700>       I-Slot: t4[0x000000000D5FC618] = 0x0000000000000000
<d51d700>       I-Slot: t5[0x000000000D5FC610] = 0x000000003016EED0
<d51d700>       I-Slot: t6[0x000000000D5FC608] = 0x0000000030764698
<d51d700>       I-Slot: t7[0x000000000D5FC600] = 0x0000000000000000
<d51d700>       I-Slot: t8[0x000000000D5FC5F8] = 0x0000000000000000
<d51d700>       I-Slot: t9[0x000000000D5FC5F0] = 0x00000000261FB9AA
<d51d700> Bytecode frame: bp = 0x000000000D5FC650, sp = 0x000000000D5FC640, pc = 0x00000000261FBC16, cp = 0x000000000D83F2B0, arg0EA = 0x000000000D5FC660, flags = 0x0000000000000000
<d51d700>   Method: org/testng/SuiteRunner.run()V !j9method 0x000000000D840350
<d51d700>   Bytecode index = 6
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC660 for 0x0000000000000002 slots
<d51d700>       O-Slot: a0[0x000000000D5FC660] = 0x000000003016C4E0
<d51d700>       I-Slot: t1[0x000000000D5FC658] = 0x000000000D5FC6AA
<d51d700> Bytecode frame: bp = 0x000000000D5FC678, sp = 0x000000000D5FC668, pc = 0x00000000262426EF, cp = 0x000000000D876E90, arg0EA = 0x000000000D5FC6A8, flags = 0x0000000000000000
<d51d700>   Method: org/testng/SuiteRunnerWorker.runSuite(Lorg/testng/internal/SuiteRunnerMap;Lorg/testng/xml/XmlSuite;)V !j9method 0x000000000D877610
<d51d700>   Bytecode index = 87
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC6A8 for 0x0000000000000006 slots
<d51d700>       I-Slot: a0[0x000000000D5FC6A8] = 0x00000000307BE528
<d51d700>       O-Slot: a1[0x000000000D5FC6A0] = 0x000000003016B708
<d51d700>       O-Slot: a2[0x000000000D5FC698] = 0x000000003013E098
<d51d700>       I-Slot: t3[0x000000000D5FC690] = 0x000000003016C4E0
<d51d700>       I-Slot: t4[0x000000000D5FC688] = 0x0000000026242804
<d51d700>       I-Slot: t5[0x000000000D5FC680] = 0x000000000D877630
<d51d700> Bytecode frame: bp = 0x000000000D5FC6C0, sp = 0x000000000D5FC6B0, pc = 0x0000000026242804, cp = 0x000000000D876E90, arg0EA = 0x000000000D5FC6C8, flags = 0x0000000000000000
<d51d700>   Method: org/testng/SuiteRunnerWorker.run()V !j9method 0x000000000D877630
<d51d700>   Bytecode index = 12
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC6C8 for 0x0000000000000001 slots
<d51d700>       I-Slot: a0[0x000000000D5FC6C8] = 0x00000000307BE528
<d51d700> Bytecode frame: bp = 0x000000000D5FC6E0, sp = 0x000000000D5FC6D0, pc = 0x0000000025248316, cp = 0x000000000D6A4220, arg0EA = 0x000000000D5FC718, flags = 0x0000000000000000
<d51d700>   Method: org/testng/TestNG.runSuitesSequentially(Lorg/testng/xml/XmlSuite;Lorg/testng/internal/SuiteRunnerMap;ILjava/lang/String;)V !j9method 0x000000000D6A6FD8
<d51d700>   Bytecode index = 66
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC718 for 0x0000000000000007 slots
<d51d700>       I-Slot: a0[0x000000000D5FC718] = 0x00000000300A7160
<d51d700>       I-Slot: a1[0x000000000D5FC710] = 0x000000003013E098
<d51d700>       I-Slot: a2[0x000000000D5FC708] = 0x000000003016B708
<d51d700>       I-Slot: a3[0x000000000D5FC700] = 0x0000000000000002
<d51d700>       I-Slot: a4[0x000000000D5FC6F8] = 0x00000000300A5738
<d51d700>       I-Slot: t5[0x000000000D5FC6F0] = 0x00000000307BE528
<d51d700>       I-Slot: t6[0x000000000D5FC6E8] = 0x00000000252480D4
<d51d700> Bytecode frame: bp = 0x000000000D5FC730, sp = 0x000000000D5FC720, pc = 0x00000000252480D4, cp = 0x000000000D6A4220, arg0EA = 0x000000000D5FC760, flags = 0x0000000000000000
<d51d700>   Method: org/testng/TestNG.runSuitesLocally()Ljava/util/List; !j9method 0x000000000D6A6F78
<d51d700>   Bytecode index = 156
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC760 for 0x0000000000000006 slots
<d51d700>       O-Slot: a0[0x000000000D5FC760] = 0x00000000300A7160
<d51d700>       O-Slot: t1[0x000000000D5FC758] = 0x000000003016B708
<d51d700>       O-Slot: t2[0x000000000D5FC750] = 0x00000000307B5EF0
<d51d700>       I-Slot: t3[0x000000000D5FC748] = 0x000000003013E098
<d51d700>       I-Slot: t4[0x000000000D5FC740] = 0x0000000000000000
<d51d700>       I-Slot: t5[0x000000000D5FC738] = 0x000000000D6A6EB8
<d51d700> Bytecode frame: bp = 0x000000000D5FC778, sp = 0x000000000D5FC768, pc = 0x0000000025247D51, cp = 0x000000000D6A4220, arg0EA = 0x000000000D5FC780, flags = 0x0000000000000000
<d51d700>   Method: org/testng/TestNG.runSuites()Ljava/util/List; !j9method 0x000000000D6A6E98
<d51d700>   Bytecode index = 1
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC780 for 0x0000000000000001 slots
<d51d700>       I-Slot: a0[0x000000000D5FC780] = 0x000000003014A918
<d51d700> Bytecode frame: bp = 0x000000000D5FC798, sp = 0x000000000D5FC788, pc = 0x0000000025247CBD, cp = 0x000000000D6A4220, arg0EA = 0x000000000D5FC7A8, flags = 0x0000000000000000
<d51d700>   Method: org/testng/TestNG.run()V !j9method 0x000000000D6A6E78
<d51d700>   Bytecode index = 25
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC7A8 for 0x0000000000000002 slots
<d51d700>       O-Slot: a0[0x000000000D5FC7A8] = 0x00000000300A7160
<d51d700>       I-Slot: t1[0x000000000D5FC7A0] = 0x000000000D5FC7E2
<d51d700> Bytecode frame: bp = 0x000000000D5FC7C0, sp = 0x000000000D5FC7B0, pc = 0x0000000025248767, cp = 0x000000000D6A4220, arg0EA = 0x000000000D5FC7E0, flags = 0x0000000000000000
<d51d700>   Method: org/testng/TestNG.privateMain([Ljava/lang/String;Lorg/testng/ITestListener;)Lorg/testng/TestNG; !j9method 0x000000000D6A7098
<d51d700>   Bytecode index = 59
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC7E0 for 0x0000000000000004 slots
<d51d700>       I-Slot: a0[0x000000000D5FC7E0] = 0x000000003012AB80
<d51d700>       I-Slot: a1[0x000000000D5FC7D8] = 0x0000000000000000
<d51d700>       O-Slot: t2[0x000000000D5FC7D0] = 0x00000000300A7160
<d51d700>       I-Slot: t3[0x000000000D5FC7C8] = 0x000000003020F7E8
<d51d700> Bytecode frame: bp = 0x000000000D5FC7F8, sp = 0x000000000D5FC7E8, pc = 0x00000000252486F6, cp = 0x000000000D6A4220, arg0EA = 0x000000000D5FC808, flags = 0x0000000000000000
<d51d700>   Method: org/testng/TestNG.main([Ljava/lang/String;)V !j9method 0x000000000D6A7078
<d51d700>   Bytecode index = 2
<d51d700>   Using local mapper
<d51d700>   Locals starting at 0x000000000D5FC808 for 0x0000000000000002 slots
<d51d700>       I-Slot: a0[0x000000000D5FC808] = 0x000000003012AB80
<d51d700>       I-Slot: t1[0x000000000D5FC800] = 0x000000000D5FC828
<d51d700> JNI call-in frame: bp = 0x000000000D5FC830, sp = 0x000000000D5FC810, pc = 0x000000000B7E0480, cp = 0x0000000000000000, arg0EA = 0x000000000D5FC830, flags = 0x0000000000020000
<d51d700>   New ELS = 0x0000000000000000
<d51d700> JNI native method frame: bp = 0x000000000D5FC8D8, sp = 0x000000000D5FC838, pc = 0x0000000000000007, cp = 0x0000000000000000, arg0EA = 0x000000000D5FC8D8, flags = 0x0000000000000010
<d51d700>   Object pushes starting at 0x000000000D5FC838 for 16 slots
<d51d700>       Push[0x000000000D5FC838] = 0x0000000030000000
<d51d700>       Push[0x000000000D5FC840] = 0x000000003009E9C0
<d51d700>       Push[0x000000000D5FC848] = 0x0000000000000000
<d51d700>       Push[0x000000000D5FC850] = 0x0000000000000000
<d51d700>       Push[0x000000000D5FC858] = 0x00000000300A3368
<d51d700>       Push[0x000000000D5FC860] = 0x0000000030001818
<d51d700>       Push[0x000000000D5FC868] = 0x000000003009E968
<d51d700>       Push[0x000000000D5FC870] = 0x000000003009E968
<d51d700>       Push[0x000000000D5FC878] = 0x00000000300771E0
<d51d700>       Push[0x000000000D5FC880] = 0x0000000000000000
<d51d700>       Push[0x000000000D5FC888] = 0x000000003007BB00
<d51d700>       Push[0x000000000D5FC890] = 0x0000000030008DE8
<d51d700>       Push[0x000000000D5FC898] = 0x0000000030008D38
<d51d700>       Push[0x000000000D5FC8A0] = 0x00000000300010E0
<d51d700>       Push[0x000000000D5FC8A8] = 0x0000000030003F20
<d51d700>       Push[0x000000000D5FC8B0] = 0x0000000030000000
<d51d700> <end of stack>
<d51d700> *** END STACK WALK (rc = NONE) ***
gacholio commented 4 years ago
> !j9object 0x0000000030685868 <--- this
!J9Object 0x0000000030685868 {
    struct J9Class* clazz = !j9class 0xD818100 // j9vm/test/arrayCopy/ArrayCopyTest
    Object flags = 0x00000008;
    I lockword = 0x00000000 (offset = 0) (java/lang/Object) <hidden>
}
> !j9object 0x00000000300B9050 <--- this.getName()
J9VMJavaLangString at 0x00000000300B9050 {
struct J9Class* clazz = !j9class 0xD541400 // java/lang/String
Object flags = 0x00000000;
[B value = !fj9object 0x300b2198 (offset = 0) (java/lang/String)
B coder = 0x00000001 (offset = 4) (java/lang/String)
I hashCode = 0x1E58820C (offset = 8) (java/lang/String)
"j9vm.test.arrayCopy.ArrayCopyTest"
}
> !j9object 0x0000000030787720 <--- StringBuilder
Unable to read object clazz at 0x0000000030787720 (clazz = (null))

So the String argument is fine, but the StringBuilder appears to be trash.

gacholio commented 4 years ago

I suspect jitted AbstractStringBuilder.init(I)V returned the bad object:

> !j9method 0x000000000D5AC0F8
J9Method at 0xd5ac0f8 {
  Fields for J9Method:
    0x0: U8* bytecodes = !j9x 0x00000000C0198AF4
    0x8: struct J9ConstantPool* constantPool = !j9constantpool 0x000000000D5AD7A0 (flags = 0x0)
    0x10: void* methodRunAddress = !j9x 0x0000000000000006
    0x18: void* extra = !j9x 0x0000000000000783
}
Signature: java/lang/StringBuilder.<init>()V !bytecodes 0x000000000D5AC0F8
ROM Method: !j9rommethod 0x00000000C0198AE0
Next Method: !j9method 0x000000000D5AC118
> !bytecodes 0x000000000D5AC0F8
  Name: <init>
  Signature: ()V
  Access Flags (21040001): public 
  Max Stack: 2
  Argument Count: 1
  Temp Count: 0

    0 aload0 
    1 bipush 16
    3 invokespecial 1 java/lang/AbstractStringBuilder.<init>(I)V
    6 returnFromConstructor 

> !j9method 0x000000000D5AA760
J9Method at 0xd5aa760 {
  Fields for J9Method:
    0x0: U8* bytecodes = !j9x 0x00000000C019AA88
    0x8: struct J9ConstantPool* constantPool = !j9constantpool 0x000000000D5AB140 (flags = 0x0)
    0x10: void* methodRunAddress = !j9x 0x0000000000000018
    0x18: void* extra = !j9x 0x00000000B00ADC80
}
Signature: java/lang/AbstractStringBuilder.<init>(I)V !bytecodes 0x000000000D5AA760
ROM Method: !j9rommethod 0x00000000C019AA74
Next Method: !j9method 0x000000000D5AA780
gacholio commented 4 years ago

@andrewcraik

andrewcraik commented 4 years ago

@gacholio / @DanHeidinga do you have any failure rate information for this or any history of when it might have started. I agree we should look at it, but I'm not sure if 0.21 is an appropriate target.

@cathyzhyi could you help find someone to look at this? Note the 0.21 tag.

DanHeidinga commented 4 years ago

any history of when it might have started.

It appears to be a new failure we haven't seen before. The first instance I can find is the one in the opening description.

but I'm not sure if 0.21 is an appropriate target.

I added the 0.21 tag to ensure there would be a conversation about whether this is stop-ship or not given it appears to be a newly introduced (exposed?) issue

andrewcraik commented 4 years ago

@liqunl could you have a sniff?

liqunl commented 4 years ago

I looked at the disassembly of java/lang/AbstractStringBuilder.<init>(I)V, haven't found anything suspicious. I have also tried to reproduce the failure on my laptop, but haven't seen one in 50 runs. I'll try if I can reproduce it with grinders.

liqunl commented 4 years ago

I've tried another 50 runs on my laptop, no reproduce. I'm trying to run grinders on internal jenkins, but I can't find a Mac uma build. @pshipton What is a Mac uma build? How is it different from Mac build?

pshipton commented 4 years ago

A Mac uma build is using the old (uma) makefile system, any build before May 28th used this. Since https://github.com/ibmruntimes/openj9-openjdk-jdk11/pull/308 was merged, the jdk11 Mac builds use the cmake build system, which is supposed to generate the same binaries. We continue to run a nightly build using the old uma makefile system in case any problems occur, we can compare between the two builds. Likely the problem isn't particular to the uma build, it's just intermittent and happened in that build. We'll be switching the xlinux cmake jdk11 build to cmake next, ideally before the end of next week.

liqunl commented 4 years ago

I tried 100 jobs on internal grinder, still can't reproduce the issue. So far the crash is not reproducible in 200 jobs.

liqunl commented 4 years ago

The StringBuilder object seems to be corrupted by array zero initialization instruction in the jitted code. The zero initialization is accomplished by rep stosb and it seems to be doing backward store in this case.

The bad StringBuilder object has everything right except the header. The header is zero. However, the header can’t be zero when passed to the jitted code, as the jitted code read a flag on the object's j9class to check if the object is finalizable.

I double checked the jitted code to see if it’s possible that we can write to the StringBuilder object. Notice that the jitted code will create a byte array and assign it to StringBuilder.value. Both the StringBuilder and the byte array are in TLH and they’re next to each other with a few padding bytes in between. The byte array is supposed to be zero initialized and remain so before we crash. But not all elements are zero.

> !j9x 0x0000000030787720,32,4,1  
0x30787720 :  00000000 [ .... ]. <— StringBuilder object
0x30787724 :  00000000 [ .... ] 
0x30787728 :  30787738 [ 8wx0 ]  <— value field, byte array of 32 elements
0x3078772C :  00000001 [ .... ]  <— coder field
0x30787730 :  00000000 [ .... ]  <- count field
0x30787734 :  00000000 [ .... ]  <— padding bytes
0x30787738 :  0d53d900 [ ..S. ]  <-- byte array header
0x3078773C :  00000020 [  ... ]  <-- byte array size
0x30787740 :  00000000 [ .... ] 
0x30787744 :  30184100 [ .A.0 ]. <—- non zero elements
0x30787748 :  00000000 [ .... ]                                                                                                                                      
0x3078774C :  30184128 [ (A.0 ]  <—- non zero elements

> !j9object 0x30787738
!J9IndexableObject 0x0000000030787738 {
    struct J9Class* clazz = !j9arrayclass 0xD53D900   // [B
    Object flags = 0x00000000;
    U_32 size = 0x00000020;

So it seems to me that we didn’t initialize the byte array. We didn’t skip zero init the byte array either. The following instruction is used to zero the array. rdi initially points to the first element of the array, the counter rcx is 0x20.

rep stosb byte ptr [rdi], al

I suspect that we did a backward memset. We didn’t clear the direction flag DF before rep stosb, so if the flag is set by previously executed instructions, we may end up doing a backward memset.

The memory content seems to be in line with this theory. rdi is the address of the first element, which is 0x30787740, if we do backward memset for 0x20 bytes, we’ll zero the memory all the way up to 0x30787720, which is the header of StringBuilder object. The byte array header and fields of StringBuilder object are good because that is set after the zeroing.

@andrewcraik @JamesKingdon Do you think this is the right cause of the crash. If so, I think the solution is to clear the DF flag before zeroing.

gacholio commented 4 years ago

I was under the impression that DF is supposed to remain clear at all times (in C at least). It would be a simple matter to clear it on transition to JIT code (and maybe you would need to do it on return from JNI calls).

JamesKingdon commented 4 years ago

@liqunl Great find! It certainly sounds possible. Given GAC's comments I need to check how DF can be set, but I'm leaning towards a conservative fix of making sure it has the correct value before executing the rep sto. @andrewcraik ?

Looks like it can only be changed directly using CLD/STD. I feel a little uneasy about relying on it being in the correct state before the rep sto though.

andrewcraik commented 4 years ago

Looking at the Agner Fog instruction tables https://www.agner.org/optimize/instruction_tables.pdf it seems that CLD/STD are not cheap - latency 4 on skylake/skylakeX and latency 3 on ryzen. I think we need to be a bit careful about spraying CLD etc everywhere.

Now there is some curious looking code in https://github.com/eclipse/openj9/blob/71bf16037a88e1116afdcc7516db040aa5365ab1/runtime/compiler/x/codegen/J9TreeEvaluator.cpp#L1622 Note if it is !node->isForwardArrayCopy() we handle the direction flag and yet seemingly underneath we don't?

We can consider the CLD/STD always solution as a temporary fix but we need to perf evaluate that and also try to figure out how the flag got messed up since protecting around natives and stuff as suggested by @gacholio is probably the better long term solution.

JamesKingdon commented 4 years ago

oh, you mean potentially we set STD for a backward array copy and then never clear it? I'm still very nervous about the idea of relying on it having the right value over long periods of time.

gacholio commented 4 years ago

The rationale is that backwards copy is vastly less likely than forwards, so keeping the DF clear when not doing backwards copy will be a noticeable performance win.

gacholio commented 4 years ago

From what I can see, backwards copy always emits STD/REP/CLD, so it's still a mystery why it would ever be set (assuming I'm right about the C linkage keeping it clear).

JamesKingdon commented 4 years ago

DF can also be set via popf which makes it a little harder to narrow down possible changes. Potentially DF might have been set as an accidental side effect of some other flag manipulation. A very cursory grep of the codebase shows some hits for popf* but I haven't explored further to see what's being done.

gacholio commented 4 years ago

Assuming pushf/pop are matched, that shouldn't be an issue. If there are places using popf on other stacked values, that could certainly cause this (this seems exceedingly unlikely).

JamesKingdon commented 4 years ago

util/sse2enabled.s is doing a flag manipulation, but it looks ok. Other popf instances seem to be simple restore of saved flags, we would have to have mismatched the stacks to get those wrong.

liqunl commented 4 years ago

@JamesKingdon suggested offline, that this problem should be detectable if we check DF flag frequently enough. I'll try adding the check to the jit method entry to see if we can detect it.

liqunl commented 4 years ago

I did personal builds on x86 linux, windows and Mac, with a check on the DF flag in method entry. If the flag is set, the jitted code will crash. All the jobs passed. Also did a 40x grinder on the original failed job, no crash either. The issue seems to be extremely intermittent.

As for the temporary fix, clearing the DF flag before every rep instructions has 2% overhead on daytrader3. If we only clear DF flag after the return of helper calls and JNI calls (assuming VM will clear the flag before calling to jitted code), the overhead is around 0.5% (with variation of 0.6%).

@andrewcraik @JamesKingdon Could you advice on how we should proceed with this issue?

gacholio commented 4 years ago

On the remote possibility that the DF is being set by an interrupt handler, clearing it before the rep tightens the window but doesn't close it completely. If this were the case, I'd expect every app in the world to be crashing randomly.

andrewcraik commented 4 years ago

I don't think we can take the 2% hit on daytrader3 - that is a hit that will show up a lot more for some people and it is an indication that it could be a real performance issue. If we can get away with handling after the return of helper calls and JNI calls with the VM clearing it before calls to JITCODE then I think we should be ok and can get away without the major perf hit.

JamesKingdon commented 4 years ago

That seems like a reasonable compromise given the cost and the difficulty of reproducing the problem. Does the VM already clear DF before calling the JIT or is that a new requirement?

JamesKingdon commented 4 years ago

On the remote possibility that the DF is being set by an interrupt handler, clearing it before the rep tightens the window but doesn't close it completely. If this were the case, I'd expect every app in the world to be crashing randomly.

You sure know how to terrify the service guy :)

gacholio commented 4 years ago

Given that this has never been reproduced, I'd be tempted to ignore it completely for now.

JamesKingdon commented 4 years ago

The symptoms sound like a fit for several cases that went unsolved over the last year or two. I've been looking for over-run from objects before the corrupted one and not finding anything - I hadn't considered the possibility of a backwards rep sto, not least because the fields immediately after the corruption are valid. If this is the problem I'm very keen to get it fixed.

gacholio commented 4 years ago

It's a simple enough matter for the VM to clear the DF on return from any standard JIT helper and on transition into compiled code. The JIT will need to be responsible for clearing it after any C call from the compiled code, including the direct-call JIT helpers and direct JNI (and whatever else you may call).

gacholio commented 4 years ago

Extremely complicated VM fix here:

https://github.com/gacholio/openj9/tree/df

liqunl commented 4 years ago

I find the fix is more complicated than I thought. Regarding clearing the flag after the helper calls, my initial fix only cleared the flag for helper calls that associated with a call node, as there is a designated place to evaluating such node. But we generate a lot of helper calls in other evaluators for nodes that are not call to a helper, such as resolve check, interface calls, field access (helper calls can be generated for field watch). Some helpers are written in assembly and may call to the VM.

I wonder if we should do the flag clearing for all of the helpers, or if there are some of them we can trust. If we need to do the clearing, where is the best place to do so.

gacholio commented 4 years ago

For minimal impact, I would clear it only on transition to compiled code and on return from direct JNI - if we can't trust the C compiler used by the VM to follow the ABI, then we're never going to get this fixed. I suppose we would also need to clear it on return from any helper that might call a JVMTI event.

Having said that, the test that's failing here isn't using any foreign native libraries, so I can't explain how this could occur.

liqunl commented 4 years ago

@andrewcraik @JamesKingdon What are your thoughts? Should we only clear the flag on return from foreign native code and trust VM code? It sounds reasonable to me and we have to trust the C compiler. Although the fix may not fix this issue as the test doesn't use any foreign native libraries as GAC said.

JamesKingdon commented 4 years ago

I'll take whatever we can get for a reasonable cost. I'll keep a close eye on future cases and over time we can see if the problem goes away. It's frustrating that we can't reproduce the problem and confirm the cause, but thank you for having tried.

andrewcraik commented 4 years ago

So I'd say let's not trust the foreign code and let's trust the VM for now. It is possible the bit just flipped because that kind of thing can happen once in a blue moon, but let's monitor the situation and if we see more examples or suspected examples we can get more paranoid. At least not trusting foreign code eliminates one class of issues.

gacholio commented 4 years ago

You can't trust the VM (and hence many JIT helpers) - if we call a bad native, the DF may end up being set (again, we should be crashing all over the place if this were actually occurring).

gacholio commented 4 years ago

libffi does not appear to manually clear the flag on return - I suppose we could do it with some inline assembly in the interpreter.

andrewcraik commented 4 years ago

@JamesKingdon thoughts?

JamesKingdon commented 4 years ago

Rip out rep sto and burn it. If we can't guarantee that it's going to write to the correct memory then I don't think we can use it.

andrewcraik commented 4 years ago

Not an option right now - that is a huge change - there is perf work to look at the sequences, but in the short to medium term we need some kind of halfway point. I think maybe protect ffi and protect the calls we need to protect in the JIT.

liqunl commented 4 years ago

@andrewcraik @JamesKingdon Just to clarify before I make the change. We will trust VM given that VM will clear the flag on return of foreign native call. This means that the JIT only need to clear flag after the return of JNI call and helper that may call JVMTI. Am I right?

@gacholio What helpers can call to JVMTI?

andrewcraik commented 4 years ago

I think before we make a decision we should discuss as a group - I know @DanHeidinga is interested in discussing - let me try to set something up amongst the 5 of us tomorrow so we can get this nailed down and decide what we think the right thing to do is for 0.21 and we can summarize the result here for anyone interested.

liqunl commented 4 years ago

Summary of an offline discussion between James, GAC, Andrew, Dan and me:

Since we're not able to reproduce the crash nor detecting incorrect DF flag, we started to think if it's caused by other things. We have considered other possibilities that may corrupt the object header.

The DF flag bit is flipped by radioactive, or there is a C compiler bug generating bad VM/JIT code that sets the flag, or the header is stomped by other thread.

The first two are unlikely as James has seen several defects with corrupted memory that fits into this pattern (header is zero but all fields look right), and those defects are on different OS.

The third one is ruled out after we took a look at the RFlag in javacore, it's 0x0000000000010646, the DF flag mask is 0x0400, so the flag was indeed set when the crash occurred. Given the short window between the rep stosb instruction and the crash, the flag was set when rep stosb executed. The original theory is confirmed.

However, we still don't know when the flag was set and who set it. Clearing the flag before each rep instruction will likely solve this problem but the performance cost is too large. And we can't clear it on the return of foreign native because we don't know if it's set by foreign native.

The proposal is that we defer this issue to next release, and we'll add diagnostic instructions in the jitted method to detect the situation where DF is incorrectly set and crash the program in an attempt to catch the case in our builds. The diagnostic feature can be turned on by setting an evironment variable and added after this release. @andrewcraik @DanHeidinga

DanHeidinga commented 4 years ago

The proposal is that we defer this issue to next release,

+1

and we'll add diagnostic instructions in the jitted method to detect the situation where DF is incorrectly set and crash the program in an attempt to catch the case in our builds. The diagnostic feature can be turned on by setting an evironment variable and added after this release.

I think this makes a lot of sense. Let's grind the testing with the option enabled and see if we can find another occurrence of this issue.

andrewcraik commented 4 years ago

Moved to 0.22