Open qooalt opened 4 months ago
Is it feasible to run using -Xint
to see if the problem still occurs without the JIT enabled?
@pshipton, I just tried with -Xint
and it works.
@hzongaro fyi
@a7ehuo, may I ask you to look at this problem?
@qooalt Could you take a look at the steps and the logs to confirm what I observed is the same as what you saw? Thank you very much in advance!
mvn package
helloworldsvg\target\helloworldsvg.war
under C:\Program Files\Apache Software Foundation\Tomcat 10.1_Tomcat10-2\webapps
"C:\Program Files\Apache Software Foundation\Tomcat 10.1_Tomcat10-2\bin\Tomcat10-2w.exe"
General
, click Start
Hmmm… can't reach this page
and Tomcat Service Status is changed from Started
to Stopped
Apache Tomcat/10.1.26
page and Tomcat Service Status remains as Started
I don't see much in the logs yet under "C:\Program Files\Apache Software Foundation\Tomcat 10.1_Tomcat10-2\logs"
. Attach the logs for your reference: logs.zip
Found the following in the Event Viewer:
Faulting application name: Tomcat10-2.exe, version: 1.4.0.0, time stamp: 0x664770c7
Faulting module name: ntdll.dll, version: 10.0.22621.3527, time stamp: 0x92b2df34
Exception code: 0xc0000409
Fault offset: 0x000000000006d435
Faulting process id: 0x0x13AC
Faulting application start time: 0x0x1DADD2E2DCF78EB
Faulting application path: C:\Program Files\Apache Software Foundation\Tomcat 10.1_Tomcat10-2\bin\Tomcat10-2.exe
Faulting module path: C:\Windows\SYSTEM32\ntdll.dll
Report Id: fa66b926-3b23-4ea9-9ae2-5ac5ef95ab6b
Faulting package full name:
Faulting package-relative application ID:
So far I narrowed it down to one compiled method java/net/URL.<init>(Ljava/net/URL;Ljava/lang/String;Ljava/net/URLStreamHandler;)V
.
Bad
-Xjit:limitFile=(C:\Users\Administrator\Desktop\Issue-19892\limit.txt,547,547),verbose,vlog=C:\Users\Administrator\Desktop\Issue-19892\vlog,{java/net/URL.<init>(Ljava/net/URL;Ljava/lang/String;Ljava/net/URLStreamHandler;)V}(traceFull,traceInlining,debugInliner,lastOptIndex=16,lastOptSubIndex=4,log=C:\Users\Administrator\Desktop\Issue-19892\jitlog)
Good
-Xjit:limitFile=(C:\Users\Administrator\Desktop\Issue-19892\limit.txt,547,547),verbose,vlog=C:\Users\Administrator\Desktop\Issue-19892\vlog,{java/net/URL.<init>(Ljava/net/URL;Ljava/lang/String;Ljava/net/URLStreamHandler;)V}(traceFull,traceInlining,debugInliner,lastOptIndex=16,lastOptSubIndex=3,log=C:\Users\Administrator\Desktop\Issue-19892\jitlog)
And it's possibly related to lastOptIndex=16,lastOptSubIndex=4
inlining java/lang/String.length()I
into URL.<init>
.
[ 14] 16.4 O^O INLINER: Inlining qwerty java/lang/String.length()I into java/net/URL.<init>(Ljava/net/URL;Ljava/lang/String;Ljava/net/URLStreamHandler;)V with a virtual guard kind=HCRGuard type=NonoverriddenTest partialInline=0
@qooalt May I ask you to run with the following java option to exclude the method and see if the problem goes away for you? If the problem goes away, at least it confirms we are looking at the right method. Thank you very much!
-Xjit:exclude={java/net/URL.<init>(Ljava/net/URL;Ljava/lang/String;Ljava/net/URLStreamHandler;)V}
-Xjit:exclude={java/net/URL.
(Ljava/net/URL;Ljava/lang/String;Ljava/net/URLStreamHandler;)V}
@a7ehuo, confirmed that with this java option it works.
@qooalt Could you take a look at the steps and the logs to confirm what I observed is the same as what you saw? Thank you very much in advance!
- step 1: installed Tomcat 10.1.26
- step 2: run
mvn package
- step 3: copy
helloworldsvg\target\helloworldsvg.war
underC:\Program Files\Apache Software Foundation\Tomcat 10.1_Tomcat10-2\webapps
step 4
- start
"C:\Program Files\Apache Software Foundation\Tomcat 10.1_Tomcat10-2\bin\Tomcat10-2w.exe"
- Under
General
, clickStart
- Accessing http://localhost:8080/helloworldsvg, I got
Hmmm… can't reach this page
and Tomcat Service Status is changed fromStarted
toStopped
- Accessing http://localhost:8080 is fine, which brings up
Apache Tomcat/10.1.26
page and Tomcat Service Status remains asStarted
I don't see much in the logs yet under
"C:\Program Files\Apache Software Foundation\Tomcat 10.1_Tomcat10-2\logs"
. Attach the logs for your reference: logs.zipFound the following in the Event Viewer:
Faulting application name: Tomcat10-2.exe, version: 1.4.0.0, time stamp: 0x664770c7 Faulting module name: ntdll.dll, version: 10.0.22621.3527, time stamp: 0x92b2df34 Exception code: 0xc0000409 Fault offset: 0x000000000006d435 Faulting process id: 0x0x13AC Faulting application start time: 0x0x1DADD2E2DCF78EB Faulting application path: C:\Program Files\Apache Software Foundation\Tomcat 10.1_Tomcat10-2\bin\Tomcat10-2.exe Faulting module path: C:\Windows\SYSTEM32\ntdll.dll Report Id: fa66b926-3b23-4ea9-9ae2-5ac5ef95ab6b Faulting package full name: Faulting package-relative application ID:
Yes, that's exactly it.
I compared the passed case [1] and the failed case [2]. The difference between them in IL trees is block_67
that inlines String.length()
.
java/lang/String.length()I
into java/net/URL.<init>
java/lang/String.length()I
but do not inline java/lang/String.coder()B
I attached the debugger to the running Tomcat10-2.exe
and the first unhanded exception is [3]
(1f30.2ac8): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
00007ffd`44a001bc f60000 test byte ptr [rax],0 ds:00000000`00000000=??
Stepping by instruction, the next instruction hits exception c0000409
:
(1f30.2ac8): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
00007ffd`44a001bc f60000 test byte ptr [rax],0 ds:00000000`00000000=??
0:019> t
t
(1f30.2ac8): Security check failure or stack buffer overrun - code c0000409 (!!! second chance !!!)
Subcode: 0xd FAST_FAIL_INVALID_SET_OF_CONTEXT
ntdll!RtlGuardRestoreContext+0x65:
00007ffd`dd5fd915 cd29 int 29h
0:019> kp
kp
# Child-SP RetAddr Call Site
00 00000000`00707490 00007ffd`dd63404c ntdll!RtlGuardRestoreContext+0x65
01 00000000`007074c0 00007ffd`44a001bc ntdll!KiUserExceptionDispatch+0x3c
02 00000000`00707bd0 00000000`ffffffff 0x00007ffd`44a001bc
03 00000000`00707bd8 00000000`00026250 0xffffffff
04 00000000`00707be0 00000000`00038bd0 0x26250
05 00000000`00707be8 00000000`00000000 0x38bd0
This test
instruction (00007ffd44a001bc
) is part of the generated code for NULLCHK on parm spec
for URL.<init>
in block_67
when String.length
is inlined [4]. spec
is stored in rsp+0x88
(0x707bd0+0x88
=0x707C58
) or rax
. It shows spec
is NULL [5]. The NULLCHK implementation is introduced in https://github.com/eclipse-openj9/openj9/pull/5708 in 2019. It accesses the object and a trap will be generated if the object reference is NULL, which should result in the appropriate null pointer exception.
What is introduced in https://github.com/apache/commons-daemon/commit/fed36896cbac1c7b524a047f4e973228d2d41ab7 is that it enables Control Flow Guard (CFG): -guard:cf
. My limited understanding of it is that it puts restrictions on the memory that a program can execute. I just suspect using test
instruction dereferencing a null address that causes access violation might not work well with CFG, although I am not clear on how exactly CFG handles this particular case.
I also tried adding disableTraps
in -Xjit options, which disables trap instructions. Then the generated NULLCHK will have an explicit check on object reference and explicit handling of raising the NPE [6]. The crash goes away.
[1] Passed case
n2n BBStart <block_5> (freq 2543) [0x00007DF42C192CD0] bci=[-1,49,614] rc=0 vc=122 vn=- li=- udi=- nc=0
n79n NULLCHK on n77n [#32] [0x00007DF42C1D0870] bci=[-1,50,614] rc=0 vc=122 vn=- li=- udi=- nc=1
n78n icall java/lang/String.length()I[#438 final virtual Method -64] [flags 0x20500 0x0 ] [0x00007DF42C1D0820] bci=[-1,50,614] rc=2 vc=123 vn=- li=- udi=- nc=1
n77n aload spec<parm 2 Ljava/lang/String;>[#424 Parm] [flags 0x40000107 0x0 ] [0x00007DF42C1D07D0] bci=[-1,49,614] rc=1 vc=122 vn=- li=- udi=- nc=0
n80n istore limit<auto slot 6>[#439 Auto] [flags 0x3 0x0 ] [0x00007DF42C1D08C0] bci=[-1,53,614] rc=0 vc=122 vn=- li=- udi=- nc=1
n78n ==>icall
n1n BBEnd </block_5> ===== [0x00007DF42C192C80] bci=[-1,53,614] rc=0 vc=122 vn=- li=- udi=- nc=0
n6n BBStart <block_6> (freq 2553) [0x00007DF42C192E10] bci=[-1,55,615] rc=0 vc=122 vn=- li=- udi=- nc=0
n90n ificmple --> block_9 BBStart at n4n () [0x00007DF42C1D0BE0] bci=[-1,57,615] rc=0 vc=122 vn=- li=- udi=- nc=2 flg=0x20
n86n iload limit<auto slot 6>[#439 Auto] [flags 0x3 0x0 ] [0x00007DF42C1D0AA0] bci=[-1,55,615] rc=1 vc=122 vn=- li=- udi=- nc=0
n87n iconst 0 (X==0 X>=0 X<=0 ) [0x00007DF42C1D0AF0] bci=[-1,57,615] rc=1 vc=122 vn=- li=- udi=- nc=0 flg=0x302
n5n BBEnd </block_6> ===== [0x00007DF42C192DC0] bci=[-1,57,615] rc=0 vc=122 vn=- li=- udi=- nc=0
[2] Failed case
n2n BBStart <block_5> (freq 4081) [0x00007DF468282CD0] bci=[-1,49,614] rc=0 vc=119 vn=- li=- udi=- nc=0
n1970n treetop [0x00007DF468485780] bci=[-1,49,614] rc=0 vc=0 vn=- li=- udi=- nc=1
n77n aload spec<parm 2 Ljava/lang/String;>[#424 Parm] [flags 0x40000107 0x0 ] [0x00007DF4682C07D0] bci=[-1,49,614] rc=1 vc=119 vn=- li=- udi=- nc=0
n1980n ificmpne --> block_69 BBStart at n1977n (HCRGuard/NonoverriddenTest ) [0x00007DF468485AA0] bci=[2,0,2090] rc=0 vc=0 vn=- li=- udi=- nc=2 flg=0x1020
n1978n iload unknown static[#708 Static] [flags 0x10303 0x0 ] [0x00007DF468485A00] bci=[-1,50,614] rc=1 vc=0 vn=- li=- udi=- nc=0
n1979n iconst 0 (X==0 X>=0 X<=0 ) [0x00007DF468485A50] bci=[-1,50,614] rc=1 vc=0 vn=- li=- udi=- nc=0 flg=0x302
n1973n BBEnd </block_5> ===== [0x00007DF468485870] bci=[2,1,2090] rc=0 vc=0 vn=- li=- udi=- nc=0
n1972n BBStart <block_67> (freq 4081) [0x00007DF468485820] bci=[2,1,2090] rc=0 vc=0 vn=- li=- udi=- nc=0
n1985n NULLCHK on n1987n [#32] [0x00007DF468485C30] bci=[-1,49,614] rc=0 vc=0 vn=- li=- udi=- nc=1
n1984n PassThrough [0x00007DF468485BE0] bci=[-1,49,614] rc=1 vc=0 vn=- li=- udi=- nc=1
n1987n aload spec<parm 2 Ljava/lang/String;>[#424 Parm] [flags 0x40000107 0x0 ] [0x00007DF468485CD0] bci=[-1,50,614] rc=1 vc=0 vn=- li=- udi=- nc=0
n1962n compressedRefs [0x00007DF468485500] bci=[2,1,2090] rc=0 vc=0 vn=- li=- udi=- nc=2
n1960n aloadi java/lang/String.value [B[#705 final Shadow +4] [flags 0x400a0607 0x0 ] [0x00007DF468485460] bci=[2,1,2090] rc=2 vc=0 vn=- li=- udi=- nc=1
n1959n aload spec<parm 2 Ljava/lang/String;>[#424 Parm] [flags 0x40000107 0x0 ] (X!=0 ) [0x00007DF468485410] bci=[2,0,2090] rc=1 vc=0 vn=- li=- udi=- nc=0 flg=0x4
n1961n lconst 0 (highWordZero X==0 X>=0 X<=0 ) [0x00007DF4684854B0] bci=[2,1,2090] rc=1 vc=0 vn=- li=- udi=- nc=0 flg=0x4302
n1964n treetop [0x00007DF4684855A0] bci=[2,4,2090] rc=0 vc=0 vn=- li=- udi=- nc=1
n1963n arraylength [0x00007DF468485550] bci=[2,4,2090] rc=2 vc=0 vn=- li=- udi=- nc=1
n1960n ==>aloadi
n1967n treetop [0x00007DF468485690] bci=[2,6,2090] rc=0 vc=0 vn=- li=- udi=- nc=1
n1966n icall java/lang/String.coder()B[#707 final virtual Method -600] [flags 0x20500 0x0 ] [0x00007DF468485640] bci=[2,6,2090] rc=2 vc=120 vn=- li=- udi=- nc=1
n1965n aload spec<parm 2 Ljava/lang/String;>[#424 Parm] [flags 0x40000107 0x0 ] (X!=0 ) [0x00007DF4684855F0] bci=[2,5,2090] rc=1 vc=0 vn=- li=- udi=- nc=0 flg=0x4
n1971n istore limit<auto slot 6>[#439 Auto] [flags 0x3 0x0 ] [0x00007DF4684857D0] bci=[-1,53,614] rc=0 vc=119 vn=- li=- udi=- nc=1
n1968n ishr [0x00007DF4684856E0] bci=[2,9,2090] rc=1 vc=0 vn=- li=- udi=- nc=2
n1963n ==>arraylength
n1966n ==>icall
n1975n BBEnd </block_67> ===== [0x00007DF468485910] bci=[-1,50,614] rc=0 vc=0 vn=- li=- udi=- nc=0
n1974n BBStart <block_68> (freq 4081) [0x00007DF4684858C0] bci=[-1,50,614] rc=0 vc=0 vn=- li=- udi=- nc=0
n1n BBEnd </block_68> ===== [0x00007DF468282C80] bci=[-1,53,614] rc=0 vc=119 vn=- li=- udi=- nc=0
n6n BBStart <block_6> (freq 3161) [0x00007DF468282E10] bci=[-1,55,615] rc=0 vc=119 vn=- li=- udi=- nc=0
n90n ificmple --> block_9 BBStart at n4n () [0x00007DF4682C0BE0] bci=[-1,57,615] rc=0 vc=119 vn=- li=- udi=- nc=2 flg=0x20
n86n iload limit<auto slot 6>[#439 Auto] [flags 0x3 0x0 ] [0x00007DF4682C0AA0] bci=[-1,55,615] rc=1 vc=119 vn=- li=- udi=- nc=0
n87n iconst 0 (X==0 X>=0 X<=0 ) [0x00007DF4682C0AF0] bci=[-1,57,615] rc=1 vc=119 vn=- li=- udi=- nc=0 flg=0x302
n5n BBEnd </block_6> ===== [0x00007DF468282DC0] bci=[-1,57,615] rc=0 vc=119 vn=- li=- udi=- nc=0
[3]
0:034> gn
gn
ModLoad: 00000262`ea320000 00000262`ea4b0000 C:\Program Files\Issue-19892\ibm-semeru-open-jdk_x64_windows_17.0.11_9_openj9-0.44.0\jdk-17.0.11+9\bin\awt.dll
ModLoad: 00007ffd`dcab0000 00007ffd`dcb87000 C:\Windows\System32\OLEAUT32.dll
ModLoad: 00007ffd`d78d0000 00007ffd`d7967000 C:\Windows\SYSTEM32\apphelp.dll
ModLoad: 00007ffd`bc6b0000 00007ffd`bc943000 C:\Windows\WinSxS\amd64_microsoft.windows.common-controls_6595b64144ccf1df_6.0.22621.3672_none_2713b9d173822955\COMCTL32.dll
ModLoad: 00007ffd`d7de0000 00007ffd`d7e8f000 C:\Windows\system32\uxtheme.dll
ModLoad: 00007ffd`dd3f0000 00007ffd`dd543000 C:\Windows\System32\MSCTF.dll
ModLoad: 00007ffd`dc640000 00007ffd`dc7e5000 C:\Windows\System32\ole32.dll
ModLoad: 00007ffd`d8020000 00007ffd`d804b000 C:\Windows\system32\DWMAPI.DLL
ModLoad: 00007ffd`a4a90000 00007ffd`a4b17000 C:\Program Files\Issue-19892\ibm-semeru-open-jdk_x64_windows_17.0.11_9_openj9-0.44.0\jdk-17.0.11+9\bin\freetype.dll
ModLoad: 00000262`eafa0000 00000262`eb082000 C:\Program Files\Issue-19892\ibm-semeru-open-jdk_x64_windows_17.0.11_9_openj9-0.44.0\jdk-17.0.11+9\bin\fontmanager.dll
(1f30.2ac8): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
00007ffd`44a001bc f60000 test byte ptr [rax],0 ds:00000000`00000000=??
0:019> kp
kp
WARNING: Stack overflow detected. The unwound frames are extracted from outside normal stack bounds.
# Child-SP RetAddr Call Site
00 00000000`00707bd0 00000000`ffffffff 0x00007ffd`44a001bc
01 00000000`00707bd8 00000000`00026250 0xffffffff
02 00000000`00707be0 00000000`00038bd0 0x26250
03 00000000`00707be8 00000000`00000000 0x38bd0
0:019> t
t
(1f30.2ac8): Security check failure or stack buffer overrun - code c0000409 (!!! second chance !!!)
Subcode: 0xd FAST_FAIL_INVALID_SET_OF_CONTEXT
ntdll!RtlGuardRestoreContext+0x65:
00007ffd`dd5fd915 cd29 int 29h
0:019> kp
kp
# Child-SP RetAddr Call Site
00 00000000`00707490 00007ffd`dd63404c ntdll!RtlGuardRestoreContext+0x65
01 00000000`007074c0 00007ffd`44a001bc ntdll!KiUserExceptionDispatch+0x3c
02 00000000`00707bd0 00000000`ffffffff 0x00007ffd`44a001bc
03 00000000`00707bd8 00000000`00026250 0xffffffff
04 00000000`00707be0 00000000`00038bd0 0x26250
05 00000000`00707be8 00000000`00000000 0x38bd0
[4]
0x00007FFD44A001B4 0000014c [0x00007DF46877FDB0] 48 8b 84 24 88 00 00 00 mov rax, qword ptr [rsp+0x88] ; L8RegMem, SymRef spec<parm 2 Ljava/lang/String;>[#877 Parm +136] [flags 0xc0000107 0x0 ]
0x00007FFD44A001BC 00000154 [0x00007DF46877FED0] f6 00 00 test byte ptr [rax], 0x00 ; TEST1MemImm1
------------------------------
n1985n ( 0) NULLCHK on n1987n [#32] [0x00007DF468485C30] bci=[-1,49,614] rc=0 vc=179 vn=- li=67 udi=- nc=1
n1984n ( 0) PassThrough [0x00007DF468485BE0] bci=[-1,49,614] rc=0 vc=179 vn=- li=67 udi=64816 nc=1
n1987n ( 0) aload spec<parm 2 Ljava/lang/String;>[#424 Parm] [flags 0xc0000107 0x0 ] (in &GPR_0176) (X!=0 ) [0x00007DF468485CD0] bci=[-1,50,614] rc=0 vc=179 vn=- li=67 udi=64816 nc=0 flg=0x4
------------------------------
[0x00007DF46877FDB0] mov &GPR_0176, qword ptr [vfp+0x10] ; L8RegMem, SymRef spec<parm 2 Ljava/lang/String;>[#877 Parm] [flags 0xc0000107 0x0 ]
[0x00007DF46877FED0] test byte ptr [&GPR_0176], 0x00 ; TEST1MemImm1
[5]
0:019> r
r
rax=0000000000000000 rbx=0000000000000085 rcx=00000000fe9e2010
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
rip=00007ffd44a001bc rsp=0000000000707bd0 rbp=0000000000689a00
r8=0000000000000020 r9=0000000000000000 r10=0000000000000058
r11=00007ffdb8f2e420 r12=0000000000000015 r13=00007ffdb8f10000
r14=0000000000533e00 r15=0000000000000000
iopl=0 nv up ei pl zr na po nc
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010246
00007ffd`44a001bc f60000 test byte ptr [rax],0 ds:00000000`00000000=??
0:019> dq 0x707C58 L10
dq 0x707C58 L10
00000000`00707c58 00000000`00000000 00000000`00000000
00000000`00707c68 00000000`fe9e2010 00000000`fe9e2010
00000000`00707c78 00000000`000d49b8 00000262`d506c75b
00000000`00707c88 00000000`00707cc8 00000000`00000000
00000000`00707c98 00000000`00000000 00000000`fe9e2010
00000000`00707ca8 00000000`00819b70 00000262`cb8d5c39
00000000`00707cb8 00000000`00707d40 00000000`00000000
00000000`00707cc8 00000000`fe9e2010 00000000`fe9e2010
[6]
[0x00007DF472F00400] cmp qword ptr [rsp+0x88], 0x00000000 ; CMP8MemImms, SymRef spec<parm 2 Ljava/lang/String;>[#863 Parm +136] [flags 0xc0000107 0x0 ]
[0x00007DF472F00500] je Snippet Label L0225 ; JE4 ; (Check Failure Snippet)
What is introduced in apache/commons-daemon@fed3689 is that it enables Control Flow Guard (CFG):
-guard:cf
. My limited understanding of it is that it puts restrictions on the memory that a program can execute. I just suspect usingtest
instruction dereferencing a null address that causes access violation might not work well with CFG, although I am not clear on how exactly CFG handles this particular case.I also tried adding
disableTraps
in -Xjit options, which disables trap instructions. Then the generated NULLCHK will have an explicit check on object reference and explicit handling of raising the NPE [6]. The crash goes away.
@a7ehuo, not sure to get that. Are you saying that this case should be allowed, and the problem is in Apache Commons and/or in how Windows (or the compiler) handles the situation when CFG is enabled?
@a7ehuo, not sure to get that. Are you saying that this case should be allowed, and the problem is in Apache Commons and/or in how Windows (or the compiler) handles the situation when CFG is enabled?
I'm still looking. I'll update when I have some answer
Just an update on what I found on how implicit NULLCHK fails CFG:
# 20 Id: 32a8.1a18 Suspend: 1 Teb: 00000018`d7980000 Unfrozen
# Child-SP RetAddr Call Site
00 00000000`006f8890 00007ffd`dd63404c ntdll!RtlGuardRestoreContext+0x65
01 00000000`006f88c0 00007ffd`2fa001b4 ntdll!KiUserExceptionDispatch+0x3c
02 00000000`006f8ff0 00000000`ffffffff 0x00007ffd`2fa001b4 //<--- test byte ptr [rax],0 ds:00000000`00000000=??
03 00000000`006f8ff8 00000000`00026250 0xffffffff
04 00000000`006f9000 00000000`00038bd0 0x26250
05 00000000`006f9008 00000000`00000000 0x38bd0
Sequence of calls
implicit NULLCHK (`test` access the first page)
+-> ntdll!KiUserExceptionDispatch
+-> ntdll!RtlGuardRestoreContext
+-> ntdll!RtlGuardIsValidStackPointer check if SP is within the limit
If CFG is enabled, RtlGuardIsValidStackPointer
is invoked in RtlGuardRestoreContext
to check if the stack pointer is within the system stack limit specified in TEB (Thread Environment Block). In our implicit NULLCHK case where the test
instruction accesses the first page, after the trap happens, the stack pointer is outside the stack limit and fails RtlGuardIsValidStackPointer
.
When the test instruction (0x00007ffd2fa001b4
) causes a SIGSEGV, the SP is 0x00000000006f8ff0
0:020> r
r
rax=0000000000000000 rbx=0000000000000085 rcx=00000000fecf14d0
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
rip=00007ffd2fa001b4 rsp=00000000006f8ff0 rbp=0000000000691c00
r8=0000000000000020 r9=0000000000000000 r10=0000012c1e90aba0
r11=0000012c1e90abb0 r12=0000000000000015 r13=00007ffd521f0000
r14=0000000000534200 r15=0000000000000000
iopl=0 nv up ei pl zr na po nc
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010246
00007ffd`2fa001b4 f60000 test byte ptr [rax],0 ds:00000000`00000000=??
TEB.StackLimit is 0x00000018d8e00000
0:020> !teb
!teb
TEB at 00000018d7980000
ExceptionList: 0000000000000000
StackBase: 00000018d8e80000
StackLimit: 00000018d8e00000 //<------ used by RtlGuardIsValidStackPointer to check SP
SubSystemTib: 0000000000000000
FiberData: 0000000000001e00
ArbitraryUserPointer: 0000000000000000
Self: 00000018d7980000
EnvironmentPointer: 0000000000000000
ClientId: 00000000000032a8 . 0000000000001a18
RpcHandle: 0000000000000000
Tls Storage: 0000012c1e9099e0
PEB Address: 00000018d7955000
LastErrorValue: 0
LastStatusValue: c000003a
Count Owned Locks: 0
HardErrorMode: 0
In RtlGuardIsValidStackPointer
:
The SP (0x6f8ff0
) is saved in rcx
, TEB.StackLimit (0x18d8e00000
) is in rax
.
(1) Makes sure SP (0x6f8ff0
) is >= TEB.StackLimit (0x18d8e00000
) and <= TEB.StackBase (0x18d8e80000
).
(2) Not sure of this step, it checks if some saved pointer on the stack is the same as TEB.StackBase (0x18d8e80000
)
(3) Check if SP (0x6f8ff0
) >= (TEB.StackLimit (0x18d8e00000
) - 0x1000
)
The Java stack pointer SP (0x6f8ff0
) when the trap happens doesn’t fit into the system stack limits required by CFG RtlGuardIsValidStackPointer
. After the three checks fail, RtlGuardIsValidStackPointer
returns false and that is where RtlGuardRestoreContext
throws the exception.
ntdll!RtlGuardIsValidStackPointer:
00007ffd`dd5fd9a8 4c8bdc mov r11, rsp
00007ffd`dd5fd9ab 49895310 mov qword ptr [r11+10h], rdx
00007ffd`dd5fd9af 65488b142530000000 mov rdx, qword ptr gs:[30h]
00007ffd`dd5fd9b8 488b8278140000 mov rax, qword ptr [rdx+1478h]
00007ffd`dd5fd9bf 49894320 mov qword ptr [r11+20h], rax
00007ffd`dd5fd9c3 488b4208 mov rax, qword ptr [rdx+8]
00007ffd`dd5fd9c7 49894310 mov qword ptr [r11+10h], rax
00007ffd`dd5fd9cb 488b4210 mov rax, qword ptr [rdx+10h]
00007ffd`dd5fd9cf 49894318 mov qword ptr [r11+18h], rax
00007ffd`dd5fd9d3 483bc8 cmp rcx, rax // (1) rcx=00000000006f8ff0, rax=00000018d8e00000
00007ffd`dd5fd9d6 720d jb ntdll!RtlGuardIsValidStackPointer+0x3d (7ffddd5fd9e5) //---+
|
00007ffd`dd5fd9d8 493b4b10 cmp rcx, qword ptr [r11+10h] |
00007ffd`dd5fd9dc 7707 ja ntdll!RtlGuardIsValidStackPointer+0x3d (7ffddd5fd9e5) |
00007ffd`dd5fd9de b801000000 mov eax, 1 |
00007ffd`dd5fd9e3 eb24 jmp ntdll!RtlGuardIsValidStackPointer+0x61 (7ffddd5fda09) |
|
00007ffd`dd5fd9e5 483b442420 cmp rax, qword ptr [rsp+20h] //<--+ (2) rax=00000018d8e00000, rsp=00000000006f8888, [0x6f8888+20h]=00000018d8d80000
00007ffd`dd5fd9ea 7419 je ntdll!RtlGuardIsValidStackPointer+0x5d (7ffddd5fda05)
00007ffd`dd5fd9ec 482d00100000 sub rax, 1000h // (3) rax=00000018d8e00000
00007ffd`dd5fd9f2 483bc8 cmp rcx, rax // rcx=00000000006f8ff0, rax=00000018d8dff000
00007ffd`dd5fd9f5 720e jb ntdll!RtlGuardIsValidStackPointer+0x5d (7ffddd5fda05) //---+
00007ffd`dd5fd9f7 483b4c2410 cmp rcx, qword ptr [rsp+10h] |
00007ffd`dd5fd9fc 7707 ja ntdll!RtlGuardIsValidStackPointer+0x5d (7ffddd5fda05) |
00007ffd`dd5fd9fe b801000000 mov eax, 1 |
00007ffd`dd5fda03 eb04 jmp ntdll!RtlGuardIsValidStackPointer+0x61 (7ffddd5fda09) |
00007ffd`dd5fda05 eb00 jmp ntdll!RtlGuardIsValidStackPointer+0x5f (7ffddd5fda07) //<--+ return 0
00007ffd`dd5fda07 33c0 xor eax, eax
00007ffd`dd5fda09 c3 ret
@a7ehuo, is there anything the JIT (or the JVM) can do about this? Moving this out to 0.49 for now. . . .
@a7ehuo, is there anything the JIT (or the JVM) can do about this? ...
Unfortunately, I have not found anything that the JIT (or the JVM) can do about this case when CFG is enabled
Hi! I've also seen this using tomcat 9.0.94. So it's not only on tomcat 10. For now I just disabled CFG from Exploit programs in Windows Settings.
Hi @a7ehuo , I see this PR getting referenced from cases so I just wanted to check on a couple of things.
If the problem happens because the Java SP is outside of the bounds of the C SP listed in the TEB, this sounds like it should be the normal case for most of our compiled code. Is there any reason to think it would be limited to a small number of jit'd methods?
If the problem stems from the implicit NP checks, would -Xrs:sync
be an effective work-around?
Oh, looks like it's not implicit nullchecks but a modified explicit nullcheck? I read more and found your reference to #5708 and the suggestion to use disableTraps
.
If the problem happens because the Java SP is outside of the bounds of the C SP listed in the TEB, this sounds like it should be the normal case for most of our compiled code. Is there any reason to think it would be limited to a small number of jit'd methods?
This is also what my understanding is, but I don't know why it limits to a number of JIT'd methods
If the problem stems from the implicit NP checks, would -Xrs:sync be an effective work-around?
Here are a fews of workarounds
-Xrs
-Xjit:disableTraps
Tomcat service in Windows with openJ9 crashes when certain Java code is executed. This seems to happen because starting with 1.4.0, Commons Daemon enables the Control Flow Guard flag, and some specific java code executed by the openj9 runtime triggers it, killing the process. The problem does occur when the code attached is executed, but not only with it. This is just the simplest way I found to recreate the problem.
The same code works with Tomcat 10.1.24 (which uses Commons Daemon 1.3.4, if I'm not wrong), but crashes both with 10.1.25 and 10.1.26 (both using Commons Daemon 1.4.0).
This same code does work fine using Temurin runtime, with the same Java version as the used with Semeru openJ9 (17.0.11_9 x64)
Steps to reproduce:
No significant messages are written in the log files, and the error shown in the event viewer shows the following information:
Faulting application name: Tomcat10125.exe, version: 1.4.0.0, time stamp: 0x664770c7 Faulting module name: ntdll.dll, version: 10.0.22621.3733, time stamp: 0x67ca8829 Exception code: 0xc0000409 Fault offset: 0x000000000006d915 Faulting process id: 0x0xCF60 Faulting application start time: 0x0x1DAD29C07116CFB Faulting application path: C:\Program Files\Apache Software Foundation\Tomcat 10.1_Tomcat10125\bin\Tomcat10125.exe Faulting module path: C:\WINDOWS\SYSTEM32\ntdll.dll Report Id: 59f71851-e580-428b-84ce-d1ac220970f4 Faulting package full name: Faulting package-relative application ID:
As said, based on my research, the issue seems to be related to this commit in Commons Daemon.
https://github.com/apache/commons-daemon/commit/fed36896cbac1c7b524a047f4e973228d2d41ab7
The tests I performed:
I built prunsrv.exe in debug mode, replaced Tomcat10.exe by it, and it does not crash.
I built prunsrv.exe in release mode, replaced Tomcat10.exe by it, and it does crash.
I built prunsrv.exe in release mode commenting out the Control Flow Guard flag introduced in that commit, and it does not crash.
I installed replaced Semeru by Temurin, same Java version, and it does not crash.
For your reference, I first reported the bug in Tomcat's bugzilla: https://bz.apache.org/bugzilla/show_bug.cgi?id=69180
And then in Commons Daemon, where they claim this is an openj9 issue, not a Commons Daemon one:
https://issues.apache.org/jira/browse/DAEMON-465 helloworldsvg.zip