eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.29k stars 722 forks source link

prunsrv.exe (Tomcat10.exe) killed when run on top of openj9 #19892

Open qooalt opened 4 months ago

qooalt commented 4 months ago

Tomcat service in Windows with openJ9 crashes when certain Java code is executed. This seems to happen because starting with 1.4.0, Commons Daemon enables the Control Flow Guard flag, and some specific java code executed by the openj9 runtime triggers it, killing the process. The problem does occur when the code attached is executed, but not only with it. This is just the simplest way I found to recreate the problem.

The same code works with Tomcat 10.1.24 (which uses Commons Daemon 1.3.4, if I'm not wrong), but crashes both with 10.1.25 and 10.1.26 (both using Commons Daemon 1.4.0).

This same code does work fine using Temurin runtime, with the same Java version as the used with Semeru openJ9 (17.0.11_9 x64)

Steps to reproduce:

  1. Download 10.1.25 or 10.1.26 Windows Installer from https://tomcat.apache.org/download-10.cgi and install it.
  2. Uncompress and build the attached maven project (helloworldsvg.zip).
  3. Deploy the war file generated in step 2 onto the Tomcat installed in 1.
  4. Visit http://localhost:8080/helloworldsvg

No significant messages are written in the log files, and the error shown in the event viewer shows the following information:

Faulting application name: Tomcat10125.exe, version: 1.4.0.0, time stamp: 0x664770c7 Faulting module name: ntdll.dll, version: 10.0.22621.3733, time stamp: 0x67ca8829 Exception code: 0xc0000409 Fault offset: 0x000000000006d915 Faulting process id: 0x0xCF60 Faulting application start time: 0x0x1DAD29C07116CFB Faulting application path: C:\Program Files\Apache Software Foundation\Tomcat 10.1_Tomcat10125\bin\Tomcat10125.exe Faulting module path: C:\WINDOWS\SYSTEM32\ntdll.dll Report Id: 59f71851-e580-428b-84ce-d1ac220970f4 Faulting package full name: Faulting package-relative application ID:

As said, based on my research, the issue seems to be related to this commit in Commons Daemon.

https://github.com/apache/commons-daemon/commit/fed36896cbac1c7b524a047f4e973228d2d41ab7

The tests I performed:

I built prunsrv.exe in debug mode, replaced Tomcat10.exe by it, and it does not crash.

I built prunsrv.exe in release mode, replaced Tomcat10.exe by it, and it does crash.

I built prunsrv.exe in release mode commenting out the Control Flow Guard flag introduced in that commit, and it does not crash.

I installed replaced Semeru by Temurin, same Java version, and it does not crash.

For your reference, I first reported the bug in Tomcat's bugzilla: https://bz.apache.org/bugzilla/show_bug.cgi?id=69180

And then in Commons Daemon, where they claim this is an openj9 issue, not a Commons Daemon one:

https://issues.apache.org/jira/browse/DAEMON-465 helloworldsvg.zip

pshipton commented 4 months ago

Is it feasible to run using -Xint to see if the problem still occurs without the JIT enabled?

qooalt commented 4 months ago

@pshipton, I just tried with -Xint and it works.

pshipton commented 4 months ago

@hzongaro fyi

hzongaro commented 4 months ago

@a7ehuo, may I ask you to look at this problem?

a7ehuo commented 4 months ago

@qooalt Could you take a look at the steps and the logs to confirm what I observed is the same as what you saw? Thank you very much in advance!

I don't see much in the logs yet under "C:\Program Files\Apache Software Foundation\Tomcat 10.1_Tomcat10-2\logs". Attach the logs for your reference: logs.zip

Found the following in the Event Viewer:

Faulting application name: Tomcat10-2.exe, version: 1.4.0.0, time stamp: 0x664770c7
Faulting module name: ntdll.dll, version: 10.0.22621.3527, time stamp: 0x92b2df34
Exception code: 0xc0000409
Fault offset: 0x000000000006d435
Faulting process id: 0x0x13AC
Faulting application start time: 0x0x1DADD2E2DCF78EB
Faulting application path: C:\Program Files\Apache Software Foundation\Tomcat 10.1_Tomcat10-2\bin\Tomcat10-2.exe
Faulting module path: C:\Windows\SYSTEM32\ntdll.dll
Report Id: fa66b926-3b23-4ea9-9ae2-5ac5ef95ab6b
Faulting package full name: 
Faulting package-relative application ID: 
a7ehuo commented 4 months ago

So far I narrowed it down to one compiled method java/net/URL.<init>(Ljava/net/URL;Ljava/lang/String;Ljava/net/URLStreamHandler;)V.

Bad

-Xjit:limitFile=(C:\Users\Administrator\Desktop\Issue-19892\limit.txt,547,547),verbose,vlog=C:\Users\Administrator\Desktop\Issue-19892\vlog,{java/net/URL.<init>(Ljava/net/URL;Ljava/lang/String;Ljava/net/URLStreamHandler;)V}(traceFull,traceInlining,debugInliner,lastOptIndex=16,lastOptSubIndex=4,log=C:\Users\Administrator\Desktop\Issue-19892\jitlog)

Good

-Xjit:limitFile=(C:\Users\Administrator\Desktop\Issue-19892\limit.txt,547,547),verbose,vlog=C:\Users\Administrator\Desktop\Issue-19892\vlog,{java/net/URL.<init>(Ljava/net/URL;Ljava/lang/String;Ljava/net/URLStreamHandler;)V}(traceFull,traceInlining,debugInliner,lastOptIndex=16,lastOptSubIndex=3,log=C:\Users\Administrator\Desktop\Issue-19892\jitlog)

And it's possibly related to lastOptIndex=16,lastOptSubIndex=4 inlining java/lang/String.length()I into URL.<init>.

[    14]  16.4    O^O INLINER: Inlining qwerty java/lang/String.length()I into java/net/URL.<init>(Ljava/net/URL;Ljava/lang/String;Ljava/net/URLStreamHandler;)V with a virtual guard kind=HCRGuard type=NonoverriddenTest partialInline=0

@qooalt May I ask you to run with the following java option to exclude the method and see if the problem goes away for you? If the problem goes away, at least it confirms we are looking at the right method. Thank you very much!

-Xjit:exclude={java/net/URL.<init>(Ljava/net/URL;Ljava/lang/String;Ljava/net/URLStreamHandler;)V}
qooalt commented 4 months ago

-Xjit:exclude={java/net/URL.(Ljava/net/URL;Ljava/lang/String;Ljava/net/URLStreamHandler;)V}

@a7ehuo, confirmed that with this java option it works.

qooalt commented 4 months ago

@qooalt Could you take a look at the steps and the logs to confirm what I observed is the same as what you saw? Thank you very much in advance!

  • step 1: installed Tomcat 10.1.26
  • step 2: run mvn package
  • step 3: copy helloworldsvg\target\helloworldsvg.war under C:\Program Files\Apache Software Foundation\Tomcat 10.1_Tomcat10-2\webapps
  • step 4

    • start "C:\Program Files\Apache Software Foundation\Tomcat 10.1_Tomcat10-2\bin\Tomcat10-2w.exe"
    • Under General, click Start
    • Accessing http://localhost:8080/helloworldsvg, I got Hmmm… can't reach this page and Tomcat Service Status is changed from Started to Stopped
    • Accessing http://localhost:8080 is fine, which brings up Apache Tomcat/10.1.26 page and Tomcat Service Status remains as Started

I don't see much in the logs yet under "C:\Program Files\Apache Software Foundation\Tomcat 10.1_Tomcat10-2\logs". Attach the logs for your reference: logs.zip

Found the following in the Event Viewer:

Faulting application name: Tomcat10-2.exe, version: 1.4.0.0, time stamp: 0x664770c7
Faulting module name: ntdll.dll, version: 10.0.22621.3527, time stamp: 0x92b2df34
Exception code: 0xc0000409
Fault offset: 0x000000000006d435
Faulting process id: 0x0x13AC
Faulting application start time: 0x0x1DADD2E2DCF78EB
Faulting application path: C:\Program Files\Apache Software Foundation\Tomcat 10.1_Tomcat10-2\bin\Tomcat10-2.exe
Faulting module path: C:\Windows\SYSTEM32\ntdll.dll
Report Id: fa66b926-3b23-4ea9-9ae2-5ac5ef95ab6b
Faulting package full name: 
Faulting package-relative application ID: 

Yes, that's exactly it.

a7ehuo commented 4 months ago

I compared the passed case [1] and the failed case [2]. The difference between them in IL trees is block_67 that inlines String.length().

I attached the debugger to the running Tomcat10-2.exe and the first unhanded exception is [3]

(1f30.2ac8): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
00007ffd`44a001bc f60000          test    byte ptr [rax],0 ds:00000000`00000000=??

Stepping by instruction, the next instruction hits exception c0000409:

(1f30.2ac8): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
00007ffd`44a001bc f60000          test    byte ptr [rax],0 ds:00000000`00000000=??

0:019> t
t
(1f30.2ac8): Security check failure or stack buffer overrun - code c0000409 (!!! second chance !!!)
Subcode: 0xd FAST_FAIL_INVALID_SET_OF_CONTEXT 
ntdll!RtlGuardRestoreContext+0x65:
00007ffd`dd5fd915 cd29            int     29h
0:019> kp
kp
 # Child-SP          RetAddr               Call Site
00 00000000`00707490 00007ffd`dd63404c     ntdll!RtlGuardRestoreContext+0x65
01 00000000`007074c0 00007ffd`44a001bc     ntdll!KiUserExceptionDispatch+0x3c
02 00000000`00707bd0 00000000`ffffffff     0x00007ffd`44a001bc
03 00000000`00707bd8 00000000`00026250     0xffffffff
04 00000000`00707be0 00000000`00038bd0     0x26250
05 00000000`00707be8 00000000`00000000     0x38bd0

This test instruction (00007ffd44a001bc) is part of the generated code for NULLCHK on parm spec for URL.<init> in block_67 when String.length is inlined [4]. spec is stored in rsp+0x88 (0x707bd0+0x88=0x707C58 ) or rax. It shows spec is NULL [5]. The NULLCHK implementation is introduced in https://github.com/eclipse-openj9/openj9/pull/5708 in 2019. It accesses the object and a trap will be generated if the object reference is NULL, which should result in the appropriate null pointer exception.

What is introduced in https://github.com/apache/commons-daemon/commit/fed36896cbac1c7b524a047f4e973228d2d41ab7 is that it enables Control Flow Guard (CFG): -guard:cf. My limited understanding of it is that it puts restrictions on the memory that a program can execute. I just suspect using test instruction dereferencing a null address that causes access violation might not work well with CFG, although I am not clear on how exactly CFG handles this particular case.

I also tried adding disableTraps in -Xjit options, which disables trap instructions. Then the generated NULLCHK will have an explicit check on object reference and explicit handling of raising the NPE [6]. The crash goes away.

[1] Passed case

n2n       BBStart <block_5> (freq 2543)                                                       [0x00007DF42C192CD0] bci=[-1,49,614] rc=0 vc=122 vn=- li=- udi=- nc=0
n79n      NULLCHK on n77n [#32]                                                               [0x00007DF42C1D0870] bci=[-1,50,614] rc=0 vc=122 vn=- li=- udi=- nc=1
n78n        icall  java/lang/String.length()I[#438  final virtual Method -64] [flags 0x20500 0x0 ]  [0x00007DF42C1D0820] bci=[-1,50,614] rc=2 vc=123 vn=- li=- udi=- nc=1
n77n          aload  spec<parm 2 Ljava/lang/String;>[#424  Parm] [flags 0x40000107 0x0 ]      [0x00007DF42C1D07D0] bci=[-1,49,614] rc=1 vc=122 vn=- li=- udi=- nc=0
n80n      istore  limit<auto slot 6>[#439  Auto] [flags 0x3 0x0 ]                             [0x00007DF42C1D08C0] bci=[-1,53,614] rc=0 vc=122 vn=- li=- udi=- nc=1
n78n        ==>icall
n1n       BBEnd </block_5> =====                                                              [0x00007DF42C192C80] bci=[-1,53,614] rc=0 vc=122 vn=- li=- udi=- nc=0

n6n       BBStart <block_6> (freq 2553)                                                       [0x00007DF42C192E10] bci=[-1,55,615] rc=0 vc=122 vn=- li=- udi=- nc=0
n90n      ificmple --> block_9 BBStart at n4n ()                                              [0x00007DF42C1D0BE0] bci=[-1,57,615] rc=0 vc=122 vn=- li=- udi=- nc=2 flg=0x20
n86n        iload  limit<auto slot 6>[#439  Auto] [flags 0x3 0x0 ]                            [0x00007DF42C1D0AA0] bci=[-1,55,615] rc=1 vc=122 vn=- li=- udi=- nc=0
n87n        iconst 0 (X==0 X>=0 X<=0 )                                                        [0x00007DF42C1D0AF0] bci=[-1,57,615] rc=1 vc=122 vn=- li=- udi=- nc=0 flg=0x302
n5n       BBEnd </block_6> =====                                                              [0x00007DF42C192DC0] bci=[-1,57,615] rc=0 vc=122 vn=- li=- udi=- nc=0

[2] Failed case

n2n       BBStart <block_5> (freq 4081)                                                       [0x00007DF468282CD0] bci=[-1,49,614] rc=0 vc=119 vn=- li=- udi=- nc=0
n1970n    treetop                                                                             [0x00007DF468485780] bci=[-1,49,614] rc=0 vc=0 vn=- li=- udi=- nc=1
n77n        aload  spec<parm 2 Ljava/lang/String;>[#424  Parm] [flags 0x40000107 0x0 ]        [0x00007DF4682C07D0] bci=[-1,49,614] rc=1 vc=119 vn=- li=- udi=- nc=0
n1980n    ificmpne --> block_69 BBStart at n1977n (HCRGuard/NonoverriddenTest )               [0x00007DF468485AA0] bci=[2,0,2090] rc=0 vc=0 vn=- li=- udi=- nc=2 flg=0x1020
n1978n      iload  unknown static[#708  Static] [flags 0x10303 0x0 ]                          [0x00007DF468485A00] bci=[-1,50,614] rc=1 vc=0 vn=- li=- udi=- nc=0
n1979n      iconst 0 (X==0 X>=0 X<=0 )                                                        [0x00007DF468485A50] bci=[-1,50,614] rc=1 vc=0 vn=- li=- udi=- nc=0 flg=0x302
n1973n    BBEnd </block_5> =====                                                              [0x00007DF468485870] bci=[2,1,2090] rc=0 vc=0 vn=- li=- udi=- nc=0

n1972n    BBStart <block_67> (freq 4081)                                                      [0x00007DF468485820] bci=[2,1,2090] rc=0 vc=0 vn=- li=- udi=- nc=0
n1985n    NULLCHK on n1987n [#32]                                                             [0x00007DF468485C30] bci=[-1,49,614] rc=0 vc=0 vn=- li=- udi=- nc=1
n1984n      PassThrough                                                                       [0x00007DF468485BE0] bci=[-1,49,614] rc=1 vc=0 vn=- li=- udi=- nc=1
n1987n        aload  spec<parm 2 Ljava/lang/String;>[#424  Parm] [flags 0x40000107 0x0 ]      [0x00007DF468485CD0] bci=[-1,50,614] rc=1 vc=0 vn=- li=- udi=- nc=0
n1962n    compressedRefs                                                                      [0x00007DF468485500] bci=[2,1,2090] rc=0 vc=0 vn=- li=- udi=- nc=2
n1960n      aloadi  java/lang/String.value [B[#705  final Shadow +4] [flags 0x400a0607 0x0 ]  [0x00007DF468485460] bci=[2,1,2090] rc=2 vc=0 vn=- li=- udi=- nc=1
n1959n        aload  spec<parm 2 Ljava/lang/String;>[#424  Parm] [flags 0x40000107 0x0 ] (X!=0 )  [0x00007DF468485410] bci=[2,0,2090] rc=1 vc=0 vn=- li=- udi=- nc=0 flg=0x4
n1961n      lconst 0 (highWordZero X==0 X>=0 X<=0 )                                           [0x00007DF4684854B0] bci=[2,1,2090] rc=1 vc=0 vn=- li=- udi=- nc=0 flg=0x4302
n1964n    treetop                                                                             [0x00007DF4684855A0] bci=[2,4,2090] rc=0 vc=0 vn=- li=- udi=- nc=1
n1963n      arraylength                                                                       [0x00007DF468485550] bci=[2,4,2090] rc=2 vc=0 vn=- li=- udi=- nc=1
n1960n        ==>aloadi
n1967n    treetop                                                                             [0x00007DF468485690] bci=[2,6,2090] rc=0 vc=0 vn=- li=- udi=- nc=1
n1966n      icall  java/lang/String.coder()B[#707  final virtual Method -600] [flags 0x20500 0x0 ]  [0x00007DF468485640] bci=[2,6,2090] rc=2 vc=120 vn=- li=- udi=- nc=1
n1965n        aload  spec<parm 2 Ljava/lang/String;>[#424  Parm] [flags 0x40000107 0x0 ] (X!=0 )  [0x00007DF4684855F0] bci=[2,5,2090] rc=1 vc=0 vn=- li=- udi=- nc=0 flg=0x4
n1971n    istore  limit<auto slot 6>[#439  Auto] [flags 0x3 0x0 ]                             [0x00007DF4684857D0] bci=[-1,53,614] rc=0 vc=119 vn=- li=- udi=- nc=1
n1968n      ishr                                                                              [0x00007DF4684856E0] bci=[2,9,2090] rc=1 vc=0 vn=- li=- udi=- nc=2
n1963n        ==>arraylength
n1966n        ==>icall
n1975n    BBEnd </block_67> =====                                                             [0x00007DF468485910] bci=[-1,50,614] rc=0 vc=0 vn=- li=- udi=- nc=0

n1974n    BBStart <block_68> (freq 4081)                                                      [0x00007DF4684858C0] bci=[-1,50,614] rc=0 vc=0 vn=- li=- udi=- nc=0
n1n       BBEnd </block_68> =====                                                             [0x00007DF468282C80] bci=[-1,53,614] rc=0 vc=119 vn=- li=- udi=- nc=0

n6n       BBStart <block_6> (freq 3161)                                                       [0x00007DF468282E10] bci=[-1,55,615] rc=0 vc=119 vn=- li=- udi=- nc=0
n90n      ificmple --> block_9 BBStart at n4n ()                                              [0x00007DF4682C0BE0] bci=[-1,57,615] rc=0 vc=119 vn=- li=- udi=- nc=2 flg=0x20
n86n        iload  limit<auto slot 6>[#439  Auto] [flags 0x3 0x0 ]                            [0x00007DF4682C0AA0] bci=[-1,55,615] rc=1 vc=119 vn=- li=- udi=- nc=0
n87n        iconst 0 (X==0 X>=0 X<=0 )                                                        [0x00007DF4682C0AF0] bci=[-1,57,615] rc=1 vc=119 vn=- li=- udi=- nc=0 flg=0x302
n5n       BBEnd </block_6> =====                                                              [0x00007DF468282DC0] bci=[-1,57,615] rc=0 vc=119 vn=- li=- udi=- nc=0

[3]

0:034> gn
gn
ModLoad: 00000262`ea320000 00000262`ea4b0000   C:\Program Files\Issue-19892\ibm-semeru-open-jdk_x64_windows_17.0.11_9_openj9-0.44.0\jdk-17.0.11+9\bin\awt.dll
ModLoad: 00007ffd`dcab0000 00007ffd`dcb87000   C:\Windows\System32\OLEAUT32.dll
ModLoad: 00007ffd`d78d0000 00007ffd`d7967000   C:\Windows\SYSTEM32\apphelp.dll
ModLoad: 00007ffd`bc6b0000 00007ffd`bc943000   C:\Windows\WinSxS\amd64_microsoft.windows.common-controls_6595b64144ccf1df_6.0.22621.3672_none_2713b9d173822955\COMCTL32.dll
ModLoad: 00007ffd`d7de0000 00007ffd`d7e8f000   C:\Windows\system32\uxtheme.dll
ModLoad: 00007ffd`dd3f0000 00007ffd`dd543000   C:\Windows\System32\MSCTF.dll
ModLoad: 00007ffd`dc640000 00007ffd`dc7e5000   C:\Windows\System32\ole32.dll
ModLoad: 00007ffd`d8020000 00007ffd`d804b000   C:\Windows\system32\DWMAPI.DLL
ModLoad: 00007ffd`a4a90000 00007ffd`a4b17000   C:\Program Files\Issue-19892\ibm-semeru-open-jdk_x64_windows_17.0.11_9_openj9-0.44.0\jdk-17.0.11+9\bin\freetype.dll
ModLoad: 00000262`eafa0000 00000262`eb082000   C:\Program Files\Issue-19892\ibm-semeru-open-jdk_x64_windows_17.0.11_9_openj9-0.44.0\jdk-17.0.11+9\bin\fontmanager.dll
(1f30.2ac8): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
00007ffd`44a001bc f60000          test    byte ptr [rax],0 ds:00000000`00000000=??

0:019> kp
kp
WARNING: Stack overflow detected. The unwound frames are extracted from outside normal stack bounds.
 # Child-SP          RetAddr               Call Site
00 00000000`00707bd0 00000000`ffffffff     0x00007ffd`44a001bc
01 00000000`00707bd8 00000000`00026250     0xffffffff
02 00000000`00707be0 00000000`00038bd0     0x26250
03 00000000`00707be8 00000000`00000000     0x38bd0

0:019> t
t
(1f30.2ac8): Security check failure or stack buffer overrun - code c0000409 (!!! second chance !!!)
Subcode: 0xd FAST_FAIL_INVALID_SET_OF_CONTEXT 
ntdll!RtlGuardRestoreContext+0x65:
00007ffd`dd5fd915 cd29            int     29h
0:019> kp
kp
 # Child-SP          RetAddr               Call Site
00 00000000`00707490 00007ffd`dd63404c     ntdll!RtlGuardRestoreContext+0x65
01 00000000`007074c0 00007ffd`44a001bc     ntdll!KiUserExceptionDispatch+0x3c
02 00000000`00707bd0 00000000`ffffffff     0x00007ffd`44a001bc
03 00000000`00707bd8 00000000`00026250     0xffffffff
04 00000000`00707be0 00000000`00038bd0     0x26250
05 00000000`00707be8 00000000`00000000     0x38bd0

[4]

0x00007FFD44A001B4 0000014c [0x00007DF46877FDB0] 48 8b 84 24 88 00 00 00    mov rax, qword ptr [rsp+0x88]       ; L8RegMem, SymRef  spec<parm 2 Ljava/lang/String;>[#877  Parm +136] [flags 0xc0000107 0x0 ]
0x00007FFD44A001BC 00000154 [0x00007DF46877FED0] f6 00 00                   test    byte ptr [rax], 0x00    ; TEST1MemImm1
------------------------------
 n1985n   (  0)  NULLCHK on n1987n [#32]                                                              [0x00007DF468485C30] bci=[-1,49,614] rc=0 vc=179 vn=- li=67 udi=- nc=1
 n1984n   (  0)    PassThrough                                                                        [0x00007DF468485BE0] bci=[-1,49,614] rc=0 vc=179 vn=- li=67 udi=64816 nc=1
 n1987n   (  0)      aload  spec<parm 2 Ljava/lang/String;>[#424  Parm] [flags 0xc0000107 0x0 ] (in &GPR_0176) (X!=0 )  [0x00007DF468485CD0] bci=[-1,50,614] rc=0 vc=179 vn=- li=67 udi=64816 nc=0 flg=0x4
------------------------------

 [0x00007DF46877FDB0]   mov &GPR_0176, qword ptr [vfp+0x10]     ; L8RegMem, SymRef  spec<parm 2 Ljava/lang/String;>[#877  Parm] [flags 0xc0000107 0x0 ]
 [0x00007DF46877FED0]   test    byte ptr [&GPR_0176], 0x00  ; TEST1MemImm1

[5]

0:019> r
r
rax=0000000000000000 rbx=0000000000000085 rcx=00000000fe9e2010
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
rip=00007ffd44a001bc rsp=0000000000707bd0 rbp=0000000000689a00
 r8=0000000000000020  r9=0000000000000000 r10=0000000000000058
r11=00007ffdb8f2e420 r12=0000000000000015 r13=00007ffdb8f10000
r14=0000000000533e00 r15=0000000000000000
iopl=0         nv up ei pl zr na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010246
00007ffd`44a001bc f60000          test    byte ptr [rax],0 ds:00000000`00000000=??
0:019> dq 0x707C58 L10
dq 0x707C58 L10
00000000`00707c58  00000000`00000000 00000000`00000000
00000000`00707c68  00000000`fe9e2010 00000000`fe9e2010
00000000`00707c78  00000000`000d49b8 00000262`d506c75b
00000000`00707c88  00000000`00707cc8 00000000`00000000
00000000`00707c98  00000000`00000000 00000000`fe9e2010
00000000`00707ca8  00000000`00819b70 00000262`cb8d5c39
00000000`00707cb8  00000000`00707d40 00000000`00000000
00000000`00707cc8  00000000`fe9e2010 00000000`fe9e2010

[6]

 [0x00007DF472F00400]   cmp qword ptr [rsp+0x88], 0x00000000    ; CMP8MemImms, SymRef  spec<parm 2 Ljava/lang/String;>[#863  Parm +136] [flags 0xc0000107 0x0 ]
 [0x00007DF472F00500]   je  Snippet Label L0225     ; JE4   ; (Check Failure Snippet)
qooalt commented 4 months ago

What is introduced in apache/commons-daemon@fed3689 is that it enables Control Flow Guard (CFG): -guard:cf. My limited understanding of it is that it puts restrictions on the memory that a program can execute. I just suspect using test instruction dereferencing a null address that causes access violation might not work well with CFG, although I am not clear on how exactly CFG handles this particular case.

I also tried adding disableTraps in -Xjit options, which disables trap instructions. Then the generated NULLCHK will have an explicit check on object reference and explicit handling of raising the NPE [6]. The crash goes away.

@a7ehuo, not sure to get that. Are you saying that this case should be allowed, and the problem is in Apache Commons and/or in how Windows (or the compiler) handles the situation when CFG is enabled?

a7ehuo commented 4 months ago

@a7ehuo, not sure to get that. Are you saying that this case should be allowed, and the problem is in Apache Commons and/or in how Windows (or the compiler) handles the situation when CFG is enabled?

I'm still looking. I'll update when I have some answer

a7ehuo commented 3 months ago

Just an update on what I found on how implicit NULLCHK fails CFG:

# 20  Id: 32a8.1a18 Suspend: 1 Teb: 00000018`d7980000 Unfrozen
 # Child-SP          RetAddr               Call Site
00 00000000`006f8890 00007ffd`dd63404c     ntdll!RtlGuardRestoreContext+0x65
01 00000000`006f88c0 00007ffd`2fa001b4     ntdll!KiUserExceptionDispatch+0x3c
02 00000000`006f8ff0 00000000`ffffffff     0x00007ffd`2fa001b4  //<--- test    byte ptr [rax],0 ds:00000000`00000000=??
03 00000000`006f8ff8 00000000`00026250     0xffffffff
04 00000000`006f9000 00000000`00038bd0     0x26250
05 00000000`006f9008 00000000`00000000     0x38bd0
Sequence of calls

   implicit NULLCHK (`test` access the first page)
      +-> ntdll!KiUserExceptionDispatch
            +-> ntdll!RtlGuardRestoreContext  
                  +-> ntdll!RtlGuardIsValidStackPointer check if SP is within the limit

If CFG is enabled, RtlGuardIsValidStackPointer is invoked in RtlGuardRestoreContext to check if the stack pointer is within the system stack limit specified in TEB (Thread Environment Block). In our implicit NULLCHK case where the test instruction accesses the first page, after the trap happens, the stack pointer is outside the stack limit and fails RtlGuardIsValidStackPointer.

When the test instruction (0x00007ffd2fa001b4) causes a SIGSEGV, the SP is 0x00000000006f8ff0

0:020> r
r
rax=0000000000000000 rbx=0000000000000085 rcx=00000000fecf14d0
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
rip=00007ffd2fa001b4 rsp=00000000006f8ff0 rbp=0000000000691c00
 r8=0000000000000020  r9=0000000000000000 r10=0000012c1e90aba0
r11=0000012c1e90abb0 r12=0000000000000015 r13=00007ffd521f0000
r14=0000000000534200 r15=0000000000000000
iopl=0         nv up ei pl zr na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010246
00007ffd`2fa001b4 f60000          test    byte ptr [rax],0 ds:00000000`00000000=??

TEB.StackLimit is 0x00000018d8e00000

0:020> !teb
!teb
TEB at 00000018d7980000
    ExceptionList:        0000000000000000
    StackBase:            00000018d8e80000
    StackLimit:           00000018d8e00000 //<------ used by RtlGuardIsValidStackPointer to check SP
    SubSystemTib:         0000000000000000
    FiberData:            0000000000001e00
    ArbitraryUserPointer: 0000000000000000
    Self:                 00000018d7980000
    EnvironmentPointer:   0000000000000000
    ClientId:             00000000000032a8 . 0000000000001a18
    RpcHandle:            0000000000000000
    Tls Storage:          0000012c1e9099e0
    PEB Address:          00000018d7955000
    LastErrorValue:       0
    LastStatusValue:      c000003a
    Count Owned Locks:    0
    HardErrorMode:        0

In RtlGuardIsValidStackPointer: The SP (0x6f8ff0) is saved in rcx, TEB.StackLimit (0x18d8e00000) is in rax. (1) Makes sure SP (0x6f8ff0) is >= TEB.StackLimit (0x18d8e00000) and <= TEB.StackBase (0x18d8e80000). (2) Not sure of this step, it checks if some saved pointer on the stack is the same as TEB.StackBase (0x18d8e80000) (3) Check if SP (0x6f8ff0) >= (TEB.StackLimit (0x18d8e00000) - 0x1000)

The Java stack pointer SP (0x6f8ff0) when the trap happens doesn’t fit into the system stack limits required by CFG RtlGuardIsValidStackPointer. After the three checks fail, RtlGuardIsValidStackPointer returns false and that is where RtlGuardRestoreContext throws the exception.

    ntdll!RtlGuardIsValidStackPointer:
00007ffd`dd5fd9a8 4c8bdc               mov     r11, rsp
00007ffd`dd5fd9ab 49895310             mov     qword ptr [r11+10h], rdx
00007ffd`dd5fd9af 65488b142530000000   mov     rdx, qword ptr gs:[30h]
00007ffd`dd5fd9b8 488b8278140000       mov     rax, qword ptr [rdx+1478h]
00007ffd`dd5fd9bf 49894320             mov     qword ptr [r11+20h], rax
00007ffd`dd5fd9c3 488b4208             mov     rax, qword ptr [rdx+8]
00007ffd`dd5fd9c7 49894310             mov     qword ptr [r11+10h], rax
00007ffd`dd5fd9cb 488b4210             mov     rax, qword ptr [rdx+10h]
00007ffd`dd5fd9cf 49894318             mov     qword ptr [r11+18h], rax
00007ffd`dd5fd9d3 483bc8               cmp     rcx, rax                                              // (1) rcx=00000000006f8ff0, rax=00000018d8e00000 
00007ffd`dd5fd9d6 720d                 jb      ntdll!RtlGuardIsValidStackPointer+0x3d (7ffddd5fd9e5) //---+ 
                                                                                                          |
00007ffd`dd5fd9d8 493b4b10             cmp     rcx, qword ptr [r11+10h]                                   |
00007ffd`dd5fd9dc 7707                 ja      ntdll!RtlGuardIsValidStackPointer+0x3d (7ffddd5fd9e5)      |
00007ffd`dd5fd9de b801000000           mov     eax, 1                                                     |
00007ffd`dd5fd9e3 eb24                 jmp     ntdll!RtlGuardIsValidStackPointer+0x61 (7ffddd5fda09)      |
                                                                                                          |
00007ffd`dd5fd9e5 483b442420           cmp     rax, qword ptr [rsp+20h]                              //<--+ (2) rax=00000018d8e00000, rsp=00000000006f8888, [0x6f8888+20h]=00000018d8d80000
00007ffd`dd5fd9ea 7419                 je      ntdll!RtlGuardIsValidStackPointer+0x5d (7ffddd5fda05)

00007ffd`dd5fd9ec 482d00100000         sub     rax, 1000h                                            // (3) rax=00000018d8e00000
00007ffd`dd5fd9f2 483bc8               cmp     rcx, rax                                              // rcx=00000000006f8ff0, rax=00000018d8dff000
00007ffd`dd5fd9f5 720e                 jb      ntdll!RtlGuardIsValidStackPointer+0x5d (7ffddd5fda05) //---+
00007ffd`dd5fd9f7 483b4c2410           cmp     rcx, qword ptr [rsp+10h]                                   |
00007ffd`dd5fd9fc 7707                 ja      ntdll!RtlGuardIsValidStackPointer+0x5d (7ffddd5fda05)      |
00007ffd`dd5fd9fe b801000000           mov     eax, 1                                                     |
00007ffd`dd5fda03 eb04                 jmp     ntdll!RtlGuardIsValidStackPointer+0x61 (7ffddd5fda09)      |
00007ffd`dd5fda05 eb00                 jmp     ntdll!RtlGuardIsValidStackPointer+0x5f (7ffddd5fda07) //<--+ return 0 
00007ffd`dd5fda07 33c0                 xor     eax, eax
00007ffd`dd5fda09 c3                   ret   
hzongaro commented 2 months ago

@a7ehuo, is there anything the JIT (or the JVM) can do about this? Moving this out to 0.49 for now. . . .

a7ehuo commented 2 months ago

@a7ehuo, is there anything the JIT (or the JVM) can do about this? ...

Unfortunately, I have not found anything that the JIT (or the JVM) can do about this case when CFG is enabled

bogdanrech commented 2 months ago

Hi! I've also seen this using tomcat 9.0.94. So it's not only on tomcat 10. For now I just disabled CFG from Exploit programs in Windows Settings.

JamesKingdon commented 1 week ago

Hi @a7ehuo , I see this PR getting referenced from cases so I just wanted to check on a couple of things. If the problem happens because the Java SP is outside of the bounds of the C SP listed in the TEB, this sounds like it should be the normal case for most of our compiled code. Is there any reason to think it would be limited to a small number of jit'd methods? If the problem stems from the implicit NP checks, would -Xrs:sync be an effective work-around?

JamesKingdon commented 1 week ago

Oh, looks like it's not implicit nullchecks but a modified explicit nullcheck? I read more and found your reference to #5708 and the suggestion to use disableTraps.

a7ehuo commented 1 week ago

If the problem happens because the Java SP is outside of the bounds of the C SP listed in the TEB, this sounds like it should be the normal case for most of our compiled code. Is there any reason to think it would be limited to a small number of jit'd methods?

This is also what my understanding is, but I don't know why it limits to a number of JIT'd methods

If the problem stems from the implicit NP checks, would -Xrs:sync be an effective work-around?

Here are a fews of workarounds