Infinite recursion causes a crash on FreeBSD

battleblow commented 5 years ago

This program:

public class InfiniteRecursion {
    public static void main(String[] args) {
        main(args);
    }
}

causes a VM crash on FreeBSD. On Linux, this results in a thrown StackOverflowError instead and a graceful exit. The same is true on OpenBSD, so this seems FreeBSD specific.

The crash on FreeBSD is due to a segfault:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000804a6c61e, pid=24543, tid=100836
#
# JRE version: OpenJDK Runtime Environment (11.0.4) (slowdebug build 11.0.4-internal+0-adhoc.glewis.openjdk-jdk11u)
# Java VM: OpenJDK 64-Bit Server VM (slowdebug 11.0.4-internal+0-adhoc.glewis.openjdk-jdk11u, mixed mode, tiered, compressed oops, g1 gc, bsd-amd64)
# Problematic frame:
# j  InfiniteRecursion.main([Ljava/lang/String;)V+0
#
# Core dump will be written. Default location: /usr/home/glewis/projects/java/jdk11/openjdk-jdk11u/java.core
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

I expect this to cause compliance test failures, although I haven't confirmed this yet. I've experimented with altering the stack size using -Xss, due to some earlier bug reports, but that doesn't change the behaviour.

My test environment is FreeBSD 11.3/amd64. I haven't tried other versions or architectures yet.

bsdkurt commented 5 years ago

@battleblow Most likely the root cause can be traced back into the signal handling in the src/hotspot/os_cpu/bsd_<cpu>/os_bsd_<cpu>.cpp file. Roughly speaking, the jdk puts two levels of guard pages the the end of the stack (yellow zone, red zone), when a java program hits the end of the stack, the signal handling in that file it unprotects the yellow zone converts it into java.lang.StackOverflowError. You may want to try reverting my recent changes to the file to rule that out. Also perhaps the large page support may be related so a test with that reverted too may help rule that out as well.

battleblow commented 5 years ago

Agreed. I've been looking into the stack and signal handling code a little, but the confirmation is useful. Thank you.

The original report was actually for OpenJDK 8u144, I just wanted to verify it on 11 and found the same problem was still occurring. So at this point my thought is that it is something that has carried through for longer rather than a recent change. I will check out the changes you mention though.

battleblow commented 5 years ago

Working through the code block starting at

https://github.com/battleblow/openjdk-jdk11u/blob/master/src/hotspot/os_cpu/bsd_x86/os_bsd_x86.cpp#L509

The signal is caught, but fails the if (thread->in_stack_yellow_reserved_zone(addr)) { check, meaning that the stack overflow does not get triggered.

battleblow commented 5 years ago

Confirming the above. Adding || true (as a temporary test of the hypothesis) does result in a StackOverflowError. So it looks like the detection of the address being within the yellow zone is what is broken on FreeBSD

battleblow commented 5 years ago

Yellow zone check is a simple check that the address is less than the reserved base address but higher than the red zone base address. Kurt's comments above and the diagram in thread.hpp is useful in visualising this.

We're failing the first part of that check, i.e. the address is higher than the reserved base address. Basically, according to the JVM, we haven't yet dropped out of the standard thread stack and into the guard area. However, FreeBSD does believe we're accessing an address which is outside of the stack (hence the SEGV). Possible cause is a misalignment between the pthread idea of the stack and the JVM idea of it. That seems most likely. Less likely, we have the address wrong?

Note that the address is 4080 above the reserved base. This is suspiciously (to me) just less than 4K above it.

battleblow commented 5 years ago

Suspicious because this machine has a page size of 4K. This is also the default size of the guard region in the FreeBSD pthread library. So my suspicion is that we're misaligned in that FreeBSD is using it's default guard region, which the JVM doesn't expect since it believes it is managing the guard region. So we grow into the FreeBSD guard region, a SIGSEGV is thrown and the JVM concludes that we're not in the guard area yet and doesn't convert it to a stack overflow.

battleblow commented 5 years ago

Potentially relevant links in the Linux and AIX code:

https://github.com/battleblow/openjdk-jdk11u/blob/master/src/hotspot/os/linux/os_linux.cpp#L817 (read through to the call to pthread_attr_setguardsize())

https://github.com/battleblow/openjdk-jdk11u/blob/master/src/hotspot/os/aix/os_aix.cpp#L900

battleblow commented 5 years ago

Trying to use the AIX code, i.e. setting the guard size to zero for java and compiler threads, doesn't appear to make any difference. Note that the stack overflow check does check that the thread is a Java thread, so it is true that the thread involved here is a Java thread. The hs_err file also confirms this.

battleblow commented 5 years ago

Interestingly, after doing some debugging, the address of the thread being examined is not one being created in os_bsd.cpp. It is the main thread created separately in

https://github.com/battleblow/openjdk-jdk11u/blob/master/src/java.base/unix/native/libjli/java_md_solinux.c#L736

However, this clearly has a guard size set to zero.

battleblow commented 5 years ago

A couple of other notes:

Causing the infinite recursion to happen inside of a JVM created thread also reproduces the problem. So this isn't specific to the main thread.
Reading through the stack allocation code in thr_stack.c for FreeBSD's pthread implementation it doesn't appear to suffer the problem that Linux's NPTL threads do per the comment in os_linux.cpp.

battleblow commented 5 years ago

Inspected the code for other OSes, and the stack overflow detection logic is the same for Linux, AIX, and Solaris. So far, only FreeBSD appears to suffer from this problem and I can't find anything about the FreeBSD logic for stack sizing and set up which is different though.

Consistently the address is within a page size of the guard boundary. Will try printing out similar information on OpenBSD and see what it looks like.

battleblow commented 5 years ago

For the internal JDK test suite, this looks like it is the root cause of failures in the following tests:

test/jdk/java/lang/invoke/condy/CondyNestedResolutionTest.java
test/jdk/vm/runtime/ReflectStackOverflow.java
test/langtools/jdk/jshell/ExceptionsTest.java

bsdkurt commented 5 years ago

This could be a bug with pthread_attr_get_np. If that is not returning accurate info, the jvm will place the guard pages at the wrong locations and may result in the behavior you have described.

https://github.com/battleblow/openjdk-jdk11u/blob/bsd-port/src/hotspot/os/bsd/os_bsd.cpp#L4341

battleblow commented 5 years ago

Agreed. Comparing the math between the JVM and the FreeBSD pthread implementation is proving to be a little difficult, at least for me. Thanks for pointing out that section though, I'll add some additional trace output in that location and see if that brings some clarity.

battleblow commented 5 years ago

Here is an ascii picture of the stack of the crashing thread:

JVM thread stack

JVM Guard pages = JVM reserved zone + JVM yellow zone + JVM red zone (Total size: 16384)

-- <- Guard page bottom (libthr)
|
| Libthr guard pages (4096)
|
-- <- Stack bottom (libthr) (0x7fffdedee000|140736932536320) (Total size: 1048576)
|
| JVM Red pages (Size: 4096)
|
-- <- Stack red zone base (0x7fffdedef000|140736932540416)
|
| JVM Yellow pages (Size: 8192)
|
-- <- Stack yellow zone base (0x7fffdedf0000|140736932544512)
|
| JVM Reserved pages (Size: 4096)
|
-- <- Stack reserved zone base (0x7fffdedf2000|140736932552704)
|
| Fault address (0x7fffdedf2ff8|140736932556792) (4088 outside the reserved zone)
|
| Untouched memory + shadow zone + stack frames
|
-- <- Stack base (0x7fffdeeee000|140736933584896)

This matches predominantly the diagram in thread.hpp, which means it ends up being numerically reversed.

The key observation is that it all matches up. The stack size from libthr matches the difference in addresses between the stack address (bottom) from libthr and the stack base (top) from the JVM. Although this shouldn't be surprising since that is how the JVM calculates the stack base, IIRC. The JVM guard sizes and addresses also all match up (although the base address accessor methods in thread.hpp just do match based on the size of the relevant zones, so this doesn't say a lot).

So, despite not being able to find an error in the math, and the numbers seeming reasonable (stack size is 1M, addresses are page aligned, sizes are page multiples), the fault address is clearly outside the JVM guard zone (by less than a page).

I'm unsure how to handle this. I could, on FreeBSD, round the address up to the nearest page boundary (there are already utility methods to do this). This might be a hack though, since I don't have a deep enough understanding of how this works to be confident that it is a warranted step. Even so, it might be worthwhile for now while I dig into it deeper. I'd like to see the address on OpenBSD and see whether it is page aligned, since that would give me more confidence in such a change if it is, but my OpenBSD VM is causing some issues right now. Maybe I'll give that another try tomorrow.

battleblow commented 5 years ago

Unsurprisingly this works. I.e. modifying the address as follows:

#ifdef FreeBSD
  addr = align_down(addr, os::vm_page_size());
#endif

Results in stack overflow being thrown correctly on FreeBSD. The question is whether this is a reasonable thing to do.

bsdkurt commented 5 years ago

Nice analysis. I'm still on vacation so I can only make a guess right now. From your description I'm assuming that the address for Stack reserved zone base would be the last address that would not fault and Stack reserved zone base - 1 would. Based on what you have described it appears that an extra page has been mprotected PROT_NONE; the page at address for Stack reserved zone base. Can you do a syscall trace on the the test program and review the mprotect(2) calls on the stack?

battleblow commented 5 years ago

That's a good hypothesis. I really should have included mprotect(2) calls in what I was looking at.

I need to look at this again in the morning, but it seems ok at the moment.

The overflowing stack has a bottom address of 0x7fffdedf7000. The following look to be the relevant mprotect calls:

 2325 100356: 1.521600108 mprotect(0x7fffdeef7000,4096,PROT_NONE) = 0 (0x0)
 2325 101377: 1.597886795 mprotect(0x7fffdedf7000,16384,PROT_NONE) = 0 (0x0)
 2325 101377: 1.605237197 mprotect(0x7fffdedf8000,12288,PROT_READ|PROT_WRITE) = 0 (0x0)
 2325 101377: 1.605704001 mprotect(0x7fffdedf8000,12288,PROT_NONE) = 0 (0x0)

which looks to be:

Stack red zone (PROT_NONE)
Stack guard area (PROT_NONE)
Stack reserved + yellow zone (PROT_READ|PROT_WRITE)
Stack reserved + yellow zone (PROT_NONE)

At this point I don't see any other mprotect calls for this stack. The fault address in this case is 0x7fffdedfbfd8 and the stack base (top) is 0x7fffdeef7000.

battleblow commented 5 years ago

I looked this through again today, and these appear to be the only relevant mprotect calls. They all appear to only cover the JVM guard pages, which makes sense. I don't know why there are two initial calls (one for the red zone and then one for the whole guard zone). That seems like the first call is a waste, but it shouldn't cause issues nevertheless.

I'm going to post my findings to one of the FreeBSD mailing lists and see if anyone has any insight. I'll likely patch the ports tree with the hack above during that though.

bsdkurt commented 5 years ago

I wrote a standalone test program and believe I identified the root cause. Manually setting the guard pages on FreeBSD now results in what appears to be an automatic additional guard page being inserted by mmap, or perhaps it doesn't allow a MAP_STACK region to be adjacent to a !MAP_STACK region. I haven't dug into the FreeBSD mmap code to figure that part out.

Here is the standalone test program:

bsd_stack_test.c

build with clang -g -O0 bsd_stack_test.c -pthread and debug as follows:

(gdb) handle SIGSEGV nostop noprint
Signal        Stop  Print   Pass to program Description
SIGSEGV       No    No  Yes     Segmentation fault
(gdb) b 91
Breakpoint 1 at 0x21058c: file bsd_stack_test.c, line 92.
(gdb) r
Starting program: /usr/home/ec2-user/stack/a.out 
[New LWP 100487 of process 52800]
stack bottom:   0xffffbfdfe000
stack guard:    0xffffbfe02000
stack top:  0xffffbfffe000
SIGSEGV addr:   0xffffbfe02fff
expecting no SIGSEGV. stack guard failed
[Switching to LWP 100487 of process 52800]

Thread 2 hit Breakpoint 1, thread_start (arg=0x0) at bsd_stack_test.c:92
92    if (check_bounds(bottom, top)) {
(gdb) shell
[ec2-user@freebsd ~/stack]$ procstat -v 52800
  PID              START                END PRT  RES PRES REF SHD FLAG  TP PATH
<snip non-stack regions>
52800     0xffffbfdfe000     0xffffbfe02000 ---    0    0   0   0 ----- -- 
52800     0xffffbfe02000     0xffffbfe03000 ---    0    0   0   0 ----- -- 
52800     0xffffbfe03000     0xffffbfe1e000 rw-   27   27   1   0 ---D- df 
52800     0xffffbfe1e000     0xffffbfe3e000 rw-   32   32   1   0 ---D- df 
52800     0xffffbfe3e000     0xffffbfe5e000 rw-   32   32   1   0 ---D- df 
52800     0xffffbfe5e000     0xffffbfe7e000 rw-   32   32   1   0 ---D- df 
52800     0xffffbfe7e000     0xffffbfe9e000 rw-   32   32   1   0 ---D- df 
52800     0xffffbfe9e000     0xffffbfebe000 rw-   32   32   1   0 ---D- df 
52800     0xffffbfebe000     0xffffbfede000 rw-   32   32   1   0 ---D- df 
52800     0xffffbfede000     0xffffbfefe000 rw-   32   32   1   0 ---D- df 
52800     0xffffbfefe000     0xffffbff1e000 rw-   32   32   1   0 ---D- df 
52800     0xffffbff1e000     0xffffbff3e000 rw-   32   32   1   0 ---D- df 
52800     0xffffbff3e000     0xffffbff5e000 rw-   32   32   1   0 ---D- df 
52800     0xffffbff5e000     0xffffbff7e000 rw-   32   32   1   0 ---D- df 
52800     0xffffbff7e000     0xffffbff9e000 rw-   32   32   1   0 ---D- df 
52800     0xffffbff9e000     0xffffbffbe000 rw-   32   32   1   0 ---D- df 
52800     0xffffbffbe000     0xffffbffde000 rw-   32   32   1   0 ---D- df 
52800     0xffffbffde000     0xffffbfffe000 rw-   32   32   1   0 ---D- df 
52800     0xffffbfffe000     0xffffbffff000 ---    0    0   0   0 ----- -- 
52800     0xffffbffff000     0xfffffffdf000 ---    0    0   0   0 ----- -- 
52800     0xfffffffdf000     0xfffffffff000 rw-    3    3   1   0 ---D- df 
52800     0xfffffffff000    0x1000000000000 r-x    1    1  20   0 ----- ph

The above shows that there is a page at 0xffffbfe02000 that is mapped PROT_NONE, but the program didn't modify that stack page. The jvm is expecting the stack to be able to grow into that page and hit our manual stack guard region. This manual guarding technique used to work on FreeBSD, but now does not.

Perhaps you could raise this issue with FreeBSD kernel devs. I have not found a method to get around this issue.

bsdkurt commented 5 years ago

Here's a possible work-around that fixes the InfinteRecursion test case but needs more testing:

diff --git a/src/hotspot/os/bsd/os_bsd.cpp b/src/hotspot/os/bsd/os_bsd.cpp
index 87e8b7a20d..9eea54524f 100644
--- a/src/hotspot/os/bsd/os_bsd.cpp
+++ b/src/hotspot/os/bsd/os_bsd.cpp
@@ -2239,6 +2239,11 @@ bool os::pd_uncommit_memory(char* addr, size_t size) {
 }

 bool os::pd_create_stack_guard_pages(char* addr, size_t size) {
+#ifdef __FreeBSD__
+  // XXX hack to account for automatic PROT_NONE page at end of stack
+  if (size > Bsd::page_size())
+    size -= Bsd::page_size();
+#endif
   return os::commit_memory(addr, size, !ExecMem);
 }

bsdkurt commented 5 years ago

I noticed this code:

https://github.com/freebsd/freebsd/blob/master/lib/libthr/thread/thr_stack.c#L276

which just does a mprotect() over the bottom of MAP_STACK memory which I found didn't work in my testing. I adjusted the test program to demonstrate that pthread_attr_setguardsize() is also not working correctly on FreeBSD/aarch64 13-current at least.

pthread_attr_setguardsize_test.c

battleblow commented 5 years ago

FWIW, here is what that test program reports on my FreeBSD 11.3/amd64 box:

> ./a.out 
stack bottom:   0x7fffdfdfe000
stack top:      0x7fffdfffe000
SIGSEGV addr:   0x7fffdfdfefff
expecting SIGSEGV between 0x7fffdfdfd000 and 0x7fffdfdfe000. pthread_attr_setguardsize failed
pthread_attr_setguardsize failed

When I was testing this earlier I tried testing with a change taken from the AIX code to set the pthread guard size to zero (since the guard is unused by the JVM) and it made no difference. I also tried different stack sizes via -Xss (which only affects Java threads) and that also made no difference.

Here is a test Java program that I've also been using to actually test this in a spawned thread rather than the "main" thread:

public
class ThreadInfiniteRecursion {

    public static
    void main(String[] args) {
        Thread t = new Thread(new RecursiveThread());
        t.start();
        try {
            t.join();
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static
    class RecursiveThread
        implements Runnable {

        public
        void run() {
            run();
        }
    }
}

bsdkurt commented 5 years ago

I created the pthread_attr_setguardsize_test.c program to demonstrate that mprotect() on top of MAP_STACK pages doesn't work as expected. On aarch64 it segfaulted below the bottom, but not directly below the bottom of the stack - just on the page at bottom - 16k which I assumed is the kernel placed guard for the MAP_STACK region:

[ec2-user@freebsd ~/stack]$ ./a.out 
stack bottom:   0xffffbfdfe000
stack top:  0xffffbfffe000
SIGSEGV addr:   0xffffbfdfafff
expecting SIGSEGV between 0xffffbfdfd000 and 0xffffbfdfe000. pthread_attr_setguardsize failed
pthread_attr_setguardsize failed

I'm surprised by your amd64 results which showed it segfaulted on the page at the bottom of the stack. I expected the pthread guard area to be 16K below bottom, how it segfaulted at the bottom page is strange. Could you change just the size to 0 in the pthread_attr_setguardsize() call and show the results on amd64?

bsdkurt commented 5 years ago

Reading the kernel code I see that it does indeed set a separate guard page for MAP_STACK regions which is controlled by the sysctl security.bsd.stack_guard_page:

https://github.com/freebsd/freebsd/blob/master/sys/vm/vm_map.c#L4102

There seems to be a series of bugs related to that functionality: 1) mprotect(PROT_NONE) on top of MAP_STACK pages not yet touched by the kernel fails to set the pages as PROT_NONE. 2) setting pthread_attr_setguardsize() size to < security.bsd.stack_guard_page number of pages results in pthread_attr_getstack() returning a stack bottom that is PROT_NONE due to kernel setting security.bsd.stack_guard_page number of pages as guard pages. 3) setting pthread_attr_setguardsize() size > security.bsd.stack_guard_page number of pages doesn't work correctly as it relies on 1 which doesn't work. The result is a stack that can access the guard pages (except the bottom security.bsd.stack_guard_page number of pages).

I haven't verified all of these yet but from my current understanding of the code I think these issues exist. I'll do some additional testing to verify they are indeed problems.

edit: fix code ticks

bsdkurt commented 5 years ago

Confirmed: Using the test pthread_attr_setguardsize_test program from the previous comment:

pthread_attr_setguardsize_test.c

Results when setting security.bsd.stack_guard_page before running the test: sudo sysctl -w security.bsd.stack_guard_page=4 test passes sudo sysctl -w security.bsd.stack_guard_page=1 test fails with stack < bottom accessible sudo sysctl -w security.bsd.stack_guard_page=5 test fails with page at bottom not accessible

battleblow commented 5 years ago

FWIW, here is the output of the test program with setguardsize called with 0. It sounds like you've already got the root cause though :)

> ./a.out 
stack bottom:   0x7fffdfdfe000
stack top:      0x7fffdfffe000
SIGSEGV addr:   0x7fffdfdfefff
expecting SIGSEGV between 0x7fffdfdfd000 and 0x7fffdfdfe000. pthread_attr_setguardsize failed
pthread_attr_setguardsize failed

battleblow commented 5 years ago

I guess one key of this is that both of the hacks only work with the default setting of the sysctl. So we'll definitely need a more substantive change to take account of it.

I think this is the reference for the sysctl addition:

https://svnweb.freebsd.org/base?view=revision&revision=215307

bsdkurt commented 5 years ago

Thanks. That matches what I expected.

I think the issues I outlined above need to addressed by FreeBSD devs before a proper solution can be implemented in the jvm. At a minimum some work is needed to make pthreads return proper results from pthread_attr_getstack() and pthread_attr_setguardsize() fixed so that it can set the number of guard pages correctly after the stack has been allocated using MAP_STACK. Its pretty important for userland to be able to place its own guard pages (in addition to the any kernel/pthreads ones) and remove them so that techniques like the jvm uses work.

Here's an ASCII picture of what I think might work well:

==============
-- <- stack top (high address) 
|
| MAP_STACK pages
|
-- <- stack bottom (low address)
|
| MAP_STACK pages set PROT_NONE by pthreads based on `pthread_attr_setguardsize()` 
|
-- <- stack bottom - guard size
|
| kernel placed guard pages based on `sysctl security.bsd.stack_guard_page`
|
-- <-  end of stack region

This differs from what is currently implemented in two ways:

1) Currently if the kernel receives an mmap MAP_STACK for size X, it returns back to userland a usable stack of size X - sysctl security.bsd.stack_guard_page. This is problematic as I don't believe it possible for pthreads to account for the kernel placed stack guard pages without race conditions. I also believe it breaks the expectation from user of mmap that the kernel returns a stack of size X that is usable. By ensuring that kernel placed guard pages are in addition to the requested size it allows pthreads to ignore them and always return proper values for pthread_attr_getstack()

2) Currently the kernel doesn't allow userland to place its own guard pages (unless the page has already been grown into by the application). This is problematic for both pthreads and userland programs that place their own guard pages like the jvm. Ideally the kernel would respect an mprotect(PROT_NONE) call on pages of the stack that have have not yet been grown into. To be clear the jvm needs to be able to place its own PROT_NONE pages on the stack and change some them back to PROT_READ|PROT_WRITE to allow further stack growth into that space (to allow exceptions to unwind), and then set them back to PROT_NONE again later.

With those two changes both pthreads and the jvm could ignore any kernel placed guard pages and place their own guard pages using mprotect(PROT_NONE). pthreads would allocate the stack using mmap with MAP_STACK flag and size of pthread_attr_setstacksize + pthread_attr_setguardsize, then place its own guard pages with mprotect(PROT_NONE). The values returned from pthread_attr_getstack() would be straightforward to calculate. Actually I think there would be no change to pthreads code as this is what it currently doing. Both of the above changes are kernel changes.

Hopefully this bug report is helpful to FreeBSD kernel developers on what changes would make the pthreads stack guard pages, pthread_attr_getstack() results and the jvm stack exceeded exception handling working well.

klaus4 commented 5 years ago

maybe of interest, no idea: Infinityisfinite.txt

bsdkurt commented 5 years ago

@battleblow The two workarounds we have don't allow the page that faulted to be made read/write because it is a kernel guard page. This prevents the jvm from using the the reserved area of the stack after the initial segfault. The relevant test for this is:

test/hotspot/jtreg/runtime/ReservedStack

More can be read about the reasoning for this approach here:

https://openjdk.java.net/jeps/270

I tried writing a third approach that consisted of two parts that tried to workaround the kernel's MAP_STACK behavior:

Part 1: Set pthread_attr_setguardsize size to match the security.bsd.stack_guard_page size so that the kernel set guard pages are not in the usable stack that pthread_attr_getstack() returns.
Part 2: Read the pages at the end of the stack to fault them in which makes the subsequent call to mprotect() work.

This works but, there's a massive TOCTOU issue that I realized as I was writing detailed comments for the approach. libthr caches thread stacks, so part 1 where checking the security.bsd.stack_guard_page size is used to avoid the kernel's guard pages is not feasible. Take for example, the jvm starts and security.bsd.stack_guard_page is set to 16, some threads are allocated and cached. Then later security.bsd.stack_guard_page is reduced. If those cached stacks are returned back to the jvm, the current value of security.bsd.stack_guard_page will be less then when it was allocated which would make the solution break (can't mprotect the kernel guard pages, cant read from them either).

I think we really need FreeBSD kernel devs to look into a solution for the issues created by the automatic placement of stack guard pages. The test program:

pthread_attr_setguardsize_test.c

should provide context into how the current approach is problematic for both libthr guard pages and the jvm.

For reference here's the third workaround (missing the sysctl querying and fixed at 4096):

 diff --git a/src/hotspot/os/bsd/os_bsd.cpp b/src/hotspot/os/bsd/os_bsd.cpp
index 87e8b7a20d..4eda3e83c8 100644
--- a/src/hotspot/os/bsd/os_bsd.cpp
+++ b/src/hotspot/os/bsd/os_bsd.cpp
@@ -796,6 +796,23 @@ bool os::create_thread(Thread* thread, ThreadType thr_type,
   int status = pthread_attr_setstacksize(&attr, stack_size);
   assert_status(status == 0, status, "pthread_attr_setstacksize");

+#ifdef __FreeBSD__
+  // XXX Part 1 of work-around to get os::guard_memory() call to succeed.
+  // Set guard size to match sysctl value so that libthr places its guard
+  // pages on top of kernel placed guard pages. The result is that pthreads
+  // will return back to us a stack region that does not include the kernel
+  // guard pages. However, this is racy (TOCTOU). A well timed change to the
+  // sysctl after we check it but before libthr allocate the stack will result
+  // in guard pages in the usable stack range and ... oh no this wont work.
+  // libthr caches the thread stacks, so we don't have a way to know how many
+  // pages to skip. e.g. TOCTOU window could be days for a long running
+  // java app like tomcat.
+  status = pthread_attr_setguardsize(&attr, 4096);
+#else
+  status = pthread_attr_setguardsize(&attr, 0);
+#endif
+  assert_status(status == 0, status, "pthread_attr_setguardsize");
+
   ThreadState state;

   {
@@ -2238,10 +2255,6 @@ bool os::pd_uncommit_memory(char* addr, size_t size) {
   return res  != (uintptr_t) MAP_FAILED;
 }

-bool os::pd_create_stack_guard_pages(char* addr, size_t size) {
-  return os::commit_memory(addr, size, !ExecMem);
-}
-
 // If this is a growable mapping, remove the guard pages entirely by
 // munmap()ping them.  If not, just call uncommit_memory().
 bool os::remove_stack_guard_pages(char* addr, size_t size) {
@@ -2317,6 +2330,27 @@ bool os::protect_memory(char* addr, size_t bytes, ProtType prot,
   return bsd_mprotect(addr, bytes, p);
 }

+bool os::pd_create_stack_guard_pages(char* addr, size_t size) __attribute__ ((optnone)) {
+#ifdef __FreeBSD__
+  // XXX Part 2 of work-around to get os::guard_memory() call to succeed.
+  // This first call to bsd_mprotect is not effective until the pages are
+  // touched. It is needed because the ContinueInNewThread0 thread ends
+  // up placing the guard pages twice and the second time it is called the pages
+  // are PROT_NONE.
+  bsd_mprotect(addr, size, PROT_READ|PROT_WRITE);
+
+  // XXX Danger, Will Robinson! Read from the pages so that they get
+  // paged in and the later call to os::guard_memory() will succeed.
+  // If we hit a kernel placed guard page this will crash the jvm.
+  for (char *pos = addr; pos < addr + size; pos += Bsd::page_size()) {
+    char c = *pos;
+  }
+  return true;
+#else
+  return os::commit_memory(addr, size, !ExecMem);
+#endif
+}
+
 bool os::guard_memory(char* addr, size_t size) {
   return bsd_mprotect(addr, size, PROT_NONE);
 }
diff --git a/src/java.base/unix/native/libjli/java_md_solinux.c b/src/java.base/unix/native/libjli/java_md_solinux.c
index 2d675062a4..c350a50b1b 100644
--- a/src/java.base/unix/native/libjli/java_md_solinux.c
+++ b/src/java.base/unix/native/libjli/java_md_solinux.c
@@ -766,7 +766,11 @@ ContinueInNewThread0(int (JNICALL *continuation)(void *), jlong stack_size, void
     if (stack_size > 0) {
       pthread_attr_setstacksize(&attr, stack_size);
     }
+#ifdef __FreeBSD__
+    pthread_attr_setguardsize(&attr, 4096);
+#else
     pthread_attr_setguardsize(&attr, 0); // no pthread guard page on java threads
+#endif

     if (pthread_create(&tid, &attr, (void *(*)(void*))continuation, (void*)args) == 0) {
       void * tmp;

edit: add missing portion of third workaround diff.

battleblow commented 5 years ago

Thanks Kurt. I really appreciate the in depth analysis here. The JEP reference was also very helpful in explaining the significance of the reserved area in the JVM stack and what it is used for.

I have been a bit bogged down with merges today, but tomorrow I'll create a PR and reference this thread. I'll also try and ping some kernel developers directly. I just need to sit down and read this through again to make sure I'm clear. The analysis and test program is a big help though on that front.

bsdkurt commented 5 years ago

Your welcome. Sounds good.

Another example of how security.bsd.stack_guard_page causes problems for pthreads would be creating a thread with pthread_attr_setstacksize(&attr, PTHREAD_STACK_MIN); and setting security.bsd.stack_guard_page to 2 or higher. In this case the pthread_create() call fails.

battleblow commented 5 years ago

I believe that setting security.bsd.stack_guard_page to 1 by default was done in response to

https://blog.qualys.com/securitylabs/2017/06/19/the-stack-clash

Here is the commit

https://svnweb.freebsd.org/base?view=revision&revision=320317

battleblow commented 5 years ago

Created https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239894 to start a discussion about this. I may also ping emaste and/or kib directly

bsdkurt commented 5 years ago

Thank you Greg,

Updated test programs:

1) Shows pthread_attr_setguardsize(3) is not currently working unless the size matches the security.bsd.stack_guard_page size. Try setting security.bsd.stack_guard_page to 1, 4, 5 and run program.

pthread_attr_setguardsize_test.c

Note: This is an updated version of the test previously posted here that removes an extraneous comment and adds __attribute__ ((optnone)) to check_stack function so the program works as intended regardless of opt level it was compiled at.

2) Shows pthread_create(3) fails if security.bsd.stack_guard_page size is greater then pthread_attr_setstacksize(3). Try setting Try setting security.bsd.stack_guard_page to 3 or higher and run program.

pthread_smallstack_test.c

battleblow commented 5 years ago

I'm going to resolve this issue. I think we've done the best we can from a Java perspective on handling this issue. We have changes in upstream FreeBSD and corresponding changes in the Java code that will prevent the problem in future releases, we have a known sysctl workaround for existing releases that allow them to behave correctly, and a workaround in the Java code that is a best effort for those who choose not to use the sysctl workaround.

battleblow / openjdk-jdk11u

Infinite recursion causes a crash on FreeBSD #51