cloudius-systems / osv

OSv, a new operating system for the cloud.
osv.io

Java image compiled on Fedora 22 doesn't work #635

Closed nyh closed 9 years ago

nyh commented 9 years ago

When the Java image (either "scripts/build image=java" or "scripts/build image=openjdk8-fedora", or something which uses it, e.g., "scripts/build image=tomcat") is compiled on Fedora 22, the resulting image crashes. For example:

$ scripts/build image=java
$ scripts/run.py -V -e 'java.so -version'
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000200000600060, pid=0, tid=18446603336284278848
#
# JRE version:  (7.0_55-b13) (build )
# Java VM: OpenJDK 64-Bit Server VM (24.51-b03 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [7.0_55-b13+0x4fe050]
#
# Core dump written. Default location: //core or core.0
#
# An error report file with more information is saved as:
# /tmp/jvm-0/hs_error.log
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   http://icedtea.classpath.org/bugzilla
#
Aborted

[backtrace]
0x0000100001bd1ea8 <???+29171368>
0x0000100001d595fe <???+30774782>
0x0000100001bda4f6 <JVM_handle_linux_signal+550>
0x0000000000375fb6 <call_signal_handler+326>
0x000000000037eb7d <???+3664765>
0x00001000019ddedb <???+27123419>
0x0000100001cdf46a <???+30274666>
0x0000100001ce3d14 <???+30293268>
0x00001000019ddb18 <???+27122456>
0x00001000019a15cc <???+26875340>
0x0000100001d0d73e <???+30463806>
0x0000100001a10246 <JNI_CreateJavaVM+102>
0x0000100000c0dab6 <???+12638902>
0x0000100000c0e378 <???+12641144>
0x00000000004e537f <execute_native_thread_routine+31>
0x00000000003d9a46 <thread_main_c+38>
0x000000000037eb75 <???+3664757>
0x19131409020015ff <???+33560063>
0x00000000003d974f <???+4036431>
0xfb89485354415540 <???+1413567808>

Gdb shows the following stack trace:

#4  0x0000000000225c40 in abort () at runtime.cc:96
#5  0x0000100001bd1ea9 in os::abort(bool) ()
#6  0x0000100001d595ff in VMError::report_and_die() ()
#7  0x0000100001bda4f7 in JVM_handle_linux_signal ()
#8  0x0000000000375fb7 in call_signal_handler (frame=0x2000004ff0c0)
    at arch/x64/signal.cc:85
#9  <signal handler called>
#10 0x0000200000600060 in ?? ()
#11 0x000010000199489d in AbstractICache::invalidate_range(unsigned char*, int)
    ()
#12 0x00001000019ddedc in AbstractInterpreterGenerator::generate_all() ()
#13 0x0000100001cdf46b in TemplateInterpreterGenerator::generate_all() ()
#14 0x0000100001ce3d15 in TemplateInterpreter::initialize() ()
#15 0x00001000019ddb19 in interpreter_init() ()
#16 0x00001000019a15cd in init_globals() ()
#17 0x0000100001d0d73f in Threads::create_vm(JavaVMInitArgs*, bool*) ()
#18 0x0000100001a10247 in JNI_CreateJavaVM ()
#19 0x0000100000c0dab7 in java_main (argc=<optimized out>, argv=0x2000003ffdd0)
#20 0x0000100000c0e379 in operator() (argv=<optimized out>, 
    argc=<optimized out>, __closure=<optimized out>) at java/java.cc:247
#21 _M_invoke<0ul, 1ul> (this=<optimized out>)
    at /usr/include/c++/5.1.1/functional:1531
#22 operator() (this=<optimized out>) at /usr/include/c++/5.1.1/functional:1520
#23 std::thread::_Impl<std::_Bind_simple<main(int, char**)::<lambda(int, char**)>(int, char**)> >::_M_run(void) (this=<optimized out>)
    at /usr/include/c++/5.1.1/thread:115
#24 0x00000000004e5380 in execute_native_thread_routine ()
#25 0x000000000042f836 in operator() (__closure=0xffffa00003989f00)
    at libc/pthread.cc:99
#26 std::_Function_handler<void(), pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /usr/include/c++/5.1.1/functional:1871
#27 0x00000000003d9a47 in sched::thread_main_c (t=0xffff800003c29050)
    at arch/x64/arch-switch.hh:164
#28 0x000000000037eb76 in thread_main () at arch/x64/entry.S:113
nyh commented 9 years ago

The crash is identical (and doesn't provide any new information) when OSv is compiled with no optimizations ("mode=debug").

copumpkin commented 9 years ago

I'm getting something very similar with a JVM compiled on NixOS :frowning: @nyh did you ever find out what was wrong here?

copumpkin commented 9 years ago

I may be misinterpreting the stack trace you posted, but as far as I can tell, what matters here is probably frame #10:

#9  <signal handler called>
#10 0x0000200000600060 in ?? ()
#11 0x000010000199489d in AbstractICache::invalidate_range(unsigned char*, int)

which looks like the last thing that happened before the segfault. Staring at what seems like a related version of the OpenJDK code, I'm guessing the failure comes from call_flush_stub(start, round_to(nbytes, ICache::line_size) >> ICache::log2_line_size);, which calls through a function pointer and could be the source of the segfault.

Assuming I'm right, that function pointer should be assigned by this line, which I think takes us to this code:

void ICacheStubGenerator::generate_icache_flush(ICache::flush_icache_stub_t* flush_icache_stub) {
  StubCodeMark mark(this, "ICache", "flush_icache_stub");

  address start = __ pc();
#ifdef AMD64

  const Register addr  = c_rarg0;
  const Register lines = c_rarg1;
  const Register magic = c_rarg2;

  Label flush_line, done;

  __ testl(lines, lines);
  __ jcc(Assembler::zero, done);

  // Force ordering wrt cflush.
  // Other fence and sync instructions won't do the job.
  __ mfence();

  __ bind(flush_line);
  __ clflush(Address(addr, 0));
  __ addptr(addr, ICache::line_size);
  __ decrementl(lines);
  __ jcc(Assembler::notZero, flush_line);

  __ mfence();

  __ bind(done);

#else
  const Address magic(rsp, 3*wordSize);
  __ lock(); __ addl(Address(rsp, 0), 0);
#endif // AMD64
  __ movptr(rax, magic); // Handshake with caller to make sure it happened!
  __ ret(0);

  // Must be set here so StubCodeMark destructor can call the flush stub.
  *flush_icache_stub = (ICache::flush_icache_stub_t)start;
}

And that appears to be doing some dark voodoo :smile: Can anyone see anything obviously wrong with my reasoning? If not, is there anything in the code in generate_icache_flush that might not like different compilation environments?

copumpkin commented 9 years ago

Upon further examination, I think the code above is doing dynamic code generation from pseudo-assembly encoded in C++. As such, it seems like the sort of thing that might normally call mprotect or do weird memory things.

Of course, this might all be based on a false assumption! I'll update if I find anything concrete.

copumpkin commented 9 years ago

Yeah, so I've verified my hunch was correct: the pc=0x0000200000600060 described in the message is the result of the dynamic code generation and the crash happens on this line:

  int r = (*_flush_icache_stub)(start, lines, auto_magic);

since _flush_icache_stub = 0x0000200000600060. I'm still trying to figure out what causes it. In a more conventional OS, it would seem like something is missing a call to mprotect on the memory in question, but I don't know enough about OSv to know if that makes sense here.

copumpkin commented 9 years ago

By calling mprotect on the _flush_icache_stub page(s) with PROT_READ | PROT_EXEC (the call returned successfully, or at least appeared to), I managed to change the pc value of the segfault to 0x0000100001a19e9f. This strongly suggests that the segfault reported in this issue occurs simply because the dynamically generated assembly is not marked executable; when I make it executable, something else goes wrong, so presumably we get past one hump.

I haven't yet started to look at how OpenJDK normally calls mprotect on its allocated memory to get around this sort of thing, so I still have no clue why this would be failing for my NixOS OpenJDK as well as @nyh's Fedora 22 one, but not for the older Fedora OpenJDK. I welcome any ideas! If none appear, I'll keep digging over the next few days.

nyh commented 9 years ago

Hi @copumpkin, very interesting analysis! You can try to modify access_fault() in core/mmu.cc to print a message when the first if() returns true (i.e., is_page_fault_insn(error_code) and !(perm & perm_exec)), and know for sure if the sigsegv came because we tried to execute non-executable code.

The weird thing is that the same Java executable, without recompilation, works when OSv was compiled with an old compiler, but not when it was compiled with a newer compiler. I don't see how this can be explained by a Java bug (like forgetting to mprotect something) - sounds more like an OSv bug somewhere (but I have no idea where).

copumpkin commented 9 years ago

That is weird, you're right. I didn't realize that the compiler we build OSv with had any impact. I misunderstood your original point as "the Fedora 22 OpenJDK doesn't work, but the older OpenJDK does work"!

I'll try downgrading my compiler in my NixOS test case and see if I can replicate that aspect of it, too. Will also try adding a print to mmu.cc regardless. Are there any gotchas I should be aware of when printing from that code? Can I just import iostream and use cout?

nyh commented 9 years ago

Yes, the thing is, the "java" module (and also the "openjdk8-fedora" module) both contain pre-compiled versions of Java, not something which we compile on the current machine, so why does it matter if the compilation machine is Fedora 22, 21, or NixOS?

But now that I am giving this a second thought, there is one potential problem area:

OpenJDK is a C++ program, so it is compiled against libstdc++. The specific executable we have in the "java" module was compiled, if I remember correctly, on Fedora 20, but now we run it on OSv, which includes the build machine's libstdc++ compiled into it - i.e., on my Fedora 22 machine, a much newer libstdc++. It is possible that Java code compiled against the old libstdc++.so simply cannot run against the new libstdc++.so. It would be sad if this is true (libstdc++ is supposed to be backward-compatible), but it might be.

To test if this is the explanation, you can try to build an OSv image not with the pre-packaged version of Java, but rather with the version of Java installed on your machine. I think this command will do this:

scripts/build jdkbase=/usr/lib/jvm image=java

Does the image built with this work for you? (I don't have Fedora 22 at the moment, so I cannot check this myself now)

nyh commented 9 years ago

Unfortunately, "scripts/build jdkbase=/usr/lib/jvm image=java" fails if the given jdkbase is a symbolic link (this should be fixed, but another time). So if this build fails with some symbolic-link error, please replace "/usr/lib/jvm" in the above command with the actual directory it points to:

scripts/build jdkbase=`readlink -f /usr/lib/jvm/java` image=java
copumpkin commented 9 years ago

Yes, the thing is, the "java" module (and also the "openjdk8-fedora" module) both contain pre-compiled versions of Java, not something which we compile on the current machine, so why does it matter if the compilation machine is Fedora 22, 21, or NixOS?

I'm actually not using those built-in modules (possibly part of my problem! but it does seem to match your problem when you are using them)

As a result, I'm already working with a "consistent" system: with Nix, the OpenJDK I'm using is the one from Nix, and I'm compiling OSv in the same environment with the same C++ standard library.

That's actually how I've been able to test my hypotheses about OpenJDK's internals: I just recompiled OpenJDK with some debugging statements merged in, because I don't yet know how to debug a running OSv box or get a core dump out of it :smile:

copumpkin commented 9 years ago

I tried rebuilding OSv (and boost, which otherwise seemed to cause OSv build failures) with gcc 4.8.5 (was using 4.9 before) and still get the same failure :frowning:

Very mysterious, all this :8ball:

nyh commented 9 years ago

I confirmed that the "jdkbase" workaround suggested above does not help (the resulting image still fails on Fedora 22) - so the guess about a libstdc++.so mismatch was wrong. We still don't know what this bug is, but we now know what it is not ;-)

nyh commented 9 years ago

With the following patch, I confirmed your observation that the Java crash indeed happens when it tries to execute a memory page that is read-write but not executable:

--- a/core/mmu.cc
+++ b/core/mmu.cc
@@ -1179,6 +1179,9 @@ bool access_fault(vma& vma, unsigned int error_code)
 {
     auto perm = vma.perm();
     if (mmu::is_page_fault_insn(error_code)) {
+       if (!(perm & perm_exec)) {
+               printf("access_fault(): tried to execute non-executable page.perm = 0x%x\n", perm);
+       }
         return !(perm & perm_exec);
     }

However, like you also noted, if I make every page executable, with the following patch:

--- a/arch/x64/arch-mmu.hh
+++ b/arch/x64/arch-mmu.hh
@@ -136,7 +136,8 @@ pt_element<N> make_pte(phys addr, bool leaf, unsigned perm = perm_rwx,

     pte.set_valid(perm != 0);
     pte.set_writable(perm & perm_write);
-    pte.set_executable(perm & perm_exec);
+    //pte.set_executable(perm & perm_exec);
+    pte.set_executable(true);
     pte.set_dirty(true);
     pte.set_large(large);
     pte.set_addr(addr, large);

I still get a SIGSEGV. This time it's not an attempt to execute a non-executable page; adding prints to vm_fault() I see it is an attempt to write (ec==2) to a null pointer (addr==0), which is of course not possible. This write attempt happens from the same instruction address you saw: ef->get_pc() is 0x200000600060 (as Java also prints). So apparently the generated code in question is now executable, but contains broken code (it tries to write to a null pointer).

I stopped investigating at this point.

copumpkin commented 9 years ago

Very strange! Don't you just love runtime code generation? :smile:

While I have you: is there a guide for interactive debugging in OSv? My "add print statements and recompile OpenJDK" thing works, but is frustratingly slow. Or if that sort of debugging isn't feasible, is there an easy way to get the core dump out of the image?

nyh commented 9 years ago

On Wed, Sep 2, 2015 at 4:22 PM, Daniel Peebles notifications@github.com wrote:

While I have you: is there a guide for interactive debugging in OSv? My "add print statements and recompile OpenJDK" thing works, but is frustratingly slow. Or if that sort of debugging isn't feasible, is there an easy way to get the core dump out of the image?

No core dump of the image, but there's a fairly easy way to connect to a running OSv with gdb. Our wiki document about debugging is a good start on how to do that:

https://github.com/cloudius-systems/osv/wiki/Debugging-OSv

But a lot of times, adding printouts inside OSv and rebuilding the image (which often takes around 10 seconds) is not too frustratingly slow, and is also a useful debugging method.

Nadav Har'El nyh@cloudius-systems.com

copumpkin commented 9 years ago

https://github.com/cloudius-systems/osv/wiki/Debugging-OSv

D'oh, thanks :smile:

But a lot of times, adding printouts inside OSv and rebuilding the image (which often takes around 10 seconds) is not too frustratingly slow, and is also a useful debugging method.

My main concern here is that I'm going to be poking around inside the OpenJDK itself, which takes far longer to rebuild. I'd much rather just make a debug build of OpenJDK and attach a debugger!

copumpkin commented 9 years ago

The other nice thing about gdb is that I can point it at the broken PC and easily disassemble the memory there :smile: if it doesn't match what the code that should be generated in https://github.com/openjdk-mirror/jdk7u-hotspot/blob/master/src/cpu/x86/vm/icache_x86.cpp#L31-L69, then I can go dig into why the runtime assembler is broken.

Make sense?

copumpkin commented 9 years ago

Another update in the saga:

I attached gdb and asked it to disassemble the region in question, and it looks legitimate and like the assembler is producing the expected code.

(gdb) disas 0x0000200000600060,+40
Dump of assembler code from 0x200000600060 to 0x200000600088:
   0x0000200000600060:  test   esi,esi
   0x0000200000600062:  je     0x200000600079
   0x0000200000600068:  mfence 
   0x000020000060006b:  clflush BYTE PTR [rdi]
   0x000020000060006e:  add    rdi,0x40
   0x0000200000600072:  dec    esi
   0x0000200000600074:  jne    0x20000060006b
   0x0000200000600076:  mfence 
   0x0000200000600079:  mov    rax,rdx
   0x000020000060007c:  ret    
   0x000020000060007d:  (bad)
   ...

Given that the code is correct, my only remaining avenue is to figure out why the memory starts with the wrong permissions on it (I assume some part of the runtime code generator normally takes care of mprotect-ing the code to be executable) and hope that whatever causes that to fail is also causing what @nyh is observing when he marks all memory +x.

nyh commented 9 years ago

This is all very mysterious. The only reason I can think of which could cause such symptoms with unchanged Java code is a bug in OSv's TLB-flush code: Java remapped 0x0000200000600060 to a new page (with the correct contents you saw) with new permissions (executable), but somehow the mmap()/mremap()/mprotect()/? call did not flush the TLB (nor use invlpg), so the processor is unaware of the changed page table - and when we later try to execute from 0x0000200000600060 we go through the TLB and find either the permissions of the old page or its old address with bad content.

Our TLB-flush calls were fairly well debugged, but I'll take a look at them again, and try to think whether something in the newer gcc could have broken them or exposed an existing bug...

copumpkin commented 9 years ago

Mysterious indeed! Any suggestions on how to try diagnosing this? For now, I've been studying the OpenJDK code generation pipeline to see how it normally expects to end up with executable memory. Still working through it.

copumpkin commented 9 years ago

My (long and painful) investigation into OpenJDK's memory behavior has revealed that all the memory it uses (or at least the memory we care about) passes through the following call:

int os::Linux::commit_memory_impl(char* addr, size_t size, bool exec) {
  int prot = exec ? PROT_READ|PROT_WRITE|PROT_EXEC : PROT_READ|PROT_WRITE;
  uintptr_t res = (uintptr_t) ::mmap(addr, size, prot,
                                   MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0);
  /// ...
}

And if I log the JVM startup on my "Fedora 22" (NixOS actually) OSv I see that the following calls are made:

commit_memory_impl(0x200000502000, 32768, 0)
commit_memory_impl(0x200000400000, 12288, 0)
commit_memory_impl(0x200000600000, 2555904, 1)
commit_memory_impl(0x200000600000, 40960, 0)

You'll note two things: we ask for 0x200000600000 to be mapped with PROT_EXEC (what I'd expect), and we also ask for 0x200000600000 to be mapped again afterwards, without it.

If I watch my same OpenJDK build run on the host Linux, I see instead:

commit_memory_impl(0x7ffff7ef0000, 12288, 0)
commit_memory_impl(0x7fffe1000000, 2555904, 1)
commit_memory_impl(0x7ffff54c5000, 40960, 0)

so it's using the same chunk sizes, but that last non-executable mapping of 40960 is at a different address instead of at the same one as the executable one.

I see two possibilities from here:

  1. mmap(..., ... | MAP_FIXED, ... | PROT_EXEC, ...) is giving me non-executable memory in the commit_memory_impl(0x200000600000, 2555904, 1) call [less likely, I think]
  2. mmap is behaving fine, and the OpenJDK logic that decides on the memory layout it wants is somehow getting tricked (by an OSv/compiler bug?) into overlapping two distinct memory regions that shouldn't normally overlap [more likely]

Am I missing any options? My current plan is to track down how OpenJDK decides on its memory layout and pursue option #2 above. If that leads nowhere, I'll see if I can minimize the mmap issue that #1 suggests.

copumpkin commented 9 years ago

@nyh I've attempted to minimize the memory-specific weirdness I described in #2 by emulating the OpenJDK memory allocation patterns, but am not getting the same behavior. I'm wondering if you know of something that I'm missing that might be changing the behavior.

All of the below is on Fedora 22-style OSv:

Here's what I see in OpenJDK:

// Some initial calls to mmap are omitted up here, and don't seem interesting. Leaving the interesting ones below.

// Ask for some new memory from the OS
mmap(0, 0xF000000, 0x0, MAP_PRIVATE | MAP_NORESERVE | MAP_ANONYMOUS, -1, 0) => 0x200000600000

// Ask for a chunk of the memory we just got to be RWX (the 0x7 below)
mmap(0x200000600000, 0x270000, 0x7, MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, -1, 0) =>  0x200000600000

// Ask for another chunk of memory from the OS
mmap(0, 3932160, 0x0, MAP_PRIVATE | MAP_NORESERVE | MAP_ANONYMOUS, -1, 0) => 0x200000600000

// ...but wait! That's the same address we got last time! :(

// Mark that memory RW (which undoes the RWX we had above, since the addresses are the same)
mmap(0x200000600000, 40960, 0x3, MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, -1, 0) => 0x200000600000

In an effort to reproduce the patterns of allocation I saw in OpenJDK, I wrote this tiny program:

#include <stdio.h>
#include <sys/mman.h>

int main(int argc, char *argv[]) {
  void *memx = mmap(0, 0x100000, 0x1, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  printf("memx:\t%p\n", memx);

  void *mem0 = mmap(0, 4096, 0x1, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  printf("mem0:\t%p\n", mem0);

  void *mem1 = mmap(0, 4096, 0x3, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  printf("mem1:\t%p\n", mem1);

  void *mem2 = mmap(0, 32768, 0x0, MAP_PRIVATE | MAP_NORESERVE | MAP_ANONYMOUS, -1, 0);
  printf("mem2:\t%p\n", mem2);
  mem2 = mmap(mem2, 12288, 0x3, MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, -1, 0);
  printf("mem2:\t%p\n", mem2);

  memx = mmap(memx, 12288, 0x3, MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, -1, 0);
  printf("memx:\t%p\n", memx);

  printf("----------------------------------\n");

  void *mem3 = mmap(0, 0xF000000, PROT_NONE, MAP_PRIVATE | MAP_NORESERVE | MAP_ANONYMOUS, -1, 0);
  printf("mem3:\t%p\n", mem3);
  mem3 = mmap(mem3, 0x270000, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, -1, 0);
  printf("mem3:\t%p\n", mem3);

  void *mem4 = mmap(0, 0x3C0000, PROT_NONE, MAP_PRIVATE | MAP_NORESERVE | MAP_ANONYMOUS, -1, 0);
  printf("mem4:\t%p\n", mem4);

  return 0;
}

In OSv, this program runs as expected, and produces this output:

mmap(0, 1048576, 0x1, 0x22, -1, 0) = 0x200000400000
memx:   0x200000400000
mmap(0, 4096, 0x1, 0x22, -1, 0) = 0x200000500000
mem0:   0x200000500000
mmap(0, 4096, 0x3, 0x22, -1, 0) = 0x200000501000
mem1:   0x200000501000
mmap(0, 32768, 0x0, 0x4022, -1, 0) = 0x200000502000
mem2:   0x200000502000
mmap(0x200000502000, 12288, 0x3, 0x32, -1, 0) = 0x200000502000
mem2:   0x200000502000
mmap(0x200000400000, 12288, 0x3, 0x32, -1, 0) = 0x200000400000
memx:   0x200000400000
----------------------------------
mmap(0, 251658240, 0x0, 0x4022, -1, 0) = 0x200000600000
mem3:   0x200000600000
mmap(0x200000600000, 2555904, 0x7, 0x32, -1, 0) = 0x200000600000
mem3:   0x200000600000
mmap(0, 3932160, 0x0, 0x4022, -1, 0) = 0x20000f600000
mem4:   0x20000f600000

Note that the last mmap call producing mem4 produces a distinct address from mem3, unlike what we see in OpenJDK. I also made sure to trace calls to munmap and there were none.

So my question for you is: do you have any ideas what else might affect the behavior of the mmap call that my small repro is missing?

copumpkin commented 9 years ago

Another thing I considered that I want to put out there in case I'm missing something:

The OSv implementation of mmap has some special logic for handling JVM heaps. It looks like it short circuits in some cases and calls return_jvm_heap, which I could envision leading to us getting the same address multiple times. However, the special JVM heap handling code appears to be dead to me. It's all contingent on jvm_heap_size not being 0, and the only code to set it to anything but 0 is commented out in java.cc.

I was hoping that it might be some sort of linking weirdness around jvm_heap_size like the stdout symbol issue, but given that the code appears dead, I don't think that's the case.

copumpkin commented 9 years ago

I think I'm getting closer. For reference, my test setup right now consists of two side-by-side VMs, one running Fedora 21 and the other running Fedora 22. Decided to put my NixOS stuff on hold while I could isolate the problem in known environments.

As I suggested in my previous posts (sorry for being so chatty), this seems to be the key difference:

OpenJDK on Fedora21/OSv produces two calls to mmap that deal with the executable region 0x200000600000:

  1. Allocates the region
  2. Uses MAP_FIXED to make a part of it executable

OpenJDK on Fedora22/OSv produces four calls to mmap that deal with the executable region:

  1. Allocates the region
  2. Uses MAP_FIXED to make a part of it executable
  3. Allocates another region but gets the same address back
  4. Uses MAP_FIXED to make the same part of it NX

Some things I'm noticing when poking around the MMU code:

  1. The allocated regions (outside of 0x200000600000) differ slightly from F21 and F22. They seem pretty deterministic within each version though. Might indicate some initialization difference?
  2. With F21, after mmu::map_anon returns 0x200000600000, when I go to make the second allocation, the 0x200000600000 now appears in vma_list, as it should based on common sense and a call to vma_list.insert(*v).
  3. With F22, on the other hand, after mmu::map_anon returns that address, by the time we make the second allocation call, vma_list does not contain 0x200000600000, so we reuse the region and return it again [edit: not true, I was misreading my diagnostics; looking into what's happening instead].

I'm now trying to track down how vma_list loses its 0x200000600000 entry in F22. Fun times!

copumpkin commented 9 years ago

tl;dr: I've minimized a possible source of the bug!!

So I've minimized at least one source of some of the weird stuff I'm seeing that's leaving vma_list in an odd state on F22. Here's the program:

#include <stdio.h>
#include <cstdint>
#include <algorithm>
#include <vector>
#include <boost/intrusive/set.hpp>

namespace bi = boost::intrusive;

class vma {
public:
    vma(uintptr_t start, uintptr_t end) : _start(start), _end(end) {}
    class addr_compare;
    uintptr_t _start;
    uintptr_t _end;
    boost::intrusive::set_member_hook<> _vma_list_hook;
};

class vma::addr_compare {
public:
    bool operator()(const vma& x, const vma& y) const { return x._end <= y._start; }
};

class vma_compare {
public:
    bool operator ()(const vma& a, const vma& b) const { return a._start < b._start; }
};

typedef boost::intrusive::set<vma,
                              bi::compare<vma_compare>,
                              bi::member_hook<vma,
                                              bi::set_member_hook<>,
                                              &vma::_vma_list_hook>,
                              bi::optimize_size<true>
                              > vma_list_base;

int main() {
    vma_list_base vma_list;
    vma_list.insert(*new vma(0x100000a08000, 0x100000a09000));
    vma_list.insert(*new vma(0x100000a09000, 0x100000a0a000));
    vma_list.insert(*new vma(0x200000000000, 0x200000001000));

    vma r(0x100000a08000, 0x100000a0a000);

    auto range = vma_list.equal_range(r, vma::addr_compare());

    printf("range: (%p, %p) -- (%p, %p)\n", (void*)range.first->_start, (void*)range.first->_end, (void*)range.second->_start, (void*)range.second->_end);

    return 0;
}

If I run that in an image compiled on F21, I get the following output:

range: (0x100000a08000, 0x100000a09000) -- (0x200000000000, 0x200000001000)

but on F22, we instead see this:

range: (0x100000a08000, 0x100000a09000) -- (0x100000a09000, 0x100000a0a000)

Note how the range returned is smaller in F22. As far as I've been able to tell, this leads to ismapped returning false when it shouldn't and all sorts of ensuing weirdness. I don't know for sure if this is the cause of all the issues, but it is definitely causing some strangeness.

I wonder if it's a boost version issue causing different behavior in boost's equal_range? Does our comparison function for vma violate one of its preconditions, so that we're hitting undefined behavior? The only differences left, as far as I can tell, are boost (1.55.0-8 to 1.57.0-6) and gcc.

Oddly enough, if I use std::equal_range on a std::vector<vma> or even on the vma_list_base, I get the F21 behavior I observe with boost, so it looks very much like an issue with boost's equal_range.

copumpkin commented 9 years ago

I've reproduced the behavior change in equal_range (boost's intrusive one, not the std one or the generic boost one) on an unrelated Linux system, with an older gcc, when I change from boost 1.55 to boost 1.56 and later. The addr_compare ordering for vma values feels a little weird, but I think in practice it obeys the strict partial order requirements, so I'm inclined to believe that this might be a boost bug. Will poke around some more to see if the vma ordering is in violation of a precondition and is touching undefined behavior.

A short-term fix might be to just replace the MMU usages of addr_compare with code that explicitly does what we expect it to.

nyh commented 9 years ago

@copumpkin - very nice progress. If this turns out to be a Boost bug or non-backward-compatibility, I'd be surprised but not too surprised because this is not the first Boost bug or incompatible changes we've seen while using Boost in OSv :-(

copumpkin commented 9 years ago

@nyh thanks! Quite an adventure of a bug hunt :smile:

Will probably spend a little more time reducing the bug to figure out whether it's Boost or OSv doing something wrong, but meanwhile I've been able to unblock my NixOS builds by just pointing them at boost 1.55!

Either way, after I figure out the specific issue, I'll update here and we can decide what to do to fix OSv on boost 1.56 and up (Fedora 22 etc.)

copumpkin commented 9 years ago

@nyh after poking around at Boost some more, I'm not sure it's worth investing much more time in this. I can't spot any violations of the strict partial order laws or other stated preconditions needed to use equal_range. I'll probably just submit it to their bug tracker and see if anyone has any input.

On the OSv side, it seems easier not to rely on something so subtle and just replace the uses of equal_range with a custom function that does specifically what's needed. It doesn't seem like it would be very hard, but I'm failing to even distill the desired behavior from the odd addr_compare ordering and the equal_range spec.

Is the spec simply to find all vma objects that fall within the supplied addr_range? Must they be contiguous? If so, I think there are other efficient boost::intrusive ways to do that without us having to manually traverse the tree.

I'd be tempted to also remove uses of vma_list.find(..., vma::addr_compare()) in case they also suffer from similar weirdness (if not now, then perhaps in a future update).

nyh commented 9 years ago

I think you've indeed pinpointed the bug:

On Tue, Sep 15, 2015 at 5:34 AM, Daniel Peebles notifications@github.com wrote:

So I've minimized at least one source of some of the weird stuff I'm seeing that's leaving vma_list in an odd state on F22. Here's the program:

If I run that in an image compiled on F21, I get the following output:

range: (0x100000a08000, 0x100000a09000) -- (0x200000000000, 0x200000001000)

This is indeed the expected output: note that the second vma in this printout is not one of the "equal" (i.e., intersecting) ranges, but rather the one just after the last equal one. In other words, equal_range returns two iterators, where the second one points just past the range. This is also what the std::equal_range documentation says (the Boost documentation is really hard to find...).

but on F22, we instead see this:

range: (0x100000a08000, 0x100000a09000) -- (0x100000a09000, 0x100000a0a000)

I can reproduce this too!

Interestingly, if instead of using equal_range(), you use lower_bound() and upper_bound() separately, you get the correct (F21) response, so it seems like a serious bug in equal_range(). I'll file a boost bug report and see what they think about this.

And we should indeed stop using equal_range()... I'm not sure what to suggest though: using upper_bound() and lower_bound() separately might be slightly less efficient, because it needs to search for the right place twice, but I doubt this will be noticeable in practice. I'll take another look at the code using equal_range() and try to think of the best alternative. Gleb, do you have any opinion on this?

nyh commented 9 years ago

I filed a Boost bug report - https://svn.boost.org/trac/boost/ticket/11701 Let's see what they think of this. If it's a real bug (it certainly looks like one!), even if they fix it, we won't be able to count on their fix if we want OSv to be usable on existing systems, so we'll need to work around it by not using equal_range() with a comparator. As I noted in the Boost bug report, lower_bound() and upper_bound() seem to still be correct, so we can easily use those instead of equal_range().

By the way, note that the reason to use lower_bound() and upper_bound(), or both of them via equal_range(), rather than a more straightforward iteration over the elements, is that because the set is actually a tree, these functions can be implemented with logarithmic complexity, rather than the linear complexity that a naive iteration would give you.

avikivity commented 9 years ago

If they fix it, we can get the fix backported to Fedora; they are fairly responsive (for Boost).

We can also commit it to our local repo.

copumpkin commented 9 years ago

Cool, thanks! We might get closure! :smile:

copumpkin commented 9 years ago

Would it make sense to just change vma_list to be an interval_map? It's not intrusive, but otherwise it seems to support what we need. It does support overlapping intervals, which are not what the allocator needs, but we could just not use that functionality. I doubt it'd be a significant performance change, right?

nyh commented 9 years ago

I finally have a patch which fixes Java when OSv is compiled with Boost 1.57 (e.g., Fedora 22). And it still works on Boost 1.55 (Fedora 21). I'll clean up the patch and post it to the mailing list tomorrow.

The patch could have been as simple as changing the use of equal_range() to lower_bound() and upper_bound(), which do work as expected today. However, given the discussion in the Boost bug tracker, we have no guarantee that these will continue to work in the future either, because upper_bound() may change. In fact, anything which uses vma::addr_compare is likely to stop working in the future, because the Boost developers consider this use illegal.

So what I did instead was to write more elaborate code which uses only lower_bound(), on the set without a second non-set-compatible comparator, to find the beginning of the range and the end of the range separately. The code is kind of nasty, and a little bit slower, but remains logarithmic (O(log N)) in the number of VMAs, and best of all, works on Boost 1.57.

As I said, I'll post the patch tomorrow.

nyh commented 9 years ago

@copumpkin, I just sent a patch (a series of 2, actually) to the osv-dev mailing list to fix this bug. Can you please test it and confirm it also fixes the bug that you saw? Thanks.

copumpkin commented 9 years ago

@nyh that's great news, thanks! I won't be able to test it until Saturday or Sunday, but I look forward to it. Just by eyeballing, your patch makes sense to me and seems like it would fix it. Is there a way to unit test it sensibly?

nyh commented 9 years ago


It should be possible to come up with a test in tst-mmap.cc which does something like mmapping several consecutive regions (with different permissions so they can't be joined?), then munmapping the range containing all of them, and then trying to munmap the individual regions again, expecting all of them to fail with EINVAL. Or something like that... I'm not sure whether this bug fails predictably, so I don't know how easy it will be to reproduce. Want to try? I think you came closest to seeing this bug occur in actual VMAs.

Thanks, Nadav.