derekbruening opened 5 years ago
A complication is that -vm_size 2G means it's impossible to have a statically linked client be reachable from the code cache, unless we really want to put the app executable itself inside the region, which would cause numerous issues with checks for DR vs app addresses. In the docs we don't really guarantee that a static client is reachable, so maybe we should just explicitly state that it is not reachable (and auto-disable -reachable_client for a static client, assuming we can detect it)?
A related complication is that a 2G region makes it impossible to satisfy -vm_base_near_app, which means we would have to mangle all of the app's rip-rel instructions. We never measured the performance difference of -vm_base_near_app, so we don't have any historical data to say how bad this would be.
We could still place the region near the app and try to get some of the rip-rel references to reach, but we prefer not to place it after the app, which interferes with the brk (at least for non-PIE), and if we place it before the app, our first-used low addresses are not going to reach the app's .data.
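For context on what "reach" means here: an x86-64 rip-relative operand encodes a signed 32-bit displacement from the end of the instruction, so a reference can be left unmangled only if its target lies within roughly ±2G of the next instruction's address. A minimal sketch of that check (the helper name is mine, not a DR API):

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `target` is encodable as a rip-relative reference from an
 * instruction ending at `instr_end` (rip points at the next instruction).
 * Hypothetical helper for illustration; not a DynamoRIO API.
 */
static bool
rel32_reachable(uint64_t instr_end, uint64_t target)
{
    int64_t disp = (int64_t)(target - instr_end);
    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```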
I added some better statistics and did some measurements. On some apps such as SPEC2006, there are very few rip-rel instructions in the app itself (under 100); libc has more (~600). Those are static counts.
On some larger proprietary apps, there are more: 20K static, accounting for ~2.5% of dynamic memory references.
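For reference, a rough sketch of how such static counts can be gathered with DR's standalone decoder; the [start, end) code bounds are assumed to come from parsing the binary, and error handling is elided:

```c
#include "dr_api.h"

/* Count statically encoded rip-relative references in [start, end).
 * Sketch only: assumes the caller has mapped the code and found its bounds.
 */
static uint64
count_rip_rel(byte *start, byte *end)
{
    void *dc = dr_standalone_init();
    instr_t instr;
    uint64 count = 0;
    instr_init(dc, &instr);
    for (byte *pc = start; pc < end;) {
        instr_reset(dc, &instr);
        byte *next = decode(dc, pc, &instr);
        if (next == NULL) {
            pc++; /* undecodable byte: skip forward */
            continue;
        }
        if (instr_has_rel_addr_reference(&instr))
            count++;
        pc = next;
    }
    instr_free(dc, &instr);
    dr_standalone_exit();
    return count;
}
```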
On a synthetic benchmark where I made fully 50% of dynamic memory references rip-rel (about half loads and half stores), I measured substantial overhead differences with respect to mangling:
Plain DR: 24% slowdown for being far away and having to mangle all the rip-rels.
Memtrace with no i/o: still 10% slowdown! This is surprising since just the memtrace instrumentation is a 13x slowdown on SPEC with no i/o.
Memtrace with i/o: still 3% slowdown! This one is even more surprising and should be examined further.
More analysis on actual overhead on real apps with significant rip-rel percentages would be ideal, but just based on these preliminary results my conclusion is that -vm_base_near_app is important. My proposal is to back off the 2G default and make it 1G. With the 512M ASLR, that will allow for -vm_base_near_app for an app binary <512M.
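The arithmetic behind the <512M figure, under the assumption that the worst case is a rip-rel reference spanning from the far end of the vmcode region, across the ASLR gap, to the far end of the app image:

```c
/* All of this span must fit within the signed 32-bit (2G) rip-rel reach. */
#define RIP_REL_REACH (2ULL * 1024 * 1024 * 1024) /* ±2G displacement limit */
#define VMCODE_SIZE   (1ULL * 1024 * 1024 * 1024) /* proposed 1G default */
#define ASLR_RANGE    (512ULL * 1024 * 1024)      /* 512M randomization */

/* Largest app image that is still fully reachable: 2G - 1G - 512M = 512M. */
#define MAX_APP_SIZE (RIP_REL_REACH - VMCODE_SIZE - ASLR_RANGE)
```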
I plan to keep the functionality of loading the client inside the VMM, and to accept the loss of reachability guarantees for static clients.
For W^X #3556 the plan is to give the user a choice: simply fail if the 1G limit is reached (unlikely, given that vmcode is now split from vmheap), or give up -vm_base_near_app and set 2G up front. This seems a reasonable compromise.
I did raise the vmheap size (to 2G) by default as there is little downside to doing so.
Summarizing why this issue is still open: primarily for Windows support for loading client libs inside the vmm, which is required for large vmcode sizes such as 2G.
This is part of a series of 64-bit scalability improvements: xref the original reachability issue #774, splitting vmcode from vmheap #1132, and W^X #3556 which wants this 2G by default as a pre-req.
The idea is that, now that we have put all of our reachability-guaranteed code into its own region, we may as well reserve the maximum 2G for that region at init time, so we never run out of space, spill over onto individual OS allocations, and run into complications like #2115.
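On UNIX the reserve-up-front part is straightforward; a minimal sketch (names and constant placement are mine), reserving inaccessible address space at init and committing pieces on demand:

```c
#include <stddef.h>
#include <sys/mman.h>

#define VMCODE_SIZE (2ULL << 30) /* reserve the full 2G at init time */

/* Reserve the whole reachable region without committing any memory. */
static void *
reserve_vmcode(void)
{
    void *base = mmap(NULL, VMCODE_SIZE, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return base == MAP_FAILED ? NULL : base;
}

/* Commit a piece of the reservation for code-cache or reachable-heap use. */
static int
commit_piece(void *piece, size_t size)
{
    return mprotect(piece, size, PROT_READ | PROT_WRITE);
}
```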
The complication is that we need client libraries (including extension libraries, but not third-party dependencies loaded by our private loader that do not interact directly with DR) to be inside that same 2G region. So we need to coordinate between DR's VMM and the file mapping done by the private loader. In particular, the VMM wants to reserve 2G up front and then hand a piece of it to the loader to map a file. While this is feasible on UNIX where we can MAP_FIXED on top of an existing mmap, or munmap just a piece of a prior large mmap (though that route has a race), it is not possible on Windows to map a file on top of an existing address space reservation. Nor is it possible to un-reserve just a piece of a reservation: the entire thing must be un-reserved.
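For the UNIX side, the MAP_FIXED route looks roughly like this (alignment and error handling elided). There is no direct Windows equivalent: NtMapViewOfSection cannot target already-reserved address space, and a reservation can only be freed in its entirety.

```c
#include <sys/types.h>
#include <sys/mman.h>

/* Map a client library's file segment on top of a slice of the existing
 * PROT_NONE reservation. MAP_FIXED atomically replaces the reserved pages,
 * so there is no window where another thread's mmap can steal the range
 * (unlike munmap-then-mmap, which has that race).
 */
static void *
map_client_piece(void *slice_in_reservation, size_t map_size, int fd,
                 off_t file_offset)
{
    void *res = mmap(slice_in_reservation, map_size, PROT_READ | PROT_EXEC,
                     MAP_PRIVATE | MAP_FIXED, fd, file_offset);
    return res == MAP_FAILED ? NULL : res;
}
```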
Here are some possible solutions for Windows:
A) Pick vmcode range but don't reserve it, then load client libs + extensions in reachable spots, then separately reserve all the pieces in the vmcode range not occupied by libs? Would need a list of them to free at exit b/c have to use NtFreeVirtualMemory for the anon and NtUnmapViewOfSection for the files. And this would not work with post-init dr_map_file() with DR_MAP_CACHE_REACHABLE.
B) Long-term, abandon client reachability guarantees and DR_MAP_CACHE_REACHABLE?
C) Have client libs all in single space at top of 2G region? Might pay in non-preferred-base but this is 64-bit so even Windows libs should be partly PIC w/ fewer relocations. Ideally still load client libs first so don't need a max size. Pack them in from the top, then reserve the rest. Simply stop supporting dr_map_file() with DR_MAP_CACHE_REACHABLE??
D) Change Windows to do a separate reservation every, say, 640K? Then we have the flexibility to make holes in the vmm region. Would a separate call per chunk cost anything in kernel resources, or just extra time for the alloc syscalls (covering 2G takes ~32K calls at 64K granularity, or ~3.3K calls at 640K)?
E) Combine C and D: only support client libs + DR_MAP_CACHE_REACHABLE in the top 64M of the vmm region. Do one big reservation below that, and for the 64M do 100 640K reservations, so we support 100 medium-or-small files. In the 7.1.0 release, all extensions are <640K; dbghelp is 1.7M; debug drmemorylib is 7.2M (release 2.1M), so it would need 12 of the 100 slots? (A sketch of this chunked scheme follows the list below.)
F) Abandon the 2G vmcode size on Windows: keep it at 512M by default (or 1G, after updating all preferred bases) and keep the current search-from-OS code.
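To make D/E concrete, here is a rough Windows sketch of the chunked scheme from E (chunk count and size are taken from E; the helper names and the use of Win32 wrappers are mine for illustration, DR's private loader would use the Nt* syscalls directly, and the sub-region base is assumed to be allocation-granularity aligned):

```c
#include <windows.h>

#define CHUNK_SIZE (640 * 1024) /* per-chunk reservation size from option E */
#define NUM_CHUNKS 100          /* ~64M set aside at the top of the vmm region */

/* Reserve the client-lib sub-region as individually freeable chunks. */
static void
reserve_chunks(BYTE *subregion_base, LPVOID chunks[NUM_CHUNKS])
{
    for (int i = 0; i < NUM_CHUNKS; i++) {
        chunks[i] = VirtualAlloc(subregion_base + (SIZE_T)i * CHUNK_SIZE,
                                 CHUNK_SIZE, MEM_RESERVE, PAGE_NOACCESS);
    }
}

/* Release the chunks a library needs and map the file into the hole.
 * Each chunk is its own reservation, so it can be released on its own;
 * a single 64M reservation could not be partially released. The gap
 * between release and map has the same small race noted for munmap+mmap.
 */
static LPVOID
map_lib_into_hole(LPVOID chunks[NUM_CHUNKS], int first, int count,
                  HANDLE file_mapping)
{
    for (int i = first; i < first + count; i++)
        VirtualFree(chunks[i], 0, MEM_RELEASE);
    return MapViewOfFileEx(file_mapping, FILE_MAP_READ | FILE_MAP_EXECUTE, 0, 0,
                           0 /* map entire section */, chunks[first]);
}
```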
I'm going with F as the short-term solution: i.e., only implement the 2G vmcode size by default for Linux, probably using MAP_FIXED, and leave Windows support as future work, even if that future work changes how Linux works as well.