Open sbc100 opened 4 years ago
@awtcode @tlively
I spend a while looking into this yesterday and hit a bit of a road block.
The goal here is to build the main module without -fPIC
at compile time and without -pie
link time.
The problem is that as of today when you build without -fPIC
the codegen assume that all global addresses will be known at static link time (so it can always use i32.const <RELOC>
). Its only when we enable -fPIC
that we use the more conservative sequence global.get <GOT.RELOC>
. As it turns out what we want in this case non-position-independent executable that still uses the GOT for accessing non-local symbol addresses.
Basically we have conflated to concpets of PIC code and dynamically linked code (GOT usage for external symbols).
I am now leaning towards using GOT relocations conservatively in more cases (not just -fPIC
) ... maybe all cases? ... I think this is what llvm internally calls -m relocation-model=dynamic-no-pic
. I wonder if we should just make this the default? The downside is that it generates a lot of extra wasm globals in the static linking case, but these end up being const and can be completely eliminated by wasm-opt. One upside is that anyone who build with --allow-undefined
will now be able to supply values for all undefined functions and global addresses at runtime. As of today --allow-undefined
will turn undefined functions into imports but undefined data into zero addresses.
Another minor positive is that since we can now name globals in the name section it should make it disassembing unoptimized executables nicer.
Interesting! How does native code handle this situation? Can the relocation model be controlled by command line? Perhaps Emscripten could then opt into the more conservative mode rather than making it the LLVM default. Or we could make that conservative mode the LLVM default and folks who want the old behavior would be able to opt out.
Interesting...
How big is the extra overhead here, and how important is it to reduce?
Are you talking about the extra overhead introduced by building the MAIN_MODULE as RELOCTABLE today? Or the extra overhead of my proposal to enable -m relocation-model=dynamic-no-pic
by default?
I believe all the overhead introduced by -m relocation-model=dynamic-no-pic
can effectively be removed by wasm-opt
.
The overhead of using RELOCTABLE
for the MAIN_MODULE
is that the resulting binary needs to import things like memory
, __indirect_function_table
, __stack_point
and __heap_base
rather than having the static linker bake them in. The cost of importing this is mostly on the JS side where we need to maintain two different code paths and the JS glue size increases.
I see, thanks @sbc100
That does sound like a strong motivation to do it. I didn't realize there was a non-perf aspect here.
-m relocation-model=dynamic-no-pic
sounds reasonable. Two possible concerns:
wasm-opt
? (yes, they should run it, but not everyone does, and also in debug and fast iteration builds you don't want to run it)wasm-ld
? And the same questions for wasm-opt
for the case when it reduces the extra overhead.I see, thanks @sbc100
That does sound like a strong motivation to do it. I didn't realize there was a non-perf aspect here.
-m relocation-model=dynamic-no-pic
sounds reasonable. Two possible concerns:
- How bigger is the overhead for people not running
wasm-opt
? (yes, they should run it, but not everyone does, and also in debug and fast iteration builds you don't want to run it)
That is a good question. I'd would need to do some measurements but roughly speaking the cost would be:
global.get
rather than i32.const
The latter shouldn't effect code size, and since the global is a immutable it should probably not effect the runtime either since presumably any engine is going to treat global.get
of a const i32 global that same as i32.const
?
- Is there an impact to link time or memory used by the
wasm-ld
? And the same questions forwasm-opt
for the case when it reduces the extra overhead.
I can't imagine the impact of wasm-ld
performance would be interesting. If wasm-ld is a bottleneck there are probably plenty of lower hanging optimizations to be had there. For wasm-opt
I think that you might have a better idea.. I can't imaging it costs much to inline and eliminate the const globals in (1) and (2) above... WDYT?
Interesting! How does native code handle this situation?
Good question. I will investigate. I'm guessing if you build a native binary without -fPIC
the compiler can just embed a relocation directly into the code section so they can still use the equivalent of i32.const <&FOO>
and the dynamic linker can just patch the code section at runtime when the address of FOO is known. In the wasm world we need something like a global to achieve this indirection.
Can the relocation model be controlled by command line?
Yes
Perhaps Emscripten could then opt into the more conservative mode rather than making it the LLVM default. Or we could make that conservative mode the LLVM default and folks who want the old behavior would be able to opt out.
Indeed. I'm thinking that anything we do should probably be behind wasm32-unknown-emscripten.. at least to start with.
presumably any engine is going to treat global.get of a const i32 global that same as i32.const
Well, a baseline compiler would possibly not optimize this. Even an optimizing compiler might not if it optimizes functions in parallel first before looking at global state, but as the globals arrive first, it does seem like they could do this.
In the worst case this would replace a constant with a load from memory. That doesn't seem too bad.
For wasm-opt I think that you might have a better idea.. I can't imaging it costs much to inline and eliminate the const globals in (1) and (2) above... WDYT?
Yes, SimplifyGlobals
should do that optimization pretty efficiently already.
Hi, @sbc100 @kripken I think one big downside of building the MAIN_MODULE as RELOCTABLE is the extra instructions introduced to calculate the memory address by adding the memory base as like below:
(i32.add
(global.get $gimport$495) // $gimport$495 refer to the memory_base import
(i32.const 2064520)
)
This is not needed for MAIN_MODULE since the memory base is always 0. It increase the code size and slow the execution. The function pointer also has similar problem.
I wrote up a short design doc for how to move forward with this: https://docs.google.com/document/d/1viN3qTS5QzeDP7NR0pGg9D5JuuOrQzsORumMswYGpsA/edit?usp=sharing&resourcekey=0-2Rnysxch2EuXNT3cvoD1MA
Hi, @sbc100 @kripken I think one big downside of building the MAIN_MODULE as RELOCTABLE is the extra instructions introduced to calculate the memory address by adding the memory base as like below:
(i32.add (global.get $gimport$495) // $gimport$495 refer to the memory_base import (i32.const 2064520) )
This is not needed for MAIN_MODULE since the memory base is always 0. It increase the code size and slow the execution. The function pointer also has similar problem.
Indeed, this is of the primary motivators for this change. Thanks for pointing this out explicitly.
@sbc100, under the solution you wrote up, would there be a way to have some symbols directly imported and other symbols go through the GOT? I'm thinking of a situation in which some symbols are dynamically loaded from shared libraries but some are meant to be normal JS imports.
The GOT should only be used when symbol address are imported/exported. For first class functions they are imported in the same way as in the static build. This is true already for MAIN_MODULE/SIDE_MODULE builds.
However, if the address of a JS function is required it will indeed be imported as GOT.func.<name>
and assigned dynamically. Again this is just how the current ABI works, not changing anything.
After more experimenting I'm leaning towards not using any new flags or reocations models and just sticking to compiling with -fPIC
but not linking with -pie
.
This means that resulting binary will contain accessor that look like the above but against a constant base:
(i32.add
(global.get $__memory_base) // $__memory_base is an internal immutable global set to 1024
(i32.const 2064520)
)
I'm assuming that binaryen can take care of the relaxation of all of these to just:
(i32.const 2064520+1024)
Does that seem reasonable? (@kripken?)
Obviously it would better for wasm-ld to be perform this relaxation one day but we don't currently do any linker relaxation in wasm-ld so that would be a much bigger change.
@sbc100 Yes, Binaryen can do such optimizations.
This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 30 days. Feel free to re-open at any time if this issue is still relevant.
Are you going to work on that? It'd help me a lot
@sbc100 It would good to implement it
We have a large project - porting large codebase from native code to WASM some code should be loaded on demand, to be able to use dlopen we should use MAIN_MODULE, wasm_apply_data_relocs gets enormous in main module making dynamic linking completely unusable
(we can port our codebase using porting dynamic libraries on startup but we get performance hit on startup, so it's logical to convert dynamic libraries on startup to static and then we get enormous wasm_apply_data_relocs in this case)
@Stuonts can you help me to understand what causes the explosion of data relocations. perhaps you can answer a few questions:
-fvisibility=hidden
? That that help?Hello @sbc100 , I am Stuonts's collegue.
Regarding -sMAIN_MODULE=2
.. it should work fine with -sSIDE_MODULE=1
. Can you try -sMAIN_MODULE=2
. It works best when you can pass the side module on the command line.. then the linker known which symbols that side module needs and it will keep those symbols alive.
Last time we tried it we had some kind of memory corruption issues that only went away when we switched back to MAIN_MODULE=1. I need to try again to give exact details
Hello @sbc100, so here are my tests for debug build(but with light optimization -O1) SIDE_MODULE=1 and MAIN_MODULE=1: func[2573] size=7828831 <__wasm_apply_data_relocs> SIDE_MODULE=1 and MAIN_MODULE=2: func[2462] size=7391787 <__wasm_apply_data_relocs> So it help a little bit, but we will be over limit very soon when we port some more code, so it is not a viable solution for us unfortunately.
I wonder if we can figure out why you have so many data relocations in your program?
I fear that the size of the __wasm_apply_data_relocs
being large also mean you have enormous imports section since these data relocations. You can attach the resulting .wasm file maybe so i can take a look? (preferable built with --profiling-funcs
)
@sbc100 I need to make sure I am allowed to attach .wasm file, because legal issues. I can give you sizes of imports/exports maybe or count of exported/imported functions. I can confirm that we have many exported classes, where every method is exported and classes have lots of methods.
@sbc100 So i run wasm-objdump --section=import --details
and it gave me this(for MAIN_MODULE=1 case):
Section Details:
Import[6798]:
So is 6798 imports "big"? I honestly can't don't know what is considered "normal" or "big" for wasm binary.
The last index of func is func[2664]
The last index of global is global[4130]
The export section looks very strange to me, wasm-objdump --section=export--details
gives
Section Details:
Export[30046]
Why does main binary has so many exports, I was assuming it is supposed to export only main function and thats pretty much it. And I certnainly didn't expect main binary to have 4-5 times more exports than imports.
Hello, @sbc100. I am trying to change __wasm_apply_data_relocs to generate loop so it doesn't grow with relocations size, as you written in your TODO here, because it is a major blocker for our big webassembly port https://codebrowser.dev/llvm/lld/wasm/InputChunks.cpp.html#363 I am trying to create a static array of ints with relocations, but I assume I would have to relocate this array as well before I can read any relocation offsets from it? Also are there any plans to make main module non-relocatable in near future, as it would make changes to __wasm_apply_data_relocs unnecessary?
Sadly I don't think its possible to write this as a loop. This is because each of the symbol address is coming in as an imported global, and the global.get
instruction (that only way to access that value of a global) takes an immediate which the index of the global to be gotten. So you can write a loop like for i .. n; global.get n
since that would require a version of global.get that took an n
as a parameter (rather than an immediate in the instruction itself).
@sbc100 So here is a snippet of what I assume is code to apply one relocation:
(func $__wasm_apply_data_relocs (type 10)
i32.const 28980160
global.get $__memory_base
i32.add
global.get $__memory_base
i32.const 8678
i32.add
i32.store
I see global.get being used with $__memory_base, which is "internal immutable global set to 1024". But what about i32.const here? Could we for example use local.get and some local variable to read offset for specific relocation and add that to memory_base and then store it? Is it not possible to create array of offsets known at compile time and then go over that array, read value into local variable and have the snippet above in a loop, but have hardcoded constants replaced with values of local var into which we put offset for specific relocation?
Why do you say __memory_base
is an internal immutable global set to 1024? For MAIN_MODULE and SIZE_MODULE, I'm fairly sure __memory_base
is an imported global isn't it?
BTW, if a global really is internal and immutable then binaryen will convert global.get
to i32.const
and remove the global completely, at least in optimizing builds.
Also, the whole point of relocations is for the case when the binaryen don't know the memory base. If the memory base is known at link time you don't need any relocations.
The real problem for turning relocations into a loop is not __memory_base
. As you say, that value doesn't vary throughout the loop so can just be stored in a local. The problem is for imported addressed. For example, if you store the address of printf somewhere then the GOT.func.printf
global import is the way the dynamic linker tells us the final address of printf... in that case you need to go global.get GOT.function.printf
to find the address .. and you cannot do that in a data-driven loop because the GOT.function.printf
is an immedate to the global.get
instruction.
One thing that came up at the CG meeting last week was the idea of imported data segments. If relocations could be stored in data segments rather than globals, they could be copied into memory and iterated over. That idea is nowhere near close to being reality, though. Maybe this would be a useful follow-on proposal after extended const lands?
@sbc100 About __memory_base being internal const global, I just copied quote from this discussion midlessly, now I see how that doesn't make any sense, because unknown memory base is why we need relocations as you pointed out. I looked at WAT of various functions some more and what I wanted to try is to have something like this, imagine we have 2 generated arrays with size equal to number of relocations, where we would fill in respective constants that are used by i32.store in snippet I wrote above
static int offsets[] = { ... };
static int values[] = { ... };
Now __wasm_apply_data_relocs would look something like this
(func $__wasm_apply_data_relocs (type 10)
// first relocate offsets[] and values[] array like it is done right now, by adding constants to memory base
i32.const 28980160 // lets pretend it is "address" of offsets[] array
global.get $__memory_base
i32.add
global.get $__memory_base
i32.const 8678 // lets pretend $__memory_base + 8678 is value to write for address after relocation
i32.add
i32.store
// do the same for values[]
...
// now I assume we can read from offsets[] and values[], generate a loop
local.get 0 // where we put $__memory_base + offsets[loopCounter]
local.get 1 // where we put $__memory_base + values[loopCounter]
i32.store
This function would have constant size of 2 hardcoded relocs plus a loop, but we would have 2 extra int arrays with size equal to relocations count, which I think is well worth it?
The problem here is that the values[]
in this example is a list of symbol addresses, and symbol address are supplied the program as global imports, named after the symbol. e.g. GOT.func.printf
.
There is no way for the dynamic linker to fill in an array for symbol locations.
But my wasm_apply_data reloc is all like this
i32.const 28980160
global.get $__memory_base
i32.add
global.get $__memory_base
i32.const 8678
i32.add
i32.store
I don't see any GOT.func.printf, it is just a bunch of arbitrary constants that this code seems to know at compile time already https://codebrowser.dev/llvm/lld/wasm/InputChunks.cpp.html#363 Am I missing something?
One thing that came up at the CG meeting last week was the idea of imported data segments. If relocations could be stored in data segments rather than globals, they could be copied into memory and iterated over. That idea is nowhere near close to being reality, though. Maybe this would be a useful follow-on proposal after extended const lands?
The problem is that there way the we have defined a symbol in the dynamic linker is a wasm global which stores its address (e.g. GOT.func.printf
or GOT.data.stdout
).
There is no relocation format that the dynamic linker understands, the relocations are all embedded/hidden in the binary and based off of global imports.
But my wasm_apply_data reloc is all like this
i32.const 28980160 global.get $__memory_base i32.add global.get $__memory_base i32.const 8678 i32.add i32.store ``But my wasm_apply_data reloc is all like this
i32.const 28980160
global.get $__memory_base
i32.add
global.get $__memory_base
i32.const 8678
i32.add
i32.store
I don't see any GOT.func.printf, it is just a bunch of arbitrary constants that this code seems to know at compile time already https://codebrowser.dev/llvm/lld/wasm/InputChunks.cpp.html#363 Am I missing something?`
I don't see any GOT.func.printf, it is just a bunch of arbitrary constants that this code seems to know at compile time already https://codebrowser.dev/llvm/lld/wasm/InputChunks.cpp.html#363 Am I missing something?
I see. Yes, for such internal relocations we could use a loop. I believe this is because the dynamic linker knows that those symbols are defined in the main module. For the side module I think you would see more GOT-based relocations (i.e. hasGOTIndex() would be true in the above code).
Given that I agree we could split the relocations into two type, internal and external. For internal relocations, we could then use a loop rather than repeating the code sequence.
Ok I lied, sorry, I didn't scroll long enough as my apply_data_reloc is ginormous. I do have GOT.func bits. I was going to suggest to at least apply this loop approach for bits that only need constants and do not need GOT.func, thanks!
If this is only applicable to main module it is fine, we only have size issues with apply_data_reloc in MAIN_MODULE. Side modules do not have this issue as they are much smaller
There is no need to the main module itself to be relocatable. We should be able to get some code size and performance wins by making it static.