Stop building MAIN_MODULE with RELOCATABLE (-fPIC)

sbc100 commented 4 years ago

There is no need to the main module itself to be relocatable. We should be able to get some code size and performance wins by making it static.

sbc100 commented 4 years ago

@awtcode @tlively

sbc100 commented 4 years ago

I spend a while looking into this yesterday and hit a bit of a road block.

The goal here is to build the main module without -fPIC at compile time and without -pie link time.

The problem is that as of today when you build without -fPIC the codegen assume that all global addresses will be known at static link time (so it can always use i32.const <RELOC>). Its only when we enable -fPIC that we use the more conservative sequence global.get <GOT.RELOC>. As it turns out what we want in this case non-position-independent executable that still uses the GOT for accessing non-local symbol addresses.

Basically we have conflated to concpets of PIC code and dynamically linked code (GOT usage for external symbols).

I am now leaning towards using GOT relocations conservatively in more cases (not just -fPIC) ... maybe all cases? ... I think this is what llvm internally calls -m relocation-model=dynamic-no-pic. I wonder if we should just make this the default? The downside is that it generates a lot of extra wasm globals in the static linking case, but these end up being const and can be completely eliminated by wasm-opt. One upside is that anyone who build with --allow-undefined will now be able to supply values for all undefined functions and global addresses at runtime. As of today --allow-undefined will turn undefined functions into imports but undefined data into zero addresses.

Another minor positive is that since we can now name globals in the name section it should make it disassembing unoptimized executables nicer.

tlively commented 4 years ago

Interesting! How does native code handle this situation? Can the relocation model be controlled by command line? Perhaps Emscripten could then opt into the more conservative mode rather than making it the LLVM default. Or we could make that conservative mode the LLVM default and folks who want the old behavior would be able to opt out.

kripken commented 4 years ago

Interesting...

How big is the extra overhead here, and how important is it to reduce?

sbc100 commented 4 years ago

Are you talking about the extra overhead introduced by building the MAIN_MODULE as RELOCTABLE today? Or the extra overhead of my proposal to enable -m relocation-model=dynamic-no-pic by default?

I believe all the overhead introduced by -m relocation-model=dynamic-no-pic can effectively be removed by wasm-opt.

The overhead of using RELOCTABLE for the MAIN_MODULE is that the resulting binary needs to import things like memory, __indirect_function_table, __stack_point and __heap_base rather than having the static linker bake them in. The cost of importing this is mostly on the JS side where we need to maintain two different code paths and the JS glue size increases.

kripken commented 4 years ago

I see, thanks @sbc100

That does sound like a strong motivation to do it. I didn't realize there was a non-perf aspect here.

-m relocation-model=dynamic-no-pic sounds reasonable. Two possible concerns:

How bigger is the overhead for people not running wasm-opt? (yes, they should run it, but not everyone does, and also in debug and fast iteration builds you don't want to run it)
Is there an impact to link time or memory used by the wasm-ld? And the same questions for wasm-opt for the case when it reduces the extra overhead.

sbc100 commented 4 years ago

I see, thanks @sbc100

That does sound like a strong motivation to do it. I didn't realize there was a non-perf aspect here.

-m relocation-model=dynamic-no-pic sounds reasonable. Two possible concerns:

How bigger is the overhead for people not running wasm-opt? (yes, they should run it, but not everyone does, and also in debug and fast iteration builds you don't want to run it)

That is a good question. I'd would need to do some measurements but roughly speaking the cost would be:

An extra wasm global for each address taken global data symbol or address taken function symbol.
At each address taking location we would use global.get rather than i32.const

The latter shouldn't effect code size, and since the global is a immutable it should probably not effect the runtime either since presumably any engine is going to treat global.get of a const i32 global that same as i32.const?

Is there an impact to link time or memory used by the wasm-ld? And the same questions for wasm-opt for the case when it reduces the extra overhead.

I can't imagine the impact of wasm-ld performance would be interesting. If wasm-ld is a bottleneck there are probably plenty of lower hanging optimizations to be had there. For wasm-opt I think that you might have a better idea.. I can't imaging it costs much to inline and eliminate the const globals in (1) and (2) above... WDYT?

sbc100 commented 4 years ago

Interesting! How does native code handle this situation?

Good question. I will investigate. I'm guessing if you build a native binary without -fPIC the compiler can just embed a relocation directly into the code section so they can still use the equivalent of i32.const <&FOO> and the dynamic linker can just patch the code section at runtime when the address of FOO is known. In the wasm world we need something like a global to achieve this indirection.

Can the relocation model be controlled by command line?

Yes

Perhaps Emscripten could then opt into the more conservative mode rather than making it the LLVM default. Or we could make that conservative mode the LLVM default and folks who want the old behavior would be able to opt out.

Indeed. I'm thinking that anything we do should probably be behind wasm32-unknown-emscripten.. at least to start with.

kripken commented 4 years ago

presumably any engine is going to treat global.get of a const i32 global that same as i32.const

Well, a baseline compiler would possibly not optimize this. Even an optimizing compiler might not if it optimizes functions in parallel first before looking at global state, but as the globals arrive first, it does seem like they could do this.

In the worst case this would replace a constant with a load from memory. That doesn't seem too bad.

For wasm-opt I think that you might have a better idea.. I can't imaging it costs much to inline and eliminate the const globals in (1) and (2) above... WDYT?

Yes, SimplifyGlobals should do that optimization pretty efficiently already.

fzhangtj commented 4 years ago

Hi, @sbc100 @kripken I think one big downside of building the MAIN_MODULE as RELOCTABLE is the extra instructions introduced to calculate the memory address by adding the memory base as like below:

(i32.add
     (global.get $gimport$495) // $gimport$495 refer to the memory_base import
     (i32.const 2064520)
)

This is not needed for MAIN_MODULE since the memory base is always 0. It increase the code size and slow the execution. The function pointer also has similar problem.

sbc100 commented 4 years ago

I wrote up a short design doc for how to move forward with this: https://docs.google.com/document/d/1viN3qTS5QzeDP7NR0pGg9D5JuuOrQzsORumMswYGpsA/edit?usp=sharing&resourcekey=0-2Rnysxch2EuXNT3cvoD1MA

sbc100 commented 4 years ago

Hi, @sbc100 @kripken I think one big downside of building the MAIN_MODULE as RELOCTABLE is the extra instructions introduced to calculate the memory address by adding the memory base as like below:
(i32.add
     (global.get $gimport$495) // $gimport$495 refer to the memory_base import
     (i32.const 2064520)
)
This is not needed for MAIN_MODULE since the memory base is always 0. It increase the code size and slow the execution. The function pointer also has similar problem.

Indeed, this is of the primary motivators for this change. Thanks for pointing this out explicitly.

tlively commented 3 years ago

@sbc100, under the solution you wrote up, would there be a way to have some symbols directly imported and other symbols go through the GOT? I'm thinking of a situation in which some symbols are dynamically loaded from shared libraries but some are meant to be normal JS imports.

sbc100 commented 3 years ago

The GOT should only be used when symbol address are imported/exported. For first class functions they are imported in the same way as in the static build. This is true already for MAIN_MODULE/SIDE_MODULE builds.

However, if the address of a JS function is required it will indeed be imported as GOT.func.<name> and assigned dynamically. Again this is just how the current ABI works, not changing anything.

sbc100 commented 3 years ago

After more experimenting I'm leaning towards not using any new flags or reocations models and just sticking to compiling with -fPIC but not linking with -pie.

This means that resulting binary will contain accessor that look like the above but against a constant base:

(i32.add
     (global.get $__memory_base) // $__memory_base is an internal immutable global set to 1024
     (i32.const 2064520)
)

I'm assuming that binaryen can take care of the relaxation of all of these to just:

(i32.const 2064520+1024)

Does that seem reasonable? (@kripken?)

Obviously it would better for wasm-ld to be perform this relaxation one day but we don't currently do any linker relaxation in wasm-ld so that would be a much bigger change.

kripken commented 3 years ago

@sbc100 Yes, Binaryen can do such optimizations.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 30 days. Feel free to re-open at any time if this issue is still relevant.

k1nder10 commented 2 years ago

Are you going to work on that? It'd help me a lot

Stuonts commented 2 years ago

@sbc100 It would good to implement it

We have a large project - porting large codebase from native code to WASM some code should be loaded on demand, to be able to use dlopen we should use MAIN_MODULE, wasm_apply_data_relocs gets enormous in main module making dynamic linking completely unusable

(we can port our codebase using porting dynamic libraries on startup but we get performance hit on startup, so it's logical to convert dynamic libraries on startup to static and then we get enormous wasm_apply_data_relocs in this case)

sbc100 commented 2 years ago

@Stuonts can you help me to understand what causes the explosion of data relocations. perhaps you can answer a few questions:

What version emscripten are you seeing this under?
Are you building with MAIN_MODULE=2 rather than 1? (if not can you see if this fixes the issue?)
Can you try building all of the code that lives in the main module with -fvisibility=hidden? That that help?

Maksim-Bozhko commented 2 years ago

Hello @sbc100 , I am Stuonts's collegue.

ver 3.1.15
We cannot use SIDE_MODULE=2 because we export a lot of classes with many methods and EMSCRIPTEN_KEEPALIVE doesn't work on classes, marking every method with EMSCRIPTEN_KEEPALIVE is not really feasible. Also removing symbols marked for export looks like a bug to me. And my understanding is we cannot mix SIDE_MODULE=1 with MAIN_MODULE=2, at least it didn't work when we tried.
We already build all the code with -fvisibility=hidden and we have export macro that adds compiler attribute that sets visibility to default for symbols we want to export. My understanding is dead code elimination can remove anything that is not exported and unused, but constructors of statics and static objects are hard to remove, because their constructors can have observable behavior.

sbc100 commented 2 years ago

Regarding -sMAIN_MODULE=2.. it should work fine with -sSIDE_MODULE=1. Can you try -sMAIN_MODULE=2. It works best when you can pass the side module on the command line.. then the linker known which symbols that side module needs and it will keep those symbols alive.

Maksim-Bozhko commented 2 years ago

Last time we tried it we had some kind of memory corruption issues that only went away when we switched back to MAIN_MODULE=1. I need to try again to give exact details

Maksim-Bozhko commented 2 years ago

Hello @sbc100, so here are my tests for debug build(but with light optimization -O1) SIDE_MODULE=1 and MAIN_MODULE=1: func[2573] size=7828831 <__wasm_apply_data_relocs> SIDE_MODULE=1 and MAIN_MODULE=2: func[2462] size=7391787 <__wasm_apply_data_relocs> So it help a little bit, but we will be over limit very soon when we port some more code, so it is not a viable solution for us unfortunately.

sbc100 commented 2 years ago

I wonder if we can figure out why you have so many data relocations in your program?

I fear that the size of the __wasm_apply_data_relocs being large also mean you have enormous imports section since these data relocations. You can attach the resulting .wasm file maybe so i can take a look? (preferable built with --profiling-funcs)

Maksim-Bozhko commented 2 years ago

@sbc100 I need to make sure I am allowed to attach .wasm file, because legal issues. I can give you sizes of imports/exports maybe or count of exported/imported functions. I can confirm that we have many exported classes, where every method is exported and classes have lots of methods.

Maksim-Bozhko commented 2 years ago

@sbc100 So i run wasm-objdump --section=import --details and it gave me this(for MAIN_MODULE=1 case): Section Details: Import[6798]: So is 6798 imports "big"? I honestly can't don't know what is considered "normal" or "big" for wasm binary. The last index of func is func[2664] The last index of global is global[4130] The export section looks very strange to me, wasm-objdump --section=export--details gives Section Details: Export[30046]

Why does main binary has so many exports, I was assuming it is supposed to export only main function and thats pretty much it. And I certnainly didn't expect main binary to have 4-5 times more exports than imports.

Maksim-Bozhko commented 2 years ago

Hello, @sbc100. I am trying to change __wasm_apply_data_relocs to generate loop so it doesn't grow with relocations size, as you written in your TODO here, because it is a major blocker for our big webassembly port https://codebrowser.dev/llvm/lld/wasm/InputChunks.cpp.html#363 I am trying to create a static array of ints with relocations, but I assume I would have to relocate this array as well before I can read any relocation offsets from it? Also are there any plans to make main module non-relocatable in near future, as it would make changes to __wasm_apply_data_relocs unnecessary?

sbc100 commented 2 years ago

Sadly I don't think its possible to write this as a loop. This is because each of the symbol address is coming in as an imported global, and the global.get instruction (that only way to access that value of a global) takes an immediate which the index of the global to be gotten. So you can write a loop like for i .. n; global.get n since that would require a version of global.get that took an n as a parameter (rather than an immediate in the instruction itself).

Maksim-Bozhko commented 2 years ago

@sbc100 So here is a snippet of what I assume is code to apply one relocation:

(func $__wasm_apply_data_relocs (type 10)
  i32.const 28980160
  global.get $__memory_base
  i32.add
  global.get $__memory_base
  i32.const 8678
  i32.add
  i32.store

I see global.get being used with $__memory_base, which is "internal immutable global set to 1024". But what about i32.const here? Could we for example use local.get and some local variable to read offset for specific relocation and add that to memory_base and then store it? Is it not possible to create array of offsets known at compile time and then go over that array, read value into local variable and have the snippet above in a loop, but have hardcoded constants replaced with values of local var into which we put offset for specific relocation?

sbc100 commented 2 years ago

Why do you say __memory_base is an internal immutable global set to 1024? For MAIN_MODULE and SIZE_MODULE, I'm fairly sure __memory_base is an imported global isn't it?

BTW, if a global really is internal and immutable then binaryen will convert global.get to i32.const and remove the global completely, at least in optimizing builds.

Also, the whole point of relocations is for the case when the binaryen don't know the memory base. If the memory base is known at link time you don't need any relocations.

sbc100 commented 2 years ago

The real problem for turning relocations into a loop is not __memory_base. As you say, that value doesn't vary throughout the loop so can just be stored in a local. The problem is for imported addressed. For example, if you store the address of printf somewhere then the GOT.func.printf global import is the way the dynamic linker tells us the final address of printf... in that case you need to go global.get GOT.function.printf to find the address .. and you cannot do that in a data-driven loop because the GOT.function.printf is an immedate to the global.get instruction.

tlively commented 2 years ago

One thing that came up at the CG meeting last week was the idea of imported data segments. If relocations could be stored in data segments rather than globals, they could be copied into memory and iterated over. That idea is nowhere near close to being reality, though. Maybe this would be a useful follow-on proposal after extended const lands?

Maksim-Bozhko commented 2 years ago

@sbc100 About __memory_base being internal const global, I just copied quote from this discussion midlessly, now I see how that doesn't make any sense, because unknown memory base is why we need relocations as you pointed out. I looked at WAT of various functions some more and what I wanted to try is to have something like this, imagine we have 2 generated arrays with size equal to number of relocations, where we would fill in respective constants that are used by i32.store in snippet I wrote above

static int offsets[] = { ... };
static int values[] = { ... };

Now __wasm_apply_data_relocs would look something like this

(func $__wasm_apply_data_relocs (type 10)
  // first relocate offsets[] and values[] array like it is done right now, by adding constants to memory base
  i32.const 28980160 // lets pretend it is "address" of offsets[] array
  global.get $__memory_base
  i32.add
  global.get $__memory_base
  i32.const 8678 // lets pretend $__memory_base + 8678 is value to write for address after relocation
  i32.add
  i32.store
  // do the same for values[] 
  ...
  // now I assume we can read from offsets[] and values[], generate a loop
  local.get 0 // where we put $__memory_base + offsets[loopCounter]
  local.get 1 // where we put $__memory_base + values[loopCounter]
  i32.store

This function would have constant size of 2 hardcoded relocs plus a loop, but we would have 2 extra int arrays with size equal to relocations count, which I think is well worth it?

sbc100 commented 2 years ago

The problem here is that the values[] in this example is a list of symbol addresses, and symbol address are supplied the program as global imports, named after the symbol. e.g. GOT.func.printf.

There is no way for the dynamic linker to fill in an array for symbol locations.

Maksim-Bozhko commented 2 years ago

But my wasm_apply_data reloc is all like this

i32.const 28980160
global.get $__memory_base
i32.add
global.get $__memory_base
i32.const 8678
i32.add
i32.store

I don't see any GOT.func.printf, it is just a bunch of arbitrary constants that this code seems to know at compile time already https://codebrowser.dev/llvm/lld/wasm/InputChunks.cpp.html#363 Am I missing something?

sbc100 commented 2 years ago

One thing that came up at the CG meeting last week was the idea of imported data segments. If relocations could be stored in data segments rather than globals, they could be copied into memory and iterated over. That idea is nowhere near close to being reality, though. Maybe this would be a useful follow-on proposal after extended const lands?

The problem is that there way the we have defined a symbol in the dynamic linker is a wasm global which stores its address (e.g. GOT.func.printf or GOT.data.stdout).

There is no relocation format that the dynamic linker understands, the relocations are all embedded/hidden in the binary and based off of global imports.

sbc100 commented 2 years ago

But my wasm_apply_data reloc is all like this


i32.const 28980160
global.get $__memory_base
i32.add
global.get $__memory_base
i32.const 8678
i32.add
i32.store
``But my wasm_apply_data reloc is all like this

i32.const 28980160
global.get $__memory_base
i32.add
global.get $__memory_base
i32.const 8678
i32.add
i32.store

I don't see any GOT.func.printf, it is just a bunch of arbitrary constants that this code seems to know at compile time already https://codebrowser.dev/llvm/lld/wasm/InputChunks.cpp.html#363 Am I missing something?`

I don't see any GOT.func.printf, it is just a bunch of arbitrary constants that this code seems to know at compile time already https://codebrowser.dev/llvm/lld/wasm/InputChunks.cpp.html#363 Am I missing something?

I see. Yes, for such internal relocations we could use a loop. I believe this is because the dynamic linker knows that those symbols are defined in the main module. For the side module I think you would see more GOT-based relocations (i.e. hasGOTIndex() would be true in the above code).

Given that I agree we could split the relocations into two type, internal and external. For internal relocations, we could then use a loop rather than repeating the code sequence.

Maksim-Bozhko commented 2 years ago

Ok I lied, sorry, I didn't scroll long enough as my apply_data_reloc is ginormous. I do have GOT.func bits. I was going to suggest to at least apply this loop approach for bits that only need constants and do not need GOT.func, thanks!

Maksim-Bozhko commented 2 years ago

If this is only applicable to main module it is fine, we only have size issues with apply_data_reloc in MAIN_MODULE. Side modules do not have this issue as they are much smaller

emscripten-core / emscripten

Stop building MAIN_MODULE with RELOCATABLE (-fPIC) #12682