Open ggoodman opened 8 months ago
The term LTO normally refers to a specific thing which -flto
which means compilation/optimization at link time.
I believe what you are talking about here is normal linker DCE (dead code elimination) that we support via --gc-sections
(which is on by default). There are some limitations in the current DCE that the linker does and potenially some ways we could improve it. I believe this kind of problem can sometimes stem from the fact that link DCE is performed in two phases.
Note: (2) only happens if --gc-sections
is enabled.
One limitation that we currently have is that (1) is performed first. This is when we pull any needed object files from archives. Once we decide we need and object file its not possible to fully undo this decision, if we find out later (in 2) that we didn't need it after all.
@sunfishcode has tried to work around this limitation in some cases: https://reviews.llvm.org/D85062
Thanks for taking a look. There's a good chance I'm using the wrong terms everywhere! I'm not familiar with this whole toolchain at all. I used lto only because that was the term in the Cargo.toml
I was using to opt into this behaviour.
I think the reference (or at least one of them) that is preventing the fd_*
host bindings from being stripped away is in __stdio_exit
, called by __wasm_call_dtors
.
(func $__wasm_call_dtors (;630;) (type 72)
call $#func629<dummy>
call $__stdio_exit
)
(func $_initialize.command_export (;631;) (type 31) (result i32)
call $_initialize
call $__wasm_call_dtors
)
I think I can use a tool like walrus
to perform strategic mutations to cut out this cruft and then use its gc to get what I want.
Having spent some time looking the .wat
generated for this project and others, I notice that the tables become opaque barriers to understanding the call graph. I'm wondering if the way things are encoded into the big (elem ...)
table is making static analysis very hard. Are there some incantations that deprioritize this way of encoding functions? I don't think there should be dynamic dispatch type scenarios here but--again--I'm very new to this and have a super loose understanding of the moving parts. I'm hoping that with a less optimized binary, I could perform some surgery on it and then later run wasm-opt
against something with better visibility into control flow and reachability.
Thanks for taking a look. There's a good chance I'm using the wrong terms everywhere! I'm not familiar with this whole toolchain at all. I used lto only because that was the term in the
Cargo.toml
I was using to opt into this behaviour.
The --gc-sections
behaviour is the default and should not require any opt in. If you are using the term lto
in your build system I would guess (?) that is referring to the -flto
compiler/link flag which I think is probably not related to this issue (unless I'm misunderstanding).
I would hope that the kind of DCE you are trying achieve here would work with or without LTO enabled.
I think the reference (or at least one of them) that is preventing the
fd_*
host bindings from being stripped away is in__stdio_exit
, called by__wasm_call_dtors
.(func $__wasm_call_dtors (;630;) (type 72) call $#func629<dummy> call $__stdio_exit ) (func $_initialize.command_export (;631;) (type 31) (result i32) call $_initialize call $__wasm_call_dtors )
I think I can use a tool like
walrus
to perform strategic mutations to cut out this cruft and then use its gc to get what I want.Having spent some time looking the
.wat
generated for this project and others, I notice that the tables become opaque barriers to understanding the call graph. I'm wondering if the way things are encoded into the big(elem ...)
table is making static analysis very hard.
Indeed any time you have function pointer floating around (such as vtables, or in this case I believe musl using a kind of vtable type thing for std file handles) then DCE becomes hard since the linker cannot easily reason about the dependency graph. This is one reason why de-virtualization in C++ is useful when it works.
Are there some incantations that deprioritize this way of encoding functions? I don't think there should be dynamic dispatch type scenarios here but--again--I'm very new to this and have a super loose understanding of the moving parts. I'm hoping that with a less optimized binary, I could perform some surgery on it and then later run
wasm-opt
against something with better visibility into control flow and reachability.
I think one of the problem is that stdio in musl is using dyanmic dispatch. See https://github.com/bminor/musl/blob/7ada6dde6f9dc6a2836c3d92c2f762d35fd229e0/src/internal/stdio_impl.h#L28-L30
The dispatch tables then include pointer to the function you want to avoid including: https://github.com/bminor/musl/blob/7ada6dde6f9dc6a2836c3d92c2f762d35fd229e0/src/stdio/stdin.c#L11-L13
For __stdio_exit
it looks like there is a dummy version in exit.c
and the real version should only be included if __stdio_exit_needed
is referenced.
The problem is that I imagine what is happening is that __stdio_exit_needed
is found to be needed during the first round of symbol resolution in the linker, is that right?
@sbc100 as I noted, I'm a bit over my head here but that matches my intuition. Here's what I think is happening, again from intuition only:
I've got a dummy c library that has a strong reference to printf
and when it is compiled, those strong references cause the *_needed
to statefully toggle. Now my .a
file for the library bakes in that state.
Even though the consuming rust project only consumes a symbol from the library that has no references, they are statefully baked in and obfuscated through that dynamic dispatch mechanism.
I wonder if there's some trick that might allow us to avoid the virtualization of these FILE
methods in a way that still allows for dead-code elimination at linking time. My naive take on this would be that the FILE
methods point to real functions that do the branching in their function bodies. The branching in the function body would need to be done in a way that compilers / linkers are able to statically analyze it out.
Does that sound plausible?
If you can find a way to do what you describe that would be great. I am not aware of any way to achieve that though.
I am not sure whether what I have found is relevant for you, as it is for C++ and not Rust, but I was struggling with something similar (undesired fd_*
imports) and I managed to remove them. The fix is simple but it required me to rebuild the SDK.
I haven't dug into productising this solution for various configurations and it's possibly not an ideal default, but I'd argue it should be taken under consideration, as as of now there is no way to natively run stdlib containers in simple bare metal applications in the browser without relying on a WASI runtime or recompiling the whole SDK (which takes a while as you surely are aware of).
I have pushed a PR with more details here: #418
I saw that WebAssembly/wasi-libc#505 landed and wonder if that might be a tool to help address the issues brought up here. If that's the case, is a custom build of the sdk required to take advantage of it?
In this repo, I've shown two scenarios where code that is functionally identical produces very different WASM binaries.
Setup
The two scenarios I'm testing are for a rust
cdylib
whose code looks like this:Importantly, the
print
function is guarded by theprint
feature that is off by default.The c code we're binding looks like this:
If we run the
cargo build
withNO_PRINTF=1
, then the c code instead looks like this:Scenario 1:
WASM binary size: 2.4K
Scenario 2:
NO_PRINTF=1
WASM binary size: 459B
Conclusion
This might be totally expected given the rust compilation pipeline and the way
wasi-libc
is authored and structured, but is quite surprising to me.I was expecting that link-time optimizations would be 'smart' enough to notice that
printf
was not referenced by any exported functions in the rust library and recursively pruned. I've put together the minimal repro in hopes that it helps illustrate the observations in a reproducible way and could be helpful if this behaviour is not as intended.