LTO failing to prune WASI fd_* imports

ggoodman commented 8 months ago

In this repo, I've shown two scenarios where code that is functionally identical produces very different WASM binaries.

Setup

The two scenarios I'm testing are for a rust cdylib whose code looks like this:

include!(concat!(env!("OUT_DIR"), "/bindings.rs"));

#[export_name = "fallible_func"]
pub extern "C" fn fallible_func(val: i32) -> i32 {
    unsafe { Fallible_func(val) }
}

#[cfg(feature = "print")]
#[export_name = "print"]
pub extern "C" fn print(str: *const i8) {
    unsafe {
        // Call the FFI Print function with the a pointer to the C string `cstr`.
        Print(str as *mut i8);
    };
}

Importantly, the print function is guarded by the print feature that is off by default.

The c code we're binding looks like this:

#include <stdio.h>
#include <assert.h>

int Fallible_func(int value)
{
  assert(value > 0);
  return value;
}

void Print(char *str)
{
  printf("%s\n", str);
}

If we run the cargo build with NO_PRINTF=1, then the c code instead looks like this:

#include <stdio.h>
#include <assert.h>

int Fallible_func(int value)
{
  assert(value > 0);
  return value;
}

void Print(char *str)
{
  // printf("%s\n", str);
}

Scenario 1:

Note: This is all imports and exports.

  (import "wasi_snapshot_preview1" "fd_close" (func $__imported_wasi_snapshot_preview1_fd_close (;0;) (type 2)))
  (import "wasi_snapshot_preview1" "fd_fdstat_get" (func $__imported_wasi_snapshot_preview1_fd_fdstat_get (;1;) (type 3)))
  (import "wasi_snapshot_preview1" "fd_seek" (func $__imported_wasi_snapshot_preview1_fd_seek (;2;) (type 4)))
  (import "wasi_snapshot_preview1" "fd_write" (func $__imported_wasi_snapshot_preview1_fd_write (;3;) (type 5)))
  (export "memory" (memory 0))
  (export "fallible_func" (func $fallible_func.command_export))

WASM binary size: 2.4K

Scenario 2: NO_PRINTF=1

Note: This is all imports and exports. No imports at all here.

  (export "memory" (memory 0))
  (export "fallible_func" (func $fallible_func.command_export))

WASM binary size: 459B

Conclusion

This might be totally expected given the rust compilation pipeline and the way wasi-libc is authored and structured, but is quite surprising to me.

I was expecting that link-time optimizations would be 'smart' enough to notice that printf was not referenced by any exported functions in the rust library and recursively pruned. I've put together the minimal repro in hopes that it helps illustrate the observations in a reproducible way and could be helpful if this behaviour is not as intended.

sbc100 commented 8 months ago

The term LTO normally refers to a specific thing which -flto which means compilation/optimization at link time.

I believe what you are talking about here is normal linker DCE (dead code elimination) that we support via --gc-sections (which is on by default). There are some limitations in the current DCE that the linker does and potenially some ways we could improve it. I believe this kind of problem can sometimes stem from the fact that link DCE is performed in two phases.

Resolve all the symbols.
Walk the symbol dependency tree marking what is live (https://github.com/llvm/llvm-project/blob/main/lld/wasm/MarkLive.cpp) then don't include anything that is not live in the output.

Note: (2) only happens if --gc-sections is enabled.

One limitation that we currently have is that (1) is performed first. This is when we pull any needed object files from archives. Once we decide we need and object file its not possible to fully undo this decision, if we find out later (in 2) that we didn't need it after all.

@sunfishcode has tried to work around this limitation in some cases: https://reviews.llvm.org/D85062

ggoodman commented 8 months ago

Thanks for taking a look. There's a good chance I'm using the wrong terms everywhere! I'm not familiar with this whole toolchain at all. I used lto only because that was the term in the Cargo.toml I was using to opt into this behaviour.

I think the reference (or at least one of them) that is preventing the fd_* host bindings from being stripped away is in __stdio_exit, called by __wasm_call_dtors.

  (func $__wasm_call_dtors (;630;) (type 72)
    call $#func629<dummy>
    call $__stdio_exit
  )
  (func $_initialize.command_export (;631;) (type 31) (result i32)
    call $_initialize
    call $__wasm_call_dtors
  )

I think I can use a tool like walrus to perform strategic mutations to cut out this cruft and then use its gc to get what I want.

Having spent some time looking the .wat generated for this project and others, I notice that the tables become opaque barriers to understanding the call graph. I'm wondering if the way things are encoded into the big (elem ...) table is making static analysis very hard. Are there some incantations that deprioritize this way of encoding functions? I don't think there should be dynamic dispatch type scenarios here but--again--I'm very new to this and have a super loose understanding of the moving parts. I'm hoping that with a less optimized binary, I could perform some surgery on it and then later run wasm-opt against something with better visibility into control flow and reachability.

sbc100 commented 8 months ago

Thanks for taking a look. There's a good chance I'm using the wrong terms everywhere! I'm not familiar with this whole toolchain at all. I used lto only because that was the term in the Cargo.toml I was using to opt into this behaviour.

The --gc-sections behaviour is the default and should not require any opt in. If you are using the term lto in your build system I would guess (?) that is referring to the -flto compiler/link flag which I think is probably not related to this issue (unless I'm misunderstanding).

I would hope that the kind of DCE you are trying achieve here would work with or without LTO enabled.

I think the reference (or at least one of them) that is preventing the fd_* host bindings from being stripped away is in __stdio_exit, called by __wasm_call_dtors.
  (func $__wasm_call_dtors (;630;) (type 72)
    call $#func629<dummy>
    call $__stdio_exit
  )
  (func $_initialize.command_export (;631;) (type 31) (result i32)
    call $_initialize
    call $__wasm_call_dtors
  )
I think I can use a tool like walrus to perform strategic mutations to cut out this cruft and then use its gc to get what I want.

Having spent some time looking the .wat generated for this project and others, I notice that the tables become opaque barriers to understanding the call graph. I'm wondering if the way things are encoded into the big (elem ...) table is making static analysis very hard.

Indeed any time you have function pointer floating around (such as vtables, or in this case I believe musl using a kind of vtable type thing for std file handles) then DCE becomes hard since the linker cannot easily reason about the dependency graph. This is one reason why de-virtualization in C++ is useful when it works.

Are there some incantations that deprioritize this way of encoding functions? I don't think there should be dynamic dispatch type scenarios here but--again--I'm very new to this and have a super loose understanding of the moving parts. I'm hoping that with a less optimized binary, I could perform some surgery on it and then later run wasm-opt against something with better visibility into control flow and reachability.

I think one of the problem is that stdio in musl is using dyanmic dispatch. See https://github.com/bminor/musl/blob/7ada6dde6f9dc6a2836c3d92c2f762d35fd229e0/src/internal/stdio_impl.h#L28-L30

The dispatch tables then include pointer to the function you want to avoid including: https://github.com/bminor/musl/blob/7ada6dde6f9dc6a2836c3d92c2f762d35fd229e0/src/stdio/stdin.c#L11-L13

sbc100 commented 8 months ago

For __stdio_exit it looks like there is a dummy version in exit.c and the real version should only be included if __stdio_exit_needed is referenced.

The problem is that I imagine what is happening is that __stdio_exit_needed is found to be needed during the first round of symbol resolution in the linker, is that right?

ggoodman commented 8 months ago

@sbc100 as I noted, I'm a bit over my head here but that matches my intuition. Here's what I think is happening, again from intuition only:

I've got a dummy c library that has a strong reference to printf and when it is compiled, those strong references cause the *_needed to statefully toggle. Now my .a file for the library bakes in that state.

Even though the consuming rust project only consumes a symbol from the library that has no references, they are statefully baked in and obfuscated through that dynamic dispatch mechanism.

ggoodman commented 7 months ago

I wonder if there's some trick that might allow us to avoid the virtualization of these FILE methods in a way that still allows for dead-code elimination at linking time. My naive take on this would be that the FILE methods point to real functions that do the branching in their function bodies. The branching in the function body would need to be done in a way that compilers / linkers are able to statically analyze it out.

Does that sound plausible?

sbc100 commented 7 months ago

If you can find a way to do what you describe that would be great. I am not aware of any way to achieve that though.

infinitesnow commented 6 months ago

I am not sure whether what I have found is relevant for you, as it is for C++ and not Rust, but I was struggling with something similar (undesired fd_* imports) and I managed to remove them. The fix is simple but it required me to rebuild the SDK.

I haven't dug into productising this solution for various configurations and it's possibly not an ideal default, but I'd argue it should be taken under consideration, as as of now there is no way to natively run stdlib containers in simple bare metal applications in the browser without relying on a WASI runtime or recompiling the whole SDK (which takes a while as you surely are aware of).

I have pushed a PR with more details here: #418

ggoodman commented 1 month ago

I saw that WebAssembly/wasi-libc#505 landed and wonder if that might be a tool to help address the issues brought up here. If that's the case, is a custom build of the sdk required to take advantage of it?

WebAssembly / wasi-sdk

LTO failing to prune WASI fd_* imports #401

Setup

Conclusion