WebAssembly / design

WebAssembly Design Documents
http://webassembly.org
Apache License 2.0

Self contained POSIX like binaries for testing #1236

Open pepyakin opened 6 years ago

pepyakin commented 6 years ago

We have the testsuite, and it's an excellent tool for validating the correctness of an implementation. However, I think it is still not enough: it would be a lot more useful to test against large real-world programs that have various code patterns, different function sizes, and so on.

But there is a problem: Wasm is conceived as an embeddable platform, and binaries might expect a certain environment to be available. This is in contrast to a self-contained platform that can run any binary.

For example, if we want to test an emscripten-generated binary in a non-JS setting, we have to provide the POSIX-like environment that emscripten provides. In wasmi we would have to implement it in Rust, in wabt/wasm2c in C++, and in asmble in Java. Even worse, they currently have different APIs for host bindings and might require different glue code (wasm-c-api might eventually help here, though).

What if we had an emscripten-compatible library that is implemented purely in WebAssembly and can satisfy all imports of an emscripten binary when linked with it? Every syscall would perform all its work in memory: in-memory I/O and in-memory stdin/stdout/stderr buffers. We should probably be able to configure the initial environment (initial FS structure, stdin contents, initial values for the clock or a seed). This can be achieved by creating another object file that sets up the initial environment and is linked with the whole thing.
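To make the in-memory idea concrete, here is a minimal sketch in Python rather than WebAssembly. Everything in it is an assumption made for illustration: the `InMemoryEnv` class, its method names, and the fixed-step clock are hypothetical and do not correspond to emscripten's actual syscall interface.

```python
class InMemoryEnv:
    """Hypothetical in-memory stand-in for the environment a binary imports."""

    STDIN, STDOUT, STDERR = 0, 1, 2

    def __init__(self, stdin=b"", files=None, clock_ns=0):
        # Initial environment: stdin contents, FS structure, clock value.
        self.stdin = bytes(stdin)
        self.stdin_pos = 0
        self.out = {self.STDOUT: bytearray(), self.STDERR: bytearray()}
        self.files = {path: bytearray(data) for path, data in (files or {}).items()}
        self.clock_ns = clock_ns

    def read(self, fd, count):
        # Only stdin is readable in this sketch; reads consume the buffer.
        if fd != self.STDIN:
            raise ValueError("only stdin is readable in this sketch")
        chunk = self.stdin[self.stdin_pos:self.stdin_pos + count]
        self.stdin_pos += len(chunk)
        return chunk

    def write(self, fd, data):
        # Output is appended to an in-memory buffer instead of a real device.
        if fd not in self.out:
            raise ValueError("only stdout and stderr are writable in this sketch")
        self.out[fd] += data
        return len(data)

    def clock_gettime(self):
        # Deterministic clock: starts at the configured value and advances
        # by a fixed step on every call, so runs are reproducible.
        now = self.clock_ns
        self.clock_ns += 1
        return now


env = InMemoryEnv(stdin=b"hello\n", files={"/etc/motd": b"hi"}, clock_ns=1000)
line = env.read(InMemoryEnv.STDIN, 64)
env.write(InMemoryEnv.STDOUT, line.upper())
```

Because nothing touches the host, two runs that start from the same initial environment should end with the same buffers.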

After linking, you have a binary that has only a start function and no imports, so you can run it with any VM. To verify the correctness of the execution, you just have to compare the contents of memory and global variables.
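One way the comparison could work is to hash the final state and compare digests across VMs. This is only a sketch under assumptions: the `state_digest` helper is hypothetical, it assumes each VM can dump its linear memory as bytes and its globals as integers, and it ignores float globals.

```python
import hashlib
import struct


def state_digest(memory, globals_):
    """Hash a final-state snapshot: linear memory plus integer globals."""
    h = hashlib.sha256()
    # Include the memory length so that memory and globals cannot be confused.
    h.update(struct.pack("<Q", len(memory)))
    h.update(memory)
    for value in globals_:
        # Assumption: every global fits in a signed 64-bit integer.
        h.update(struct.pack("<q", value))
    return h.hexdigest()


# Snapshots as they might be dumped by three different runs (made-up data).
run_a = state_digest(b"\x00\x01\x02", [7, -1])
run_b = state_digest(b"\x00\x01\x02", [7, -1])
run_c = state_digest(b"\x00\x01\x03", [7, -1])
```

Here `run_a` and `run_b` agree, while `run_c` differs because one byte of memory differs.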

With such a tool you can easily (well, easier than implementing a whole emscripten API) create a set of real-world program binaries for testing, e.g. clang.

How does this sound? Is this feasible? Do you see any problems? Would anyone be interested in using and/or collaborating on such a project?

@sunfishcode @binji @rossberg @cretz

cretz commented 6 years ago

I would definitely appreciate torture/stress tests with large, real-world applications without requiring JS.

If I'm understanding right, it seems like a lot of work to either hand-write this in WASM or essentially impl an in-memory-only libc in Rust or C or something. There is probably an opportunity for more than just a test harness for an all-WASM POSIX impl. You could even go halfway: impl what you need and defer to a much smaller set of functions that could be imported from the host, similar to how Linux does syscalls (host bindings can't come soon enough). I'm probably not available to help that much with it these days, but I would definitely work it into my test cases.

Some of my biggest stress tests lately have been from Go, which would not benefit much from this effort, but luckily the imports that runtime requires from the host are a very small, easy-to-implement list.

sunfishcode commented 6 years ago

I'm hoping that, with projects such as the reference-sysroot and discussions around the idea of builtin modules, some common syscall-like APIs can be established, and then we can develop a variety of compatible implementations. Your idea of an implementation that works entirely within in-memory data structures in wasm sounds useful!

binji commented 6 years ago

I like the idea of an in-memory libc, though I don't think it's going to be enough for many programs. I'm sure you've found this too, but an emscripten-generated wasm file doesn't have quite enough information to run. You often need to extract a bit more info about the memory layout from the JS file to do this. This seems like a pretty easy fix, though (maybe an emscripten custom section?)
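If the layout info did live in a custom section, a test harness could read it without the JS file. The sketch below relies only on the standard binary layout (8-byte header, then sections of id byte, LEB128 size, contents, with id 0 for custom sections); the section name `layout` and its payload are made up for illustration and are not something emscripten emits.

```python
def read_leb_u32(buf, pos):
    """Decode an unsigned LEB128 integer; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def custom_sections(module):
    """Return {name: payload} for every custom section in a wasm binary."""
    if module[:4] != b"\x00asm":
        raise ValueError("not a wasm binary")
    pos = 8  # skip the 4-byte magic and the 4-byte version
    found = {}
    while pos < len(module):
        section_id = module[pos]
        size, pos = read_leb_u32(module, pos + 1)
        end = pos + size
        if section_id == 0:  # custom section: name, then arbitrary bytes
            name_len, name_pos = read_leb_u32(module, pos)
            name = module[name_pos:name_pos + name_len].decode("utf-8")
            found[name] = module[name_pos + name_len:end]
        pos = end
    return found


# Build a tiny module with one hypothetical custom section named "layout".
name = b"layout"
payload = b"\x00\x04\x00\x00"  # made-up contents, e.g. an address
body = bytes([len(name)]) + name + payload
module = b"\x00asm" + b"\x01\x00\x00\x00" + bytes([0, len(body)]) + body
sections = custom_sections(module)
```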

The bigger issue is that many real-world emscripten programs have custom JavaScript imports that need to be implemented. Even relatively self-contained apps like funky karts require some custom code for the EM_ASM parts. That said, it should be enough for relatively simple command-line apps.

pepyakin commented 6 years ago

@binji My text mentioned emscripten a lot, but maybe that was because I have never worked with clang. : )

By real-world programs I mean something like clang, ffmpeg, python, or anything else that executes a lot of code and can feasibly be run in such an emulated environment. I have no idea whether it's feasible to compile them with clang right now.

@sunfishcode I hadn't heard about the reference-sysroot project. Sounds useful!

@cretz I think it doesn't have to be a libc per se. Maybe we could reuse the musl wasm proto and implement the syscall interface for that. But anyway, yeah, it would be a significant amount of work.