WebAssembly / binaryen

Optimizer and compiler/toolchain library for WebAssembly
Apache License 2.0
7.41k stars 731 forks source link

Input validation error #1264

Open msprotz opened 6 years ago

msprotz commented 6 years ago

Hello,

Trying wasm-opt on one of my wasm files, I get:

Fatal: error in validating input

This is a relatively unhelpful error. After translating to wast, I get:

jonathan@chartreuse:~/Code/hacl-star/code/ed25519/ed25519-c (master) $ wasm-opt Ed25519.wast 
[parse exception: duplicate import (at 209:2)]Fatal: error in parsing input

which now seems to be an error in parsing rather than validating?

jonathan@chartreuse:~/Code/hacl-star/code/ed25519/ed25519-c (master) $ sed -n 209,210p Ed25519.wast
  (import "Kremlin" "data_start" (global $0 i32))
  (import "C" "exit_success" (global $1 i32))

There is no Kremlin.wast or Kremlin.wasm in the current directory as I expect these to be passed by the outer Javascript environment when instantiating the Ed25519 module.

Thanks!

Jonathan

kripken commented 6 years ago

Very odd about the wasm error message - the validator should emit a complete message. In fact there is a PR that shortens it since it was too long ;) So something is going wrong on your file somehow. Perhaps are you just viewing stdout or stderr and not both, and we emit to the other? If that's not it, can you provide the wasm file, to reproduce the problem?

Is the wast parsing error looks correct (duplicate import)? If not, please provide the wast too. Note btw that the errors you see on wast and wasm files may differ, as how they are parsed is not identical - wasts are parsed in multiple passes, in particular. So what is a validation error may be a parse error in the other, but ignoring that, what is invalid in one should be invalid in the other, and only then.

Imports are not actually imported when we scan for errors, so it shouldn't matter which files are in your directory - it won't look for Kremlin.wasm when it checks that module.

msprotz commented 6 years ago

https://jonathan.protzenko.fr/whacl-star/ed25519-c/Ed25519.wasm https://jonathan.protzenko.fr/whacl-star/ed25519-c/Ed25519.wast

Note that https://jonathan.protzenko.fr/whacl-star/ed25519-c/main.html will happily load, instantiate and execute that module in your browser.

Thanks,

Jonathan

kripken commented 6 years ago

Thanks for the files. Yeah, both cases look like they are bugs, I'll investigate.

kripken commented 6 years ago
  1. The wast text error is because binaryen's text format assumes all imports share a single namespace. So an imported memory and global with the same name causes a collision. This might be a place where we diverge from the official text format, so it's worth investigating more and perhaps fixing. But we already diverge so much on the text format that this is low priority (that is, we don't claim to have a compatible text format).
  2. The wasm binary error is a serious bug, a type of stack-based code that we didn't handle. Details and fix in #1267. After that PR, we can read the wasm without error, and optimization also works fine, e.g. -Os shrinks the file to around half the original size.

Btw, what compiler was used to generate these? I don't think I'm familiar with it.

msprotz commented 6 years ago

Wonderful, thanks very much! That was fast :)

The code comes from my KreMLin compiler https://github.com/FStarLang/kremlin . KreMLin takes code written in F* and generates C or WASM. F* is a verification-oriented language, meaning the programmer writes code that enjoys memory safety, functional correctness, and a degree of resistance to side-channels. Kremlin takes that and generates C code and now, WASM code too.

Our flagship application is hacl-star (see https://github.com/mitls/hacl-star/), a set of verified cryptographic algorithms that enjoy the guarantees above. Mozilla's NSS recently integrated one our of algorithms. We now intend to ship a WASM version of the library. The generated code is very minimalistic, and the toolchain is very simple, meaning we are relatively confident that the guarantees that hold for the original F* code also apply to the resulting WASM code.

The landing page of HACL* has pointers to papers, if you're curious.

Now for more details:

The reason for data_start being exported by every one of my modules is that each module lays out its own constant strings in the data segment (using relative addressing), then relies on a custom "loader" to tell the next module where exactly the start of the data segment is. Perhaps there's a more idiomatic way of doing this? The relevant code is in propagate at, say, https://jonathan.protzenko.fr/whacl-star/ed25519-c/loader.js

Cheers,

Jonathan

kripken commented 6 years ago

Very interesting, thanks for the info!

I'm curious about the Low => wasm path that you're adding: since you already had Low => C, and could do C => wasm, what are the main reasons for adding the "bypass" route directly to wasm?

Another question, looking in the code, I saw the AstToWasm file, which looked like it has most of the compilation logic, but I couldn't figure out how it does the final part, if it compiles to a wasm binary or text?

About data_start, that sounds good, I don't think there's a better way - if each module has its own constant data, then they can receive an imported integer of where to place it, and inform how much space they used etc., and the loader coordinates all that so nothing collides. That's basically how dynamic linking for C/C++ etc. works.

Let me know if there's more stuff we can help with on the wasm side for this.

msprotz commented 6 years ago

The bypass route has multiple advantages:

@prosecco can comment more on the advantages of the "bypass" toolchain I'm sure.

The wasm binary / wast generation is done using https://github.com/WebAssembly/spec -- I directly reuse Andreas' library which is also written in OCaml. No extra work for me! The W module in https://github.com/FStarLang/kremlin/blob/master/src/CFlatToWasm.ml#L4 is Andreas' library, and I convert my AST to binary here: https://github.com/FStarLang/kremlin/blob/master/src/OutputJs.ml#L79

So far, things have been working out fine for us with Wasm. I've lobbied @rossberg extensively for tail-call optimization and SIMD support, which are the two big features that would change our life...

Thanks,

Jonathan

kripken commented 6 years ago

Makes sense, thanks for the info. Yeah, I think those are reasons we'll see more compilers take a similar path too (although the code size issue may be addressed by emscripten's wasm-standalone path eventually; still, even then if you write your own compiler you can of course do better).

About avoiding optimizations, I guess you'd also avoid running the binaryen optimizer (or another wasm => wasm optimizer) then? Or is there a chance you'd run a subset of those optimizations? I'm not sure how much code size and speed are the focus for you, but I ask because it's a goal for the binaryen optimizer to help a variety of compilers (not just LLVM output).

The wasm binary / wast generation is done using https://github.com/WebAssembly/spec -- I directly reuse Andreas' library which is also written in OCaml. No extra work for me!

Oh, nice :) Heh, I did have a feeling the AST style looked familiar, I should have recognized the spec stuff, and I should have guessed based on the language the compiler's written in too... ;)

rossberg commented 6 years ago

I'd also hope that it is easier to generate Wasm code than to output a complicated language like C. (Especially once you need features that C doesn't even provide, like exceptions or tail calls.)

prosecco commented 6 years ago

As Andreas says, bypassing C may often be better for compiling DSLs that may more naturally translate to WASM.

For our source language and examples, which are predominantly focused on security-centric applications, there is at least three important advantages in targeting WASM, even though we also generate C code from the same toolchain, so we could easily have used emscripten.

1) We carefully verify side-channel mitigations in our source code and we want to preserve them at least all the way to WASM. We have some ideas on verifying that the assembled/interpreted versions of WASM also preserve these properties. This would have been harder to guarantee for a full-fledged compiler like llvm/emscripten.

2) We want our generated code to be defensive against common JavaScript flaws. This means that within our JavaScript wrapper code we want to minimize any dependence on the tamperable JS environment. In previous work, we managed to isolate security-critical code using a statically typed subset of JavaScript (http://antoine.delignat-lavaud.fr/doc/usenixsec13.pdf http://antoine.delignat-lavaud.fr/doc/usenixsec13.pdf) We think WASM is much better for this kind of defensive programming, as long as the JS<->WASM API is carefully designed. Again, writing our own compiler helps us keep tight control over this kind of isolation guarantee.

3) We want the generated crypto code to be small and potentially reviewable/auditable. When we compile our code directly from F* to WASM we end up with a 20K file. An emscripten compiled version of Curve25519, as used by Signal Desktop, for example, is over 200KB.

On 6 Nov 2017, at 19:40, Jonathan Protzenko notifications@github.com wrote:

The bypass route has multiple advantages:

simpler compilation scheme, meaning a much smaller compiler to trust (i.e. a couple thousand lines of ML rather than all of LLVM) more control over things such as side-channel preservation (mostly relying on the fact that no optimizations are performed... of course, this doesn't prevent JITs from introducing side-channels) a much smaller codesize than what emscripten generates (from what we could tell from our initial experiments). @prosecco https://github.com/prosecco can comment more on the advantages of the "bypass" toolchain I'm sure.

The wasm binary / wast generation is done using https://github.com/WebAssembly/spec https://github.com/WebAssembly/spec -- I directly reuse Andreas' library which is also written in OCaml. No extra work for me! The W module in https://github.com/FStarLang/kremlin/blob/master/src/CFlatToWasm.ml#L4 https://github.com/FStarLang/kremlin/blob/master/src/CFlatToWasm.ml#L4 is Andreas' library, and I convert my AST to binary here: https://github.com/FStarLang/kremlin/blob/master/src/OutputJs.ml#L79 https://github.com/FStarLang/kremlin/blob/master/src/OutputJs.ml#L79 So far, things have been working out fine for us with Wasm. I've lobbied @rossberg https://github.com/rossberg extensively for tail-call optimization and SIMD support, which are the two big features that would change our life...

Thanks,

Jonathan

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/WebAssembly/binaryen/issues/1264#issuecomment-342244233, or mute the thread https://github.com/notifications/unsubscribe-auth/AEYAnvJ5utg2Q937D8a8tAcEssOYzI30ks5sz1KmgaJpZM4QQeu2.

kripken commented 6 years ago

@prosecco

When we compile our code directly from F* to WASM we end up with a 20K file. An emscripten compiled version of Curve25519, as used by Signal Desktop, for example, is over 200KB.

Interesting, I assume that is emscripten normally, not in wasm-standalone mode, i.e., without -s SIDE_MODULE=1? I'd be curious to see a comparison with that flag too, if it's not too much trouble. You would need to write your own js loading code in that case, but the compiled output should only contain your own code, in which case I'd expect it to be smaller than even the F* to wasm's 20K, because it shouldn't have extraneous code, and it has been optimized for size by LLVM and binaryen. (On the other hand, those optimizations might increase code size too, but in -Os or -Oz mode they shouldn't.)

Of course maybe you don't want to go that route for the other good reasons mentioned. I'm just curious if there is something we could do better for code size that we aren't doing already, in that mode.