bytecodealliance / wasm-tools

CLI and Rust libraries for low-level manipulation of WebAssembly modules
Apache License 2.0
1.29k stars 232 forks source link

wasmparser -> wasm-encoder bridge #1588

Closed juntyr closed 2 months ago

juntyr commented 3 months ago

In my usecase, I perform simple transformations on a WASM module (e.g. adding a few instructions to all existing functions). While walrus with its convenient high-level API has served this usecase in the past, it is falling behind on supporting newer WASM features, and falling back to a low-level API is possible for me as well. Using wasmparser and Wasm-encoder myself seems overkill, however, as it exposes my code to the full complexity of the WASM binary format and the fast-developing APIs of both crates.

As both wasmparser and wasm-encoder are jointly developed in this repo, it would be great if a new crate could be developed to bridge the two. Ideally, this crate would hook into wasmparsers streaming API and recreate the same WASM module using wasm-encoder on the fly. A shallow stream adapter API could then be used to modify items on the fly or to insert new ones, i.e. the API would allow any wasmparser stream to go through an adapter function which takes the iterator and produces and iterator which is then consumed by wasm-encoder. To keep the library minimal and low-level, the user would still need to take care of e.g. renumbering indices if additional imports are added.

TLDR: Adding a bridge crate between wasmparser and wasm-encoder that optionally feeds the parsed streams through user-provided iterator adapters and then into wasm-encoder would be awesome :)

alexcrichton commented 3 months ago

Thanks for the report! Nowadays I think this'd definitely fit well in the repository in the sense that with the current release process everything is released as a batch so it's fine to have inter-crate dependencies be deep and public. There's actually already a wasmparser feature on the wasm-encoder crate with a few From impls, but I think it'd be reasonable to expand that with more functions and/or more conversions. For example there could be constructors a long the lines of "build this entire section from this wasmparser section" or other things like "initialize this function's locals with this wasmparser function's locals" (or body) or similar.

Basically I think this would be a good idea to do, and I think it'd be best done by adding more code to wasm-encoder rather than a new crate on top of the two.

juntyr commented 3 months ago

Basically I think this would be a good idea to do, and I think it'd be best done by adding more code to wasm-encoder rather than a new crate on top of the two.

That sounds like a good plan! I had a quick look at the feature and it seems that it provides just a few low-level conversions for now.

A first step would probably be to implement every possible From-impl and then some FromIterator or Extend adapters for the stream-based APIs. Then, it should be possible to parse an existing module as a stream and to pass on that stream to the encoder to re-encode the module, such that it can be re-parsed to complete a full roundtrip.

Having such roundtrips would already allow many small-edit cases by using all From-impls for unchanged sections but at the cost of manually re-coding every path where something has to be changed. It would then probably become clear quite quickly which conversion impls would benefit from custom utility functions to write code for what is actually modified.

alexcrichton commented 3 months ago

I think that's a great first step :+1:

juntyr commented 3 months ago

I tried to code up a small demo for parsing and encoding core-wasm modules (without GC) here (without any error handling so far) - does the approach look broadly correct (the enabled parser features are a project-specific detail of mine)?

https://gist.github.com/juntyr/95629d82c1b1e8048a70d74b934d67cc

Doing this also brought out a few small conversion functions and a large one (instructions) that I could add to make this code a lot cleaner.

alexcrichton commented 3 months ago

I think that's pretty reasonable yeah. I didn't make a PR but I made a branch a few weeks ago sketching out what this might look like in wasm-encoder itself by basically adding lots of little helper methods all over the place (and more Into/From impls as you've already found as well). I didn't get so far as to figure out what converting the top-level module would look like though.

The reason I didn't end up pursuing this further is that I couldn't shake the feeling that this was close to the desired abstraction but not precisely it because all the existing parser->encoder rewrites in wasm-tools are all doing some form of mutation. For example changing index spaces or dropping functions or things like that. The structure of "just take the wasmparser bits and put them here" didn't work well with that. I was starting to envision a trait of some kind which allows mapping indices and all conversion operations are defined on that trait instead of Into/From (or something like that) but I didn't bottom it out enough to make a PR.

If you find the use case though of "just take parser bits and put them in the encoder" useful then I think it's reasonable to submit PRs for the helpers you've implemented or I can also try to clean up the branch I had.

juntyr commented 3 months ago

I think that's pretty reasonable yeah. I didn't make a PR but I made a branch a few weeks ago sketching out what this might look like in wasm-encoder itself by basically adding lots of little helper methods all over the place (and more Into/From impls as you've already found as well).

Oh wow this branch looks fantastic, it would make my code example very clean to write!

I didn't get so far as to figure out what converting the top-level module would look like though.

The reason I didn't end up pursuing this further is that I couldn't shake the feeling that this was close to the desired abstraction but not precisely it because all the existing parser->encoder rewrites in wasm-tools are all doing some form of mutation. For example changing index spaces or dropping functions or things like that. The structure of "just take the wasmparser bits and put them here" didn't work well with that.

I think the main value in these helped functions is to handle everything which you don’t care about (changing). One of my transformations touches every function but doesn’t care about the other sections. In this case, every non-code section would use the new helpers, while the code one would be manually spelled out but still use the instruction translation helper to copy over most instructions as is. I think this is a use case that can be well-supported with these helpers.

I also have the other use case you mention where I need to do some reindexing of global IDs. For this, a different interface is needed, something akin to a mutable in-place visitor that is called for every ID matching the type (e.g. global ID). This should be done with a new trait that does the parse-encoding but also calls the visitor - the from impls would then be the special case where no mutation has to be made (and they could be implanted internally with a no-op visitor).

This second case indeed looks like it will require more design work, but I think that this would be additive and could be done after the simpler API lands and we see where a visitor mutation API is needed.

I was starting to envision a trait of some kind which allows mapping indices and all conversion operations are defined on that trait instead of Into/From (or something like that) but I didn't bottom it out enough to make a PR.

If the indices were more strongly typed, then a conversion trait based approach could indeed make sense here, but I think that a visitor for e.g. “call me on every function index” would also work.

If you find the use case though of "just take parser bits and put them in the encoder" useful then I think it's reasonable to submit PRs for the helpers you've implemented or I can also try to clean up the branch I had.

I think all little helpers I wrote are already in your branch, and the bigger one would mostly make sense as application logic to start with.

How about I play around with your branch tomorrow to see how it would work for my usecase so that I can help with the cleanup and making a PR from it?

Thank you again for all of your help!

alexcrichton commented 3 months ago

How about I play around with your branch tomorrow to see how it would work for my usecase so that I can help with the cleanup and making a PR from it?

Please do!


FWIW my rough thinking was something like:

trait WasmparserToWasmEncoder {
    fn func(&self, idx: u32) -> u32 { idx }
    fn memory(&self, idx: u32) -> u32 { idx }
    // same for all other items like tables/globals/tags/etc

    fn memarg(&self, arg: &wasmparser::MemArg) -> wasm_encoder::MemArg { /* default impl */ }
    // all other helpers which are currently Into/From
}

struct NoopRemap;

impl Remap for NoopRemap {}

impl From<wasmparser::MemArg> for wasm_encoder::MemArg {
    fn from(arg: wasmparser::MemArg) -> wasm_encoder::MemArg {
        NoopRemap.memarg(arg)
    }
}

or... something like that. I would only want to use this if it could actually replace what all the various remapping places we have already today are doing. For example the gc bits of wit-component would be great to use an interface like this instead of hand-coding a bunch. Those bits are also quite tricky though in terms of other extra logic, in addition to things like wasm-mutate. Overall this trait seemed like overkill and I couldn't figure out how to actually integrate it.

In any case this is all orthogonal to what you're looking to add, so happy to review a PR!

juntyr commented 3 months ago

Ok, the following slightly less messy code is able to perfectly roundtrip all core WASM modules that my usecase throws at it:

#[allow(clippy::too_many_lines)]
pub fn parse_encode(wasm: &[u8]) -> Vec<u8> {
    macro_rules! parse_payload {
        (if let $pat:pat = $payload:ident $block:block for $parser:ident ( $wasm:ident )) => {
            if let $pat = $payload {
                $block

                next_payload(&mut $parser, $wasm)
            } else { ($payload, $wasm) }
        };
        (while let $pat:pat = $payload:ident $block:block for $parser:ident ( $wasm:ident )) => {{
            let mut payload = $payload;
            let mut wasm = $wasm;

            loop {
                if let $pat = payload {
                    $block

                    (payload, wasm) = next_payload(&mut $parser, wasm);
                } else { break (payload, wasm) }
            }
        }};
    }

    fn next_payload<'a>(
        parser: &mut wasmparser::Parser,
        wasm: &'a [u8],
    ) -> (wasmparser::Payload<'a>, &'a [u8]) {
        match parser.parse(wasm, true).unwrap() {
            wasmparser::Chunk::NeedMoreData(_) => panic!(),
            wasmparser::Chunk::Parsed { consumed, payload } => (payload, &wasm[consumed..]),
        }
    }

    let mut parser = wasmparser::Parser::new(0);
    parser.set_features(
        wasmparser::WasmFeaturesInflated {
            // MUST: floats are required and we are running the NaN
            //       canonicalisation transform to make them deterministic
            floats: true,
            ..crate::engine::DETERMINISTIC_WASM_MODULE_FEATURES
        }
        .into(),
    );

    let mut module = wasm_encoder::Module::new();

    let (payload, wasm) = next_payload(&mut parser, wasm);

    let wasmparser::Payload::Version { encoding, .. } = payload else {
        panic!();
    };

    if !matches!(encoding, wasmparser::Encoding::Module) {
        panic!();
    }

    let (payload, wasm) = next_payload(&mut parser, wasm);

    let (payload, wasm) = parse_payload! {
        while let wasmparser::Payload::CustomSection(reader) = payload {
            module.section(&wasm_encoder::CustomSection { name: reader.name().into(), data: reader.data().into() });
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        if let wasmparser::Payload::TypeSection(reader) = payload {
            let mut types = wasm_encoder::TypeSection::new();
            types.parse_section(reader).unwrap();
            module.section(&types);
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        while let wasmparser::Payload::CustomSection(reader) = payload {
            module.section(&wasm_encoder::CustomSection { name: reader.name().into(), data: reader.data().into() });
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        if let wasmparser::Payload::ImportSection(reader) = payload {
            let mut imports = wasm_encoder::ImportSection::new();
            imports.parse_section(reader).unwrap();
            module.section(&imports);
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        while let wasmparser::Payload::CustomSection(reader) = payload {
            module.section(&wasm_encoder::CustomSection { name: reader.name().into(), data: reader.data().into() });
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        if let wasmparser::Payload::FunctionSection(reader) = payload {
            let mut functions = wasm_encoder::FunctionSection::new();
            functions.parse_section(reader).unwrap();
            module.section(&functions);
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        while let wasmparser::Payload::CustomSection(reader) = payload {
            module.section(&wasm_encoder::CustomSection { name: reader.name().into(), data: reader.data().into() });
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        if let wasmparser::Payload::TableSection(reader) = payload {
            let mut tables = wasm_encoder::TableSection::new();
            tables.parse_section(reader).unwrap();
            module.section(&tables);
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        while let wasmparser::Payload::CustomSection(reader) = payload {
            module.section(&wasm_encoder::CustomSection { name: reader.name().into(), data: reader.data().into() });
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        if let wasmparser::Payload::MemorySection(reader) = payload {
            let mut memories = wasm_encoder::MemorySection::new();
            memories.parse_section(reader).unwrap();
            module.section(&memories);
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        while let wasmparser::Payload::CustomSection(reader) = payload {
            module.section(&wasm_encoder::CustomSection { name: reader.name().into(), data: reader.data().into() });
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        if let wasmparser::Payload::GlobalSection(reader) = payload {
            let mut globals = wasm_encoder::GlobalSection::new();
            globals.parse_section(reader).unwrap();
            module.section(&globals);
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        while let wasmparser::Payload::CustomSection(reader) = payload {
            module.section(&wasm_encoder::CustomSection { name: reader.name().into(), data: reader.data().into() });
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        if let wasmparser::Payload::ExportSection(reader) = payload {
            let mut exports = wasm_encoder::ExportSection::new();
            exports.parse_section(reader).unwrap();
            module.section(&exports);
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        while let wasmparser::Payload::CustomSection(reader) = payload {
            module.section(&wasm_encoder::CustomSection { name: reader.name().into(), data: reader.data().into() });
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        if let wasmparser::Payload::StartSection { func, .. } = payload {
            module.section(&wasm_encoder::StartSection {
                function_index: func,
            });
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        while let wasmparser::Payload::CustomSection(reader) = payload {
            module.section(&wasm_encoder::CustomSection { name: reader.name().into(), data: reader.data().into() });
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        if let wasmparser::Payload::ElementSection(reader) = payload {
            let mut elements = wasm_encoder::ElementSection::new();
            elements.parse_section(reader).unwrap();
            module.section(&elements);
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        while let wasmparser::Payload::CustomSection(reader) = payload {
            module.section(&wasm_encoder::CustomSection { name: reader.name().into(), data: reader.data().into() });
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        if let wasmparser::Payload::DataCountSection { count, .. } = payload {
            module.section(&wasm_encoder::DataCountSection { count });
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        while let wasmparser::Payload::CustomSection(reader) = payload {
            module.section(&wasm_encoder::CustomSection { name: reader.name().into(), data: reader.data().into() });
        } for parser(wasm)
    };

    let mut codes = None;

    let (payload, wasm) = parse_payload! {
        if let wasmparser::Payload::CodeSectionStart { count: _count, .. } = payload {
            codes = Some(wasm_encoder::CodeSection::new());
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        while let wasmparser::Payload::CodeSectionEntry(reader) = payload {
            let mut function = wasm_encoder::Function::new_parsed_locals(&reader).unwrap();
            let instructions = reader.get_operators_reader().unwrap();
            for instruction in instructions {
                function.instruction(&instruction.unwrap().try_into().unwrap());
            }
            codes.as_mut().unwrap().function(&function);
        } for parser(wasm)
    };

    if let Some(codes) = codes {
        module.section(&codes);
    }

    let (payload, wasm) = parse_payload! {
        while let wasmparser::Payload::CustomSection(reader) = payload {
            module.section(&wasm_encoder::CustomSection { name: reader.name().into(), data: reader.data().into() });
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        while let wasmparser::Payload::DataSection(reader) = payload {
            let mut data = wasm_encoder::DataSection::new();
            data.parse_section(reader).unwrap();
            module.section(&data);
        } for parser(wasm)
    };

    let (payload, wasm) = parse_payload! {
        while let wasmparser::Payload::CustomSection(reader) = payload {
            module.section(&wasm_encoder::CustomSection { name: reader.name().into(), data: reader.data().into() });
        } for parser(wasm)
    };

    let wasmparser::Payload::End(_) = payload else {
        panic!();
    };

    if !wasm.is_empty() {
        panic!();
    }

    let wasm = module.finish();

    wasmparser::Validator::new_with_features(
        wasmparser::WasmFeaturesInflated {
            // MUST: floats are required and we are running the NaN
            //       canonicalisation transform to make them deterministic
            floats: true,
            ..crate::engine::DETERMINISTIC_WASM_MODULE_FEATURES
        }
        .into(),
    )
    .validate_all(&wasm)
    .unwrap();

    wasm
}
juntyr commented 3 months ago

FWIW my rough thinking was something like:

trait WasmparserToWasmEncoder {
    fn func(&self, idx: u32) -> u32 { idx }
    fn memory(&self, idx: u32) -> u32 { idx }
    // same for all other items like tables/globals/tags/etc

    fn memarg(&self, arg: &wasmparser::MemArg) -> wasm_encoder::MemArg { /* default impl */ }
    // all other helpers which are currently Into/From
}

struct NoopRemap;

impl Remap for NoopRemap {}

impl From<wasmparser::MemArg> for wasm_encoder::MemArg {
    fn from(arg: wasmparser::MemArg) -> wasm_encoder::MemArg {
        NoopRemap.memarg(arg)
    }
}

or... something like that. I would only want to use this if it could actually replace what all the various remapping places we have already today are doing. For example the gc bits of wit-component would be great to use an interface like this instead of hand-coding a bunch. Those bits are also quite tricky though in terms of other extra logic, in addition to things like wasm-mutate. Overall this trait seemed like overkill and I couldn't figure out how to actually integrate it.

I do actually really like this approach!

Now that I've written out a roundtrip parse-encoder for just core wasm modules (which is what I'll focus on), I think that making everyone rewrite that anytime they want to change something is not a good user experience. However, it would fit in perfectly with your trait idea! Essentially, the highest level method it would provide would be this one - parse all sections and re-emit them as found. For each section found, we'd then have a trait method that is just given this section and can do anything it wants with it, where the default implementation is to rountrip-encode it using even lower-level trait methods. That way, if I want to only fiddle with function bodies but don't care about anything else, I'd only have to override the method that fiddles with function bodies. For cases where I only need to inspect a section but not modify it, we could publicly export free functions that handle the default behaviour (which are called by the default trait impl) so that they can be reused as well.

juntyr commented 3 months ago

Ok, I've played around with the branch a bit. Out of my wasm transformation cases, it solves 2.75/3 ... the remaining bit is of course an index renumbering.

I'll see how bad the code looks with doing that manually

alexcrichton commented 3 months ago

Ok, the following slightly less messy code is able to perfectly roundtrip all core WASM modules that my usecase throws at it:

One thing perhaps worth pointing out is that if you're not actually modifying a section it'll be much faster to do something like this which is basically a memcpy vs parse-then-translate-to-wasm-encoder-then-encode. The "parse the whole section" methods are mostly only useful if you're editing/dropping items or prepending/appending items.

Essentially, the highest level method it would provide would be this one - parse all sections and re-emit them as found

Agreed! That's the high-level idea at least, I think it still needs to be proven out a bit first with a use case or two. (e.g. what you're doing)

juntyr commented 3 months ago

Do you have an example of where inside wasm-tools some index remapping is taking place?

juntyr commented 3 months ago

I think we can definitely move forward with your branch (I rebased it onto main and I can take care of any cleanup we need) to add that functionality already. In parallel to that, I can try to sketch out the more powerful API with my usecase, which can then later become an implementation detail of the your-branch-helper methods (and also be exposed separately).

alexcrichton commented 3 months ago

Inside of wit-component is one example and wasm-mutate has a whole framework sort of like this which might be able to be used wholesale too. Both are pretty nontrivial though in how they're integrated with the rest of the crate so may not be the best examples.

The simplest read-modify-write examples are wasm-tools strip and wasm-tools demangle, but they don't touch indices just custom sections. (but it'd be much nicer to write those with a higher level abstraction)

juntyr commented 2 months ago

I'm happy to close this now that #1628 has landed :)