Closed lucasavila00 closed 4 months ago
Thanks for the report! I can reproduce this myself but I've moved this now from the wit-bindgen repository to the jco repository. I believe that this is a bug in jco's string encoding implementation currently. The allocation at the end needs to precisely fit the length of the string so the over-allocated size is what's causing the issue here. This is basically the exact same bug as what came up in wasm-bindgen, just in a different context.
The Canonical ABI is pretty strict about the allowable behaviors here in terms of how strings are lowered. @guybedford I think you'll want to transcribe the `store_utf16_to_utf8` algorithm in that document for lowering strings from JS into the utf-8 encoding, which would involve tripling the `.length` of the string, allocating that number of bytes, and shrinking afterwards if it's too large.
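For reference, that worst-case shape can be sketched in JS glue roughly as follows. The linear memory and bump-style `realloc` here are toy stand-ins I'm making up for illustration; they are not jco's actual API.

```javascript
// Toy stand-ins for wasm linear memory and the guest's cabi_realloc;
// illustrative only, not jco's real implementation.
const memory = new Uint8Array(1 << 16);
let heapTop = 8;
function realloc(oldPtr, oldSize, align, newSize) {
  if (newSize <= oldSize) return oldPtr;          // shrink in place
  const newPtr = heapTop;                         // bump-allocate
  heapTop += newSize;
  memory.copyWithin(newPtr, oldPtr, oldPtr + oldSize);
  return newPtr;
}

const encoder = new TextEncoder();

// store_utf16_to_utf8 shape: allocate the 3x worst case up front
// (each UTF-16 code unit can expand to at most 3 UTF-8 bytes),
// encode, then shrink the allocation to the exact encoded length.
function lowerStringWorstCase(str) {
  const maxLen = str.length * 3;
  let ptr = realloc(0, 0, 1, maxLen);
  const { written } = encoder.encodeInto(str, memory.subarray(ptr, ptr + maxLen));
  if (written < maxLen) {
    ptr = realloc(ptr, maxLen, 1, written);       // must fit precisely
  }
  return { ptr, len: written };
}
```

The final shrinking `realloc` is exactly the step that, when skipped, trips Rust's debug assertions in the allocator on the guest side.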
That being said, I know from my past work on wasm-bindgen that it's much more beneficial to assume ASCII most of the time and pessimize later with another allocation. In that sense I don't think the best algorithm to implement is allowed by the component model right now. See `store_string_into_range`: the host (JS in this case) is currently required to do one of those algorithms more-or-less exactly and technically isn't allowed to deviate. (My guess though is that in practice most hosts will probably deviate from the spec there, since it's nontrivial to implement exactly as-is.)
Also cc @lukewagner on this since this is an interesting case coming up spec-wise. The summary is that Rust's debug assertions are catching a case where jco didn't shrink the size of a string allocation during lowering to the exact size of the string. That's easily fixable, but the point I'd more want to raise here is that it's not clear to me how to implement "the most efficient way of transferring a string to wasm from JS" given the current component model spec. For example, wasm-bindgen's current implementation (originally from https://github.com/rustwasm/wasm-bindgen/pull/1736, I believe) will (a) assume ASCII and copy byte-by-byte and, failing that, (b) fall back to `TextEncoder`-specific bits using `encodeInto`.
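A minimal sketch of that two-phase strategy, again with a made-up linear memory and `realloc` purely for illustration (wasm-bindgen's real `passStringToWasm` differs in its details):

```javascript
// Toy stand-ins for wasm linear memory and the guest's cabi_realloc;
// illustrative only, not wasm-bindgen's real glue.
const memory = new Uint8Array(1 << 16);
let heapTop = 8;
function realloc(oldPtr, oldSize, align, newSize) {
  if (newSize <= oldSize) return oldPtr;          // shrink in place
  const newPtr = heapTop;                         // bump-allocate
  heapTop += newSize;
  memory.copyWithin(newPtr, oldPtr, oldPtr + oldSize);
  return newPtr;
}

const encoder = new TextEncoder();

// Two-phase lowering: (a) optimistically allocate str.length bytes and
// copy code units by hand while they are ASCII, then (b) on the first
// non-ASCII code unit, grow to the UTF-8 worst case, finish with
// TextEncoder.encodeInto, and shrink to the exact encoded size.
function lowerStringOptimistic(str) {
  const len = str.length;
  let ptr = realloc(0, 0, 1, len);
  let offset = 0;
  for (; offset < len; offset++) {
    const code = str.charCodeAt(offset);
    if (code > 0x7f) break;                       // fall back to the slow path
    memory[ptr + offset] = code;
  }
  if (offset !== len) {
    // Each remaining UTF-16 code unit expands to at most 3 UTF-8 bytes.
    const maxLen = offset + (len - offset) * 3;
    ptr = realloc(ptr, len, 1, maxLen);
    const { written } = encoder.encodeInto(
      str.slice(offset),
      memory.subarray(ptr + offset, ptr + maxLen),
    );
    offset += written;
    ptr = realloc(ptr, maxLen, 1, offset);        // shrink to the exact size
  }
  return { ptr, len: offset };
}
```

For a pure-ASCII string this performs exactly one allocation and no `TextEncoder` call at all, which is the case the discussion below argues the Canonical ABI should permit.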
Thanks for pointing this out. So the optimistic initial allocation does assume that each code point is ASCII and fits in a byte (hence allocating `src_code_units` bytes). But perhaps the problem you're referring to is that the initial copy into this optimistically-sized buffer doesn't stop as soon as it hits non-ASCII -- it stops once it fills up the entire buffer. Thus maybe what we want instead is a manual loop that copies up to the first non-ASCII code point and then immediately does the realloc before going farther; would that match the ideal JS glue?
Sort of, yeah, but not precisely. It's mostly that the rigidity of writing down the exact method of how string translation should happen can be limiting in some contexts. JS here is an example where, according to wasm-bindgen, the fastest way to copy a string is to (a) copy manually in JS without `TextEncoder` until a non-ASCII byte is encountered and (b) fall back to using `TextEncoder` and resizing trickery. Such an algorithm was, I believe, measured to be optimal for ASCII strings, and my guess is that lots of things on the web are ASCII, like all DOM element names, properties, and such. (User-input strings less so, depending on where you are.)
This specific algorithm for JS can't be implemented for components as it's not specified in the canonical ABI. That's not a break-the-bank situation per se but I mostly wanted to highlight this as a potential weakness.
If we were to change the Canonical ABI to match what JS wants to do, it seems like it would be no worse, and possibly better, for all the other host embeddings, so we could just do that (and probably this is what I should've started with). If we found a fundamental, unresolvable tension between two observably different string-copying algorithms, we could add more non-determinism to the spec to give hosts more wiggle room, but given the general portability benefits of deterministic semantics, I think we should build up that motivation with some concrete examples first.
That's true, yeah, and I think concretely this would change the utf16 -> utf8 path to assume ASCII first and then only start reallocating once non-ASCII is hit?
Yep! It'd be a pretty small tweak to the existing Python code, I think; I can work up a patch in a week.
Ok, filed component-model/#366
As far as I understand the issue, I believe this was resolved in https://github.com/bytecodealliance/jco/pull/444 and we can shortly update back to being zero copy per https://github.com/bytecodealliance/jco/issues/448 for the next version of Wasmtime.
Please correct me if I've misinterpreted any of the above, and otherwise I'll close this issue.
To reproduce:
Clone this repository https://github.com/lucasavila00/bindgencrash
It contains a `wit` world with a function that receives a `string` and returns an `i32`.
The function just returns 1 and never reads the string.
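The shape of that world is roughly the following. This is a hypothetical reconstruction; the actual package and function names in the linked repository may differ:

```wit
// Illustrative only: a world exporting a function that takes a string
// and returns an integer, matching the description above.
package example:bindgencrash;

world crash {
  export takes-string: func(s: string) -> s32;
}
```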
When I call the compiled wasm with some specific string, the wasm code crashes.
I wonder if this is related to the dlmalloc change https://github.com/alexcrichton/dlmalloc-rs/issues/32
This change broke wasm-bindgen (see https://github.com/rust-lang/rust/issues/124671, https://github.com/rustwasm/wasm-bindgen/issues/3801, https://github.com/rustwasm/wasm-bindgen/pull/3808)
Is wit-bindgen also affected by this?
It looks similar to the related issues, as this crash happens when a specific input string is used (e.g. https://github.com/lucasavila00/bindgencrash/blob/main/tests/test-e2e.js#L14-L17) and the issue occurs when deallocating such strings.
I know it happens with wit-bindgen 0.16.0 and 0.26.0.