UTF-8 Invalid sequences for component model strings

Forgive me if I'm missing it, but is there a discussion of how the Unicode UTR 36: UTF-8 Exploits are addressed by the component model strings?

From what I can tell from the CanonicalABI, the string lift operation is responsible for validating the string and trapping on "Unicode Errors".

I'm wondering what guarantees I have, as a component author, that lowered strings are valid UTF-8 from the security perspective of that report. For example, both the cited UTR 36 document and the UTF-8 Wikipedia: Invalid Sequences and Error Handling topic specifically describe overlong encodings as a cause of security issues in web services (a relevant use case for WASM components) and as something that decoders (here, WASM component authors) can overlook.
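To make the overlong-encoding concern concrete, here is a minimal sketch. It is only an illustration and is not taken from the CanonicalABI: the helper names `is_strict_utf8` and `naive_decode_2byte` are hypothetical, and Python's built-in strict decoder stands in for a conforming validator. It shows a lenient decoder accepting a two-byte form of "/" that a strict decoder rejects.

```python
# Overlong encodings of "/" (U+002F). The only well-formed UTF-8 encoding
# of this code point is the single byte 0x2F.
OVERLONG_SLASH_2 = bytes([0xC0, 0xAF])
OVERLONG_SLASH_3 = bytes([0xE0, 0x80, 0xAF])


def is_strict_utf8(data: bytes) -> bool:
    """Return True only if Python's strict decoder accepts the bytes."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def naive_decode_2byte(data: bytes) -> str:
    """Deliberately flawed (hypothetical) decoder for a two-byte sequence.

    It combines the payload bits without checking that the result
    actually needed two bytes, so it accepts overlong forms.
    """
    lead, cont = data
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))


if __name__ == "__main__":
    print(is_strict_utf8(b"/"))
    print(is_strict_utf8(OVERLONG_SLASH_2))
    print(naive_decode_2byte(OVERLONG_SLASH_2))
```

A filter that scans the raw bytes for 0x2F before handing them to a lenient decoder like the second function would miss the "/" entirely, which is the class of issue UTR 36 describes.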

Some concrete questions:

1. Are there guarantees that can be documented for component authors about strings lowered into a component, and about the errors raised for improperly formatted sequences in lifted strings?
2. Are there compliance tests for tooling / host runtimes around expected UTF-8 validation (particularly as the security issues there are relevant to server applications of WASM components)?
3. The canon_lower topic has a discussion point on efficient trampolines:

Since any cross-component call necessarily transits through a statically-known canon_lower+canon_lift call pair, an AOT compiler can fuse canon_lift and canon_lower into a single, efficient trampoline.

Is there a discussion of the validation expectations of such efficient trampoline optimizations? I'd assume you would still need to run the validation passes associated with a lift on a UTF-8 string to prevent issues like overlong encoding being overlooked.

My goal in the end is to make sure I'm not doing that work twice. If there are strong guarantees clearly described about what validation is done on strings I can skip doing that work or conversely make sure it is done.

Hi, great questions!

1. Yes. When a string is lowered into your component, the wasm running inside that component can rely on the bytes being valid in the encoding your component specified in its canonopt, which, to your first point, means they are reliably in minimal form. And for lifting, invalid encodings are precisely defined to produce a fatal wasm trap that halts execution (so there's no question of what the lowering side receives in case of invalid input).
2. We don't have a centralized test suite yet, but the intention is to have one in this repo, analogous to what's in the core wasm spec/test directory. Work has started in #192. There's also a bunch of unit tests in wasmtime that we'll want to merge into this repo.
3. Good point. The expectation is that in this fused trampoline, there is a single-pass loop that both validates and copies. Because import calls are disallowed during this operation (and there are no observable side effects other than through import calls) and memory is not considered observable after a trap (due to the lockdown-on-trap rules), the loop can perform the writes and checks in any order, which should enable efficient vectorization of the loop.
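As a rough illustration of what a single-pass "validate and copy" loop could look like, here is a sketch in Python. It is not the CanonicalABI definition or any runtime's implementation: `Trap`, `validate_and_copy`, and `_second_byte_range` are hypothetical names, the byte ranges follow the Unicode standard's table of well-formed UTF-8 sequences, and a real trampoline would operate on linear memory and likely be vectorized.

```python
class Trap(Exception):
    """Hypothetical stand-in for a wasm trap."""


def _second_byte_range(lead: int):
    # The second byte is restricted for some lead bytes; this is what
    # rules out overlong forms, surrogates, and values above U+10FFFF.
    if lead == 0xE0:
        return 0xA0, 0xBF  # excludes overlong 3-byte forms
    if lead == 0xED:
        return 0x80, 0x9F  # excludes surrogates U+D800..U+DFFF
    if lead == 0xF0:
        return 0x90, 0xBF  # excludes overlong 4-byte forms
    if lead == 0xF4:
        return 0x80, 0x8F  # excludes code points above U+10FFFF
    return 0x80, 0xBF


def validate_and_copy(src: bytes, dst: bytearray, dst_offset: int = 0) -> int:
    """Copy src into dst while validating it as UTF-8 in the same pass.

    Raises Trap on the first ill-formed sequence. dst may already hold a
    partial copy at that point, mirroring the idea that memory is not
    observable after a trap.
    """
    n = len(src)
    if dst_offset < 0 or dst_offset + n > len(dst):
        raise Trap("destination out of bounds")
    i = 0
    while i < n:
        lead = src[i]
        if lead <= 0x7F:
            length = 1
        elif 0xC2 <= lead <= 0xDF:  # 0xC0 and 0xC1 would be overlong
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF4:
            length = 4
        else:
            raise Trap(f"invalid lead byte at offset {i}")
        if i + length > n:
            raise Trap(f"truncated sequence at offset {i}")
        lo, hi = _second_byte_range(lead)
        for k in range(1, length):
            if not lo <= src[i + k] <= hi:
                raise Trap(f"invalid continuation byte at offset {i + k}")
            lo, hi = 0x80, 0xBF  # later bytes use the generic range
        # Write the validated sequence; equal-length slice assignment
        # overwrites in place without resizing dst.
        dst[dst_offset + i:dst_offset + i + length] = src[i:i + length]
        i += length
    return n
```

The checks and writes here are interleaved per sequence, but since nothing can observe `dst` between them, an implementation would be free to reorder or batch them.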

So given all that, core wasm running in the component shouldn't need to validate incoming strings.

WebAssembly / component-model

UTF-8 Invalid sequences for component model strings #224