Users of UTF-8 support - Githubissues

eqrion commented 2 months ago

I'm wondering if anyone out there is using (or plans to use) the text-encoder and text-decoder interfaces that this proposal has. I know of a user for the main 'js-string' interface (j2wasm), but not the rest.

If we try to go to phase-4, I think we would need to have a toolchain that's actually testing out the UTF-8 support to have any confidence it's the right interface.

cc @wingo @vouillon who have commented about UTF-8/WTF-8 support.

sjrd commented 2 months ago

We have a branch where we use the base proposal for Scala.js (with significant performance improvements compared to user-space helpers, btw), but no UTF-8 for us, I'm afraid.

Pauan commented 2 months ago

Rust would benefit from having direct UTF-8 <=> String functions in Wasm.

Right now we have to import some JS glue code:

const cachedTextDecoder = (typeof TextDecoder !== 'undefined'
    ? new TextDecoder('utf-8', { ignoreBOM: true, fatal: true })
    : { decode: () => { throw Error('TextDecoder not available') } } );

if (typeof TextDecoder !== 'undefined') { cachedTextDecoder.decode(); };

let cachedUint8Memory0 = null;

function getUint8Memory0() {
    if (cachedUint8Memory0 === null || cachedUint8Memory0.byteLength === 0) {
        cachedUint8Memory0 = new Uint8Array(wasm.memory.buffer);
    }
    return cachedUint8Memory0;
}

function getStringFromWasm0(ptr, len) {
    ptr = ptr >>> 0;
    return cachedTextDecoder.decode(getUint8Memory0().subarray(ptr, ptr + len));
}

let WASM_VECTOR_LEN = 0;

const cachedTextEncoder = (typeof TextEncoder !== 'undefined'
    ? new TextEncoder('utf-8')
    : { encode: () => { throw Error('TextEncoder not available') } } );

const encodeString = (typeof cachedTextEncoder.encodeInto === 'function'
    ? function (arg, view) {
          return cachedTextEncoder.encodeInto(arg, view);
      }
    : function (arg, view) {
          const buf = cachedTextEncoder.encode(arg);
          view.set(buf);
          return {
              read: arg.length,
              written: buf.length
          };
      });

function passStringToWasm0(arg, malloc, realloc) {
    if (realloc === undefined) {
        const buf = cachedTextEncoder.encode(arg);
        const ptr = malloc(buf.length, 1) >>> 0;
        getUint8Memory0().subarray(ptr, ptr + buf.length).set(buf);
        WASM_VECTOR_LEN = buf.length;
        return ptr;
    }

    let len = arg.length;
    let ptr = malloc(len, 1) >>> 0;

    const mem = getUint8Memory0();

    let offset = 0;

    for (; offset < len; offset++) {
        const code = arg.charCodeAt(offset);
        if (code > 0x7F) break;
        mem[ptr + offset] = code;
    }

    if (offset !== len) {
        if (offset !== 0) {
            arg = arg.slice(offset);
        }
        ptr = realloc(ptr, len, len = offset + arg.length * 3, 1) >>> 0;
        const view = getUint8Memory0().subarray(ptr + offset, ptr + len);
        const ret = encodeString(arg, view);

        offset += ret.written;
    }

    WASM_VECTOR_LEN = offset;
    return ptr;
}

This is quite painful: this glue code must be inserted in every Rust Wasm program that uses strings (which is most of them).
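For readers wondering about the `arg.length * 3` in the `realloc` call above: it is a worst-case size, since one UTF-16 code unit never needs more than three UTF-8 bytes (a surrogate pair is two code units and four bytes). A minimal sketch that checks this bound with the standard `TextEncoder`; the helper name `utf8WorstCase` is made up for illustration and is not part of the glue code:

```javascript
// Hypothetical helper (not part of the generated glue): upper bound on the
// number of UTF-8 bytes needed for a JS string of `str.length` code units.
function utf8WorstCase(str) {
    return str.length * 3;
}

const encoder = new TextEncoder();

// ASCII, 2-byte, 3-byte, surrogate-pair, and lone-surrogate inputs.
const samples = ["hello", "héllo", "日本語", "😀😀", "a\uD800b"];

for (const s of samples) {
    const actual = encoder.encode(s).length;
    if (actual > utf8WorstCase(s)) {
        throw new Error("bound violated for " + JSON.stringify(s));
    }
}
```

The bound is tight for text made only of 3-byte characters such as CJK, and very loose for ASCII, which is presumably why the glue code handles the ASCII prefix first and only over-allocates for the remainder.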

Rust would also benefit from being able to create JS String literals in Wasm.

Rather than paying the O(n) cost at runtime of converting a Rust UTF-8 string into a JS String, we could just create the JS String at compile time.

We are currently using some hacks like manual string interning. But that is still slower than using compile-time string constants.
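To illustrate the kind of interning hack meant here, a rough sketch follows. It is an assumption-laden illustration, not how any particular toolchain actually implements interning: the cache, its `ptr:len` key, and the name `getInternedString` are all hypothetical, and it assumes the bytes at that address never change (for example, static string data).

```javascript
// Hypothetical interning cache: maps "ptr:len" to an already-decoded JS string.
// Assumes the bytes at [ptr, ptr + len) are immutable (e.g. static string data).
const internCache = new Map();
const internDecoder = new TextDecoder("utf-8", { fatal: true });

function getInternedString(memory, ptr, len) {
    const key = ptr + ":" + len;
    let str = internCache.get(key);
    if (str === undefined) {
        // First use pays the O(n) decode; later uses are only a map lookup.
        str = internDecoder.decode(new Uint8Array(memory.buffer, ptr, len));
        internCache.set(key, str);
    }
    return str;
}

// Stand-in for a module's linear memory, holding "click" at offset 16.
const memory = new WebAssembly.Memory({ initial: 1 });
new Uint8Array(memory.buffer).set(new TextEncoder().encode("click"), 16);

const first = getInternedString(memory, 16, 5);  // decodes and caches
const second = getInternedString(memory, 16, 5); // served from the cache
```

Even in this sketch every use still costs a key construction and a map lookup, which is the overhead a compile-time string constant would avoid.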

tlively commented 2 months ago

@Pauan, the interfaces in question convert between strings and WasmGC arrays of i8, not memory. Would Rust still be able to use them?

Pauan commented 2 months ago

@tlively Yes, it should be possible. Rust would have to first copy the bytes from linear memory into a GC array, and then do the conversion.

But that can still potentially be faster than using the glue code, depending on how fast the memcpy implementation is, and also depending on how optimized the decodeStringFromUTF8Array and encodeStringToUTF8Array functions are.

Of course benchmarks will have to be done to verify any performance gains.

And of course it would be even nicer if the text encoder/decoder APIs accepted a linear memory and indices instead of a GC array, because that would avoid the memcpy and GC allocation.

Pauan commented 2 months ago

Or perhaps there could be some way to convert a linear memory + indices into a GC array?

Similar to how the glue code currently uses Uint8Array.subarray to create a view of the linear memory.

But that would belong in the GC proposal, not this proposal.
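The view-versus-copy distinction can be sketched in plain JS. This is only an analogy: a `Uint8Array` copy stands in for the "copy into a GC array" step, and `subarray` stands in for a hypothetical API that reads linear memory directly. The memory contents and offsets are made up for illustration.

```javascript
// Stand-in for a Wasm module's linear memory (one 64 KiB page).
const memory = new WebAssembly.Memory({ initial: 1 });
const bytes = new Uint8Array(memory.buffer);
bytes.set(new TextEncoder().encode("héllo"), 8); // 6 bytes of UTF-8 at offset 8

const view = bytes.subarray(8, 14); // no copy: shares memory.buffer
const copy = bytes.slice(8, 14);    // copies 6 bytes into a new buffer

const decoder = new TextDecoder("utf-8", { fatal: true });
const fromView = decoder.decode(view);
const fromCopy = decoder.decode(copy);

// Writes to linear memory are visible through the view but not the copy.
bytes[8] = 0x48; // 'H'
const viewAfter = decoder.decode(view);
const copyAfter = decoder.decode(copy);
```

Both paths decode the same string initially; the difference is the extra allocation and copy on the `slice` path, which is the cost being discussed for linear-memory languages.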

eqrion commented 1 month ago

> @tlively Yes, it should be possible. Rust would have to first copy the bytes from linear memory into a GC array, and then do the conversion.
>
> But that can still potentially be faster than using the glue code, depending on how fast the memcpy implementation is, and also depending on how optimized the decodeStringFromUTF8Array and encodeStringToUTF8Array functions are.
>
> Of course benchmarks will have to be done to verify any performance gains.
>
> And of course it would be even nicer if the text encoder/decoder APIs accepted a linear memory and indices instead of a GC array, because that would avoid the memcpy and GC allocation.

My guess is that Rust/C++ would not benefit much from these builtins unless we added linear memory support to them. That seems like another argument for deferring them, as they're missing some useful features.

tlively commented 1 month ago

In particular, we would at the very least want to add instructions for copying between arrays and memories if we wanted to use the APIs as-is from linear memory languages. Using the APIs with GC arrays and copying one byte at a time to/from memory is probably worse than what linear memory languages can do today.

vouillon commented 1 month ago

The wasm_of_ocaml compiler is using the text-encoder and text-decoder interfaces when available.

We are using anyref internally for JavaScript values, so we have to add conversions to/from externref. Otherwise, the API is just right for us. https://github.com/ocaml-wasm/wasm_of_ocaml/blob/83b7c68ed720ab790ecf3aa7a674e67d6f10a3e2/runtime/wasm/jsstring.wat#L58-L80

This gives us a 15% performance improvement on a significant benchmark compared to the fallback implementation which uses a buffer in the linear memory.

eqrion commented 1 month ago

I confirmed from looking at the Hoot wasm compiler source that they are not using string builtins.

When looking at implementing it in SpiderMonkey, we couldn't find a good way to implement 'measureStringAsUTF8' without significantly refactoring our UTF-8 support. That, along with a couple of other smaller issues, has meant we've not been able to implement the UTF-8 interfaces yet.
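For context on what a measuring step involves, here is a user-space sketch that counts the UTF-8 bytes a JS string would need without producing the encoded output. The function name `measureUtf8` is hypothetical, and treating lone surrogates as U+FFFD (3 bytes) is an assumption chosen to match `TextEncoder`; the thread does not state what the proposal's 'measureStringAsUTF8' does in that case.

```javascript
// Hypothetical user-space "measure" step: count the UTF-8 bytes a JS string
// would need, without allocating the encoded output.
// Assumption: lone surrogates count as U+FFFD (3 bytes), matching TextEncoder.
function measureUtf8(str) {
    let bytes = 0;
    for (let i = 0; i < str.length; i++) {
        const unit = str.charCodeAt(i);
        if (unit < 0x80) {
            bytes += 1;
        } else if (unit < 0x800) {
            bytes += 2;
        } else if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
            const next = str.charCodeAt(i + 1);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                bytes += 4; // valid surrogate pair: one 4-byte sequence
                i++;        // skip the low surrogate
            } else {
                bytes += 3; // lone high surrogate, counted as U+FFFD
            }
        } else {
            bytes += 3; // BMP code point >= U+0800, or a lone surrogate
        }
    }
    return bytes;
}

// Cross-check against the standard encoder on a few inputs.
const checkEncoder = new TextEncoder();
const inputs = ["", "hello", "héllo", "日本語", "😀", "a\uD800b", "\uDC00", "x\uD83D"];
const mismatches = inputs.filter(
    (s) => measureUtf8(s) !== checkEncoder.encode(s).length
);
```

A walk like this needs access to the string's code units, which may explain why it is awkward for an engine whose strings are not stored in a single flat representation.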

At the same time, this proposal has been sitting around for a long time with some users who would really urgently like the 'js-string' interface. I'm thinking that (unfortunately) the best thing to do right now would be to split the UTF-8 interfaces out into a 'post-MVP' followup proposal to this. This would help unblock shipping the core interface. After that we can come back to UTF-8, and possibly linear memory support too.

I plan to update the explainer in the next week or so.

vouillon commented 2 weeks ago

Note that wasm_of_ocaml is only using decodeStringFromUTF8Array and encodeStringToUTF8Array.

WebAssembly / js-string-builtins

Users of UTF-8 support #34