emscripten-core / emscripten

Emscripten: An LLVM-to-WebAssembly Compiler

Problems with garbled strings when using embind. "Invalid UTF-8 leading byte..." #18739

Open ognjentodic opened 1 year ago

ognjentodic commented 1 year ago

Version of emscripten/emsdk:

emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.30 (cfe2bdfe2692457cb5f5770672f6e5ccb3ffc2f2)
clang version 16.0.0 (https://github.com/llvm/llvm-project 800f0f1546b2352ba42a4777149afb13cb874fcd)
Target: wasm32-unknown-emscripten
Thread model: posix

Full link command and output with -v appended:

"/Users/username/PROJECTS/emsdk/upstream/bin/wasm-ld" -o ./src/modules/node/ourlibrary-node.wasm lib/ourlibrary-bridge.o lib/ourlibrary.a -L/Users/username/PROJECTS/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten /Users/interrupt/username/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/crtbegin.o --whole-archive -lembind-rtti --no-whole-archive -lGL-mt -lal -lhtml5 -lstubs-debug -lnoexit -lc-mt-debug -ldlmalloc-mt -lcompiler_rt-mt -lc++-mt -lc++abi-debug-mt -lsockets-mt -lubsan_rt-mt -lsanitizer_common_rt-mt -mllvm -combiner-global-alias-analysis=false -mllvm -enable-emscripten-cxx-exceptions -mllvm -enable-emscripten-sjlj -mllvm -disable-lsr --import-undefined --import-memory --shared-memory --strip-debug --export-if-defined=main --export-if-defined=_emscripten_thread_init --export-if-defined=_emscripten_thread_exit --export-if-defined=_emscripten_thread_crashed --export-if-defined=_emscripten_tls_init --export-if-defined=pthread_self --export-if-defined=__start_em_asm --export-if-defined=__stop_em_asm --export-if-defined=__start_em_lib_deps --export-if-defined=__stop_em_lib_deps --export-if-defined=__start_em_js --export-if-defined=__stop_em_js --export-if-defined=__main_argc_argv --export-if-defined=fflush --export-if-defined=emscripten_stack_get_end --export-if-defined=emscripten_stack_get_free --export-if-defined=emscripten_stack_get_base --export-if-defined=emscripten_stack_get_current --export-if-defined=emscripten_stack_init --export-if-defined=stackSave --export-if-defined=stackRestore --export-if-defined=stackAlloc --export-if-defined=__wasm_call_ctors --export-if-defined=__errno_location --export-if-defined=emscripten_dispatch_to_thread_ --export-if-defined=_emscripten_thread_free_data --export-if-defined=emscripten_main_browser_thread_id --export-if-defined=emscripten_main_thread_process_queued_calls --export-if-defined=emscripten_run_in_main_runtime_thread_js --export-if-defined=emscripten_stack_set_limits 
--export-if-defined=getTempRet0 --export-if-defined=setTempRet0 --export-if-defined=__get_temp_ret --export-if-defined=__set_temp_ret --export-if-defined=memalign --export-if-defined=emscripten_builtin_memalign --export-if-defined=emscripten_builtin_malloc --export-if-defined=emscripten_builtin_free --export-if-defined=malloc --export-if-defined=free --export-if-defined=__cxa_is_pointer_type --export-if-defined=__cxa_can_catch --export-if-defined=setThrew --export-if-defined=__cxa_free_exception --export-if-defined=ntohs --export-if-defined=htons --export-if-defined=__dl_seterr --export-if-defined=saveSetjmp --export-table -z stack-size=65536 --initial-memory=314572800 --no-entry --max-memory=2147483648 --stack-first

We use embind to expose a C++ class to JS. When doing this, we run into the error "Invalid UTF-8 leading byte *** encountered when deserializing a UTF-8 string in wasm memory to a JS string!"

Here is a synopsis of the problem:

1) We have a class Result that has a method text().
2) We use emscripten::class_ to register this class.
3) We link a separately built wasm library that has a background thread.
4) In our top-level wasm, we register a callback (required by the library) that is called from a library background thread.
5) Our callback takes a pointer to an instance of our class as an argument, e.g. callback(Result *result).
6) Inside the callback, we pass the pointer argument to EM_ASM and then pass that pointer through postMessage.
7) We support both browser and node.js, and our main thread receives the message event (the issue can be replicated in both browser and node environments).
8) The event callback calls an embind custom constructor in our wasm that takes a uintptr_t.
9) We reinterpret_cast the uintptr_t to the embind-registered type and return the pointer.

void callback(const Result *result) {

//   when printing here, the text is not garbled
  EM_ASM({
    console.log("partial result on worker");
    var message = {};
    var mylib = {};
    message.mylib = mylib;
    message.mylib.cmd = "MyLib::PartialResult";
    message.mylib.result = $0;
    postMessage(message);
  }, result);
}

emscripten::class_<Result>("Result")
    .constructor()
    .constructor(&getResultFromPointer, allow_raw_pointers())
    .function("text", &Result::Text)
    ;

Result* getResultFromPointer(uintptr_t result) {
  Result* res = reinterpret_cast<Result *>(result);
  // cout << res->Text() << endl;
  return res;
}

The library actually has two different callbacks: one that is called frequently (every 200ms or so) and another that is called only once "per run". With the first one we can reliably reproduce the error: the text on the JS side contains a number of garbled characters. With the second one it happens less often, but it does happen.

If we pass only the text (i.e. not a custom Result class, but just a string), there are no problems.
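To make the "pass only the text" variant concrete, here is a minimal sketch in plain C++ (no Emscripten calls). It rests on an assumption that has not been confirmed: that the library may reuse or free the Result object once the callback returns, so a raw pointer handed to another thread could refer to stale memory by the time it is read. The Result class below is a hypothetical stand-in for ours, and PartialResultMessage, Overwrite and Snapshot are made-up names for illustration only.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for the Result class described in this issue.
class Result {
 public:
  explicit Result(std::string text) : text_(std::move(text)) {}
  const std::string& Text() const { return text_; }
  // Simulates the library reusing the object after the callback returns
  // (an assumption, not confirmed library behavior).
  void Overwrite(const std::string& text) { text_ = text; }

 private:
  std::string text_;
};

// Hypothetical message payload that owns its data instead of holding a pointer.
struct PartialResultMessage {
  std::string text;
};

// Copies the text while the Result is still known to be valid, i.e. inside
// the callback. The returned message does not depend on the Result's lifetime.
PartialResultMessage Snapshot(const Result* result) {
  assert(result != nullptr);
  return PartialResultMessage{result->Text()};
}
```

With this shape, later changes to the original object do not affect the message, which would be consistent with the string-only variant working for us. It does not prove that object lifetime is the cause of the garbled text.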

Originally we posted asynchronously to the main thread (via the MAIN_THREAD_ASYNC_EM_ASM macro), which hit the same problem. We then switched to posting to a separate worker via EM_ASM.

Some potentially relevant issues:
https://github.com/emscripten-core/emscripten/issues?q=Invalid+UTF-8+leading+byte+
https://github.com/emscripten-core/emscripten/issues/13194
https://github.com/emscripten-core/emscripten/issues/18081

We can try to create a small PoC that showcases the problem, but first we wanted to see if somebody might be able to provide insights based on the description above.

ognjentodic commented 1 year ago

I should also add that these callbacks need to execute quickly so that they do not block the background thread. That is why we first tried the async macro and then switched to posting to a separate worker.
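For illustration, the "return quickly from the callback" requirement could be met with a hand-off queue along these lines. This is a sketch in standard C++ only, not what our code currently does: TextQueue, Push and TryPop are hypothetical names, and it assumes that copying the text inside the callback is cheap enough for our call rate.

```cpp
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical hand-off queue. The producer (the library callback) pushes an
// owned copy of the text and returns immediately; a consumer drains it later.
class TextQueue {
 public:
  // Called from the background thread. Holds the lock only for the push.
  void Push(std::string text) {
    std::lock_guard<std::mutex> lock(mutex_);
    items_.push_back(std::move(text));
  }

  // Called from the consumer. Returns false if the queue is empty;
  // otherwise moves the oldest item into `out` and returns true.
  bool TryPop(std::string& out) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (items_.empty()) return false;
    out = std::move(items_.front());
    items_.pop_front();
    return true;
  }

 private:
  std::mutex mutex_;
  std::deque<std::string> items_;
};
```

Because the queue stores owned strings rather than pointers, the consumer never touches memory that belongs to the library, and the callback's cost is one string copy plus a short critical section.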