fzyzcjy / flutter_rust_bridge

Flutter/Dart <-> Rust binding generator, feature-rich, but seamless and simple.
https://fzyzcjy.github.io/flutter_rust_bridge/
MIT License
3.61k stars 254 forks source link

wasm 'RuntimeError: Atomics.wait cannot be called in this context' #1910

Open aran opened 2 weeks ago

aran commented 2 weeks ago

Describe the bug

I have a rust crate "other" and a type within it "LocalEditor". In my flutter_rust_bridge target crate, I have use other::LocalEditor. In the flutter_rust_bridge crate, I have a method pub fn uses_local_editor(editor: &mut<LocalEditor>) -> Option<Vec<u8>>{...}

In dart I have an async function 'doesStuff' and I call doesStuff() with no await so it runs independently in the main event loop. It parses some data in Rust and eventually calls notifyListeners() if applicable.

doesStuff has: final value = await usesLocalEditor(editor: anEditor);

I sometimes get, but it's not reliablie or frequent, and I can't repro:

══╡ EXCEPTION CAUGHT BY MAIN ZONE ╞═════════════════════════════════════════════════════════════════
The following NativeError object was thrown:
  RuntimeError: Atomics.wait cannot be called in this context

When the exception was thrown, this was the stack:
pkg/rust_lib.js 323:10                                               rust_arc_increment_strong_count_RustOpaque_flutter_rust_bridgefor_generatedrust_asyncRwLockLocalEditor
packages/stuff/src/rust/frb_generated.web.dart 634:12                rust_arc_increment_strong_count_RustOpaque_flutter_rust_bridgefor_generatedrust_asyncRwLockLocalEditor
packages/flutter_rust_bridge/src/rust_arc/_common.dart 32:38         clone
packages/flutter_rust_bridge/src/misc/rust_opaque.dart 51:52         cstEncode
packages/flutter_rust_bridge/src/misc/rust_opaque.dart 57:34         sseEncode
packages/stuff/src/rust/frb_generated.dart 1769:26                   sse_encode_Auto_RefMut_RustOpaque_flutter_rust_bridgefor_generatedrust_asyncRwLockLocalEditor
packages/stuff/src/rust/frb_generated.dart 372:9                     <fn>
<snip>

This is on up-to-date Chrome with up-to-date FRB. No reliable repro, it doesn't happen every time, but I wanted to report in case the stack trace / error message are something you're familiar with.

Steps to reproduce

N/A

Logs

N/A

Expected behavior

No response

Generated binding code

No response

OS

No response

Version of flutter_rust_bridge_codegen

No response

Flutter info

No response

Version of clang++

No response

Additional context

No response

fzyzcjy commented 2 weeks ago

Hmm, I quickly searched it and see https://github.com/WebAssembly/threads/issues/106.

Quick guess (need experiments to verify!): Is it because that, rust_arc_increment_strong_count (calls Arc to increment count) once in a while needs to do Atomics.wait (because of concurrency?), and if that is done on main thread, webassembly is unhappy about that.

Maybe we can firstly reproduce this by creating a minimal Rust code that calls Atomics.wait deliberately, and tries to call it on Dart main thread and see what is going on.

aran commented 2 weeks ago

I confirmed that if I convert my data structure that is involved in this from RustAutoOpaque to translatable, the issue no longer occurs (or maybe occurs much more rarely). When it is RustAutoOpaque, possibly because of nondeterminism in my app's behavior, it shows up every few refreshes.

From reading that thread and a few others around it, it's not safe for any FRB code to call any Atomics.wait-using code, unless there's a way to mark some FRB code as "only off main thread" and enforce that. Other ways would be to explicitly drop multithreading and +atomics, or take out '&var' support and insist on only pure owned data transfers, &mut, or pure copies, but no locking until the Rust atomics story is cleared up.

fzyzcjy commented 2 weeks ago

it's not safe for any FRB code to call any Atomics.wait-using code, unless there's a way to mark some FRB code as "only off main thread" and enforce that

Yes I think so, due to limitations of web assembly. That limitations holds only for Web, and non-Web platforms are always happy though.

unless there's a way to mark some FRB code as "only off main thread" and enforce that

Firstly, for user code, as long as you do not use sync mode, i.e. use the default mode, I guess you are safe. This is because the default mode uses a thread pool, thus no main thread is blocked.

However, for this specific bug report, it is because of rust_arc_increment_strong_count (which calls Arc::increment_strong_count, which seems to be called in main thread. Then maybe we have some choices to fix it:

  1. When in web, make the real increment-strong-count happen in non-main thread. For example, fn rust_arc_increment_strong_count() { execute_in_thread_pool(|| Arc::increment_strong_count()) }.
  2. Does there exist an Arc that does not use Atomics.wait in Web? I think there can be one, since e.g. it can choose to spin instead of calling Atomics.wait. I guess we can ask this on Rust forum etc.

Feel free to PR, alternatively I will try to squeeze out time working on it!

aran commented 2 weeks ago

A couple other notes, without solving anything:

aran commented 2 weeks ago

Repro: https://github.com/aran/frb-exc-repro/blob/mut/lib/main.dart

aran commented 2 weeks ago

After some more thought, it's clearer now: When targeting wasm there is no safe way to use standard concurrency primitives that ultimately depend on Atomics.wait.

In particular any frb(sync) function cannot wait on any lock/mutex because all lock/mutex code will ultimately need either Atomics.wait or an async call out. Any async function has to cope with being async but not having access to concurrency control primitives. The current behavior of automatically wrapping in an RwLock doesn't work for several cases.

Assuming we can eliminate the automatic RwLock wrapper, you can safely use frb(sync) with:

  1. types that are translatable types all the way down and contain no synchronized data.
  2. translatable types containing synchronized data as long the sync function doesn't use the synchronized data
  3. In both cases reading is always safe if the data is globally immutable. Reading and writing would also both be safe if the data never gets sent to any other thread / web worker. In particular, it's safe if either all calls touching the data are sync, or if all async calls are restricted in their execution to the main thread. This main-thread-async could be supported by another frb annotation.

It can never be made safe to mix frb(sync) with async methods accessing the same data from the Rust thread pool.

Now for async with &mut or owned mut- the story there is even harder. To write safe async code that uses &mut references to data, you have to guarantee at the Flutter layer that it's a globally unique reference, no matter which thread. That's very hard to guarantee at the Flutter layer without Rust's type system to help out. There might some hacks to help like a runtime ownership tracking API at the Flutter level to catch issues, or a main-thread-only busywait implementation, because it's hard to stop flutter code from calling asyncMutableRustAPI(data); asyncMutableRustAPI(data);.

Finally, auto-wrapping in an RwLock only works for data where all accesses are async and sent to the web worker thread pool.

I could see a few different ways of supporting this from the FRB layer. One way would be allow users finer-grained control of the RustAutoOpaque wrapping to allow avoiding the automatic RwLock. Another helpful feature would be allow running an async call on the main thread event loop rather than the thread pool. Then for some data that had a frb(sync) call with an &mut on some data, you could allow other async methods that take & references to it to run on the main thread event loop, which would still be safe.

aran commented 2 weeks ago

With no further changes to FRB, the simplest way to make Rust mutable data safe might be to choose either async or sync globally. Either run everything in the thread pool where RwLock works, or run everything sync where Atomic.wait will never be needed.

fzyzcjy commented 2 weeks ago

Another helpful feature would be allow running an async call on the main thread event loop rather than the thread pool.

Do not remember very clearly, but in web it seems we use the main thread and web primitives for async, so it is not in a thread pool. It is the non-sync non-async fn that uses thread pool IMHO.

@fzyzcjy When in web, make the real increment-strong-count happen in non-main thread. For example, fn rust_arc_increment_strong_count() { execute_in_thread_pool(|| Arc::increment_strong_count()) }.

Elaborate this a little bit: We can convert arbitrary main-thread code into non-main-thread, by letting main thread runs function body in non-main-thread, and main thread just do busy wait (if other waiting primitive is not allowed). This may not be harmful (except for overhead of changing thread), because even without this change, the main thread will be blocked by exactly the same amount of time.

aran commented 2 weeks ago

Apologies for the lack of clarity - in all of the above, when talking about async functions, I only meant non-async Rust functions that are not labeled frb(sync) and are compiled to async Dart functions. For these functions, it would be useful to have options to instruct FRB to avoid wrapping data in locks and to send these to the main thread event loop rather than thread pool.

I'm not sure of the right tradeoffs between main thread busy-waits and other approaches. At first glance it would seem unfortunate to be forced into main thread busy-waits because it could result in surprising performance behavior.

fzyzcjy commented 1 week ago

More information why increment_strong_count utilizes Atomics.wait: It seems that you are using the quickstart mode. Try to set full_dep: true and it will use the real std arc which should not use Atomics IMHO.

fzyzcjy commented 1 week ago

Btw, as is mentioned in your another post, another simple way to avoid this issue may be to use async fn (i.e. Rust async functions) and #[frb(sync)] fn (i.e. Dart and Rust sync functions), which in Web only uses the main thread (while uses multi thread in native.

aran commented 1 week ago

Up to you how you want to track full solutions. For my case I am working around issues by exclusively using frb(sync) and I've turned on full_dep although I'm not sure I understand the full implications of that or why there are different modes with different correctness properties.