Open xdlin opened 1 year ago
Thanks for the detailed report! I've tried to create a test case to reproduce it myself, but so far it's not failing for me. See: https://github.com/fiberplane/fp-bindgen/pull/195
Do you maybe see if I missed some detail from your report?
Oh, and maybe it matters which tokio features you have enabled. I tried with both rt and rt-multi-thread thus far...
Hi @arendjr , thanks a lot for your quick response!
After some further investigation, I finally found the core difference between your test case in #195 and my code:
tokio features: I enabled full, which includes rt-multi-thread. After setting the features to exactly match yours, the result remained the same.
#[tokio::test]: running the same code under #[tokio::main] instead reproduces the issue.
The code for fp-bindgen/examples/example-rust-wasmer2-runtime/src/main.rs:
mod wasi_spec;
use anyhow::Result;
use crate::wasi_spec::bindings::Runtime;
use std::sync::Mutex;
pub static GLOBAL_STATE: Mutex<u32> = Mutex::new(0);
const WASM_BYTES: &'static [u8] =
    include_bytes!("../../example-plugin/target/wasm32-wasi/debug/example_plugin.wasm");
fn new_runtime() -> Result<Runtime> {
    let rt = Runtime::new(WASM_BYTES)?;
    rt.init()?;
    Ok(rt)
}
#[tokio::main]
async fn main() -> Result<()> {
    println!("Hello, world!");
    let rt = new_runtime()?;
    let response = rt.make_parallel_requests().await?;
    // assert_eq!(response, r#"{"status":"confirmed"}"#.to_string());
    assert_eq!(response, response);
    Ok(())
}
So I guess the issue happens in some Tokio-related init/uninit steps? Would you please help to verify it? (Maybe I made some silly mistake here since I'm a Rust newbie; if so, please point it out.)
I think it has something to do with multithreading:
By setting #[tokio::test(flavor = "multi_thread")], I can reproduce this panic in the test.
If I set main to #[tokio::main(flavor = "current_thread")], it also works as expected.
Hopefully that narrows down the issue a little bit.
Thank you so much! This points into a direction I was already having suspicions about. I have not yet fully confirmed this, but I highly suspect the issue is that callbacks from Tokio can come from any thread, whereas the code running inside WebAssembly assumes a single-threaded context. That assumption is violated if you call back into the sandbox from multiple threads simultaneously, which may happen if parallel requests are triggered inside the sandbox and the callbacks to those can come back from multiple threads at once. At least, that would explain why Tokio's "current_thread" flavor has no problems and why I hadn't run into this before without parallel requests.
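To make the single-threaded assumption described above concrete, here is a minimal sketch using only the Rust standard library. `ThreadAffinity` and `callback_allowed_from_other_thread` are hypothetical names invented for illustration; nothing like this exists in fp-bindgen or Wasmer as far as this thread shows. The guard records the thread that created the sandbox and rejects callbacks arriving on any other thread, which is the situation a multi-threaded Tokio runtime can produce.

```rust
use std::thread::{self, ThreadId};

/// Hypothetical guard (not part of fp-bindgen or Wasmer): remembers which
/// thread created the sandbox so that cross-thread callbacks can be detected.
struct ThreadAffinity {
    owner: ThreadId,
}

impl ThreadAffinity {
    fn new() -> Self {
        ThreadAffinity { owner: thread::current().id() }
    }

    /// Returns Ok(()) only when called on the thread that created the guard.
    fn check(&self) -> Result<(), String> {
        let caller = thread::current().id();
        if caller == self.owner {
            Ok(())
        } else {
            Err(format!(
                "callback arrived on {caller:?}, but the sandbox belongs to {:?}",
                self.owner
            ))
        }
    }
}

/// Simulates a callback delivered from a different thread, as a
/// multi-threaded runtime might do, and reports whether the guard accepts it.
fn callback_allowed_from_other_thread() -> bool {
    let guard = ThreadAffinity::new();
    thread::scope(|s| {
        s.spawn(|| guard.check().is_ok())
            .join()
            .expect("checker thread panicked")
    })
}

fn main() {
    let guard = ThreadAffinity::new();
    println!("same thread accepted: {}", guard.check().is_ok());
    println!(
        "other thread accepted: {}",
        callback_allowed_from_other_thread()
    );
}
```

A check like this would not fix the problem; it would only turn a hard-to-diagnose panic into an explicit error at the point where the assumption is broken.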
I had always made the assumption that Wasmer would be responsible for synchronizing callbacks back into the sandbox, but apparently it doesn't do this. I suspect the migration to Wasmer 3 might resolve this, since it changes the Send + Sync properties of the runtime, which would probably force us to solve this properly. But I have to look into it a bit more to confirm that's really the case.
At least we have a work-around now, and I have some more pointers to attempt a real fix. Thanks again!
Thanks a lot for your quick response again. I'm glad the info I gave was of some help. I know you are busy migrating to Wasmer 3 (that's great news), and I'm looking forward to seeing that happen in the near future.
For now I can use the work-around and continue my development. Looking forward to hearing your good news soon!
I suspect the migration to Wasmer 3 might resolve this, since it changes the Send + Sync properties of the runtime
Hi @arendjr, I tried to fix this issue with a local modification of fp-bindgen, and I upgraded to Wasmer 4 (based on the work Roy Jacobs roy.jacobs@gmail.com did on the wasmer3 branch). With a quick and dirty hack, I made it work just as it does with Wasmer 2:
One issue with Wasmer 4 is that FunctionEnvMut is not Send, nor is its underlying Store, so it's hard to pass it to tokio::spawn, which is how async host functions are implemented. I had to resort to unsafe to achieve that. After doing more testing and cleaning up the code, I'll submit an MR here later.
But that's not my key point here. My question is: even if we make it work with Wasmer 4, it seems we still have the async function panic issue. What further action should I take: try to improve the Wasmer 4 related changes, or try another direction?
By 'another direction', I mean a wild guess: the file fp-bindgen-support/src/guest/async/task.rs says it's a modified version of https://github.com/rustwasm/wasm-bindgen/blob/master/crates/futures/src/task/singlethread.rs. Looking through the wasm-bindgen repo, there is also a multi-threaded version of the task: https://github.com/rustwasm/wasm-bindgen/blob/main/crates/futures/src/task/multithread.rs. Would migrating to this multithread task be a possible solution for this issue?
One issue with Wasmer4 is FunctionEnvMut is not Send
Just FYI: An issue on wasmer: https://github.com/wasmerio/wasmer/issues/3482
And a discussion on SO: https://stackoverflow.com/questions/75753403/wasmer-host-functions-accessing-memory
I may need to spend some more time to look deeper into this issue, but from my cursory understanding, when you say:
One issue with Wasmer4 is FunctionEnvMut is not Send, nor is its underlying Store, so it's hard to pass it to tokio::spawn which is how async host function is implemented, so I have to involve unsafe to achieve that, after I doing more test and cleaning up the code, I'll submit a MR here later.
I would suspect the use of unsafe here to be the reason why the crash can be reproduced with Wasmer 4. I think in Wasmer 2 it was a mistake for the Store to be Send, because it made us think it was safe to invoke functions from multiple threads, which was never the case. Now they've made the API more strict, which correctly reveals that we shouldn't. We can work around it with unsafe, but that just puts the blame on us for triggering the panic.
What I think is the right approach here, is for us to implement some mechanism to run the WASM environment in a single thread (may be the one from which it is created, or a dedicated one) and use a channel to make sure callbacks are all processed by that same thread.
I haven't yet looked at task/multithread.rs to know if it might do something like this for us.
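The single-thread-plus-channel mechanism proposed above could look roughly like the following sketch. It uses only the Rust standard library rather than Tokio or Wasmer, and every name in it (`Sandbox`, `Callback`, `spawn_sandbox_thread`, `run_demo`) is a hypothetical stand-in, not fp-bindgen's actual code. The `Sandbox` holds an `Rc`, so like Wasmer's newer `Store` it is not `Send`; it is created on a dedicated owner thread and never leaves it, while other threads hand their callbacks over through a channel.

```rust
use std::cell::RefCell;
use std::rc::Rc;
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a WASM instance. The Rc makes it !Send,
/// so the compiler will not let it move to another thread.
struct Sandbox {
    delivered: Rc<RefCell<Vec<String>>>,
}

impl Sandbox {
    fn new() -> Self {
        Sandbox { delivered: Rc::new(RefCell::new(Vec::new())) }
    }

    /// Pretends to pass an async result into the guest; returns how many
    /// results have been delivered so far.
    fn deliver(&self, payload: String) -> usize {
        let mut delivered = self.delivered.borrow_mut();
        delivered.push(payload);
        delivered.len()
    }
}

/// Message sent to the owner thread: the payload plus a reply channel.
struct Callback {
    payload: String,
    reply: mpsc::Sender<usize>,
}

/// Spawns the owner thread. The Sandbox is created inside it, so all
/// callbacks are processed by that one thread, in arrival order.
fn spawn_sandbox_thread() -> (mpsc::Sender<Callback>, thread::JoinHandle<usize>) {
    let (tx, rx) = mpsc::channel::<Callback>();
    let handle = thread::spawn(move || {
        let sandbox = Sandbox::new();
        let mut handled = 0;
        // The loop ends once every Sender has been dropped.
        for cb in rx {
            handled = sandbox.deliver(cb.payload);
            let _ = cb.reply.send(handled);
        }
        handled
    });
    (tx, handle)
}

/// Simulates `workers` threads each delivering one callback; returns the
/// number of callbacks the owner thread processed.
fn run_demo(workers: usize) -> usize {
    let (tx, owner) = spawn_sandbox_thread();
    let producers: Vec<_> = (0..workers)
        .map(|i| {
            let tx = tx.clone();
            thread::spawn(move || {
                let (reply_tx, reply_rx) = mpsc::channel();
                tx.send(Callback { payload: format!("response {i}"), reply: reply_tx })
                    .expect("owner thread should be alive");
                reply_rx.recv().expect("owner thread should reply")
            })
        })
        .collect();
    drop(tx); // keep only the producers' clones alive
    for producer in producers {
        producer.join().expect("producer thread panicked");
    }
    owner.join().expect("owner thread panicked")
}

fn main() {
    println!("callbacks processed on the owner thread: {}", run_demo(4));
}
```

Serializing everything through one thread trades throughput for safety, which matches the performance concern raised later in the thread. In an async host, the blocking `recv` calls would presumably be replaced by an async channel.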
Hi @arendjr, I followed your advice, did some research, and came up with a primitive fix: it does have a performance issue, but at least I get correct results in a multi-threaded Tokio runtime without any panic.
Would you please help to review this change? Feel free to copy parts of it if some pieces are useful but you are unable to merge it for whatever reason. If you believe that's the right direction, I can put more effort into improving it.
Sorry, I was on holiday. I no longer work at Fiberplane, so I can't really help with this anymore.
Background
I'd like to build a service with Wasm and fp-bindgen, with both import and export functions being async. To make guest functions run concurrently, I'd like to use a function like tokio::join or futures::future::join to run the host's async functions concurrently in the guest, but I run into a panic consistently (about 80% of the time).
Here is the panic message:
The related source code
Question
Is it possible to run the host's async functions concurrently within a guest's async function with join?
Code details
My fp-bindgen protocol:
Guest code:
Host code:
Implementation of exported functions: just two empty async functions
main function: