wasmer panic when await multiple async Host function with join

xdlin commented 1 year ago

Background I'd like to build a service with Wasm and fp-bindgen, with both import and export function as async functions. In order to make guest functions run concurrently, I'd like to involve some function like tokio::join or futures::future::join to run Host's async function concurrently in guest, but run with a panic consistantly (80%)

Here is the panic message:

thread 'tokio-runtime-worker' panicked at 'Runtime error: Cannot resolve async value: 

<some random error message>, 

/home/linxiangdong/.cargo/registry/src/index.crates.io-6f17d22bba15001f/fp-bindgen-support-3.0.0/src/wasmer2_host/runtime.rs:30:18

The related source code

    pub fn guest_resolve_async_value(&self, async_ptr: FatPtr, result_ptr: FatPtr) {
        unsafe {
            self.__fp_guest_resolve_async_value
                .get_unchecked()
                .call(async_ptr, result_ptr)
                .expect("Runtime error: Cannot resolve async value");   //panic here
        }
    }

Quetion Is it possible to run host's async functions concurrently within guest's async function with join?

Code details My fp-bindgen protocol:

// function from host
fp_import! {
    async fn read_func1();
    async fn read_func2();
}

// function exported from WASM to host
fp_export! {
    async fn get_feature_str();
}

Guest code:

use sample_bindings::*;
use futures::future;

#[fp_bindgen_macros::fp_export_impl(sample_bindings)]
async fn get_feature_str() {
    let p = future::join(read_func1(), read_func2());
    p.await;
}

Host code:

implementation of exported functions: just two empty async functions

async fn read_func1() {
}

async fn read_func2() {
}

main function:

mod spec;

use std::fs;
use crate::spec::bindings::Runtime;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let args: Vec<String> = std::env::args().collect();
    let wasm_buf = fs::read(&args[1])?;
    let rt = Runtime::new(wasm_buf)?;

    println!("get_feature_str");
    rt.get_feature_str().await?;
    println!("get_feature_str done");

    Ok(())
}

arendjr commented 1 year ago

Thanks for the detailed report! I've tried to create a test case to reproduce it myself, but so far it's not failing for me. See: https://github.com/fiberplane/fp-bindgen/pull/195

Do you maybe see if I missed some detail from your report?

arendjr commented 1 year ago

Oh, and maybe it matters which tokio features you have enabled. I tried with both rt and rt-multi-thread thus far...

xdlin commented 1 year ago

Hi @arendjr , thanks a lot for your quick response!

After some further investigation, I finally find out the core difference between your test case in #195 and my code:

It has nothing to do with tokio features: I enabled full which include the rt-multi-thread, after setting these features exactally following yours, the result remains the same
I finally managed to reproduce the bug with your test code: instread of calling it from #[tokio::test], put it in #[tokio::main] will reproduce this issue:

The code for fp-bindgen/examples/example-rust-wasmer2-runtime/src/main.rs

mod wasi_spec;
use anyhow::Result;
use crate::wasi_spec::bindings::Runtime;

use std::sync::Mutex;

pub static GLOBAL_STATE: Mutex<u32> = Mutex::new(0);

const WASM_BYTES: &'static [u8] =
    include_bytes!("../../example-plugin/target/wasm32-wasi/debug/example_plugin.wasm");

fn new_runtime() -> Result<Runtime> {
    let rt = Runtime::new(WASM_BYTES)?;
    rt.init()?;
    Ok(rt)
}

#[tokio::main]
async fn main() -> Result<()> {
    println!("Hello, world!");
    let rt = new_runtime()?;

    let response = rt.make_parallel_requests().await?;

    // assert_eq!(response, r#"{"status":"confirmed"}"#.to_string());
    assert_eq!(response, response);
    Ok(())
}

So I guess the issue happens with some Tokio related init/uninit steps? Woud you please help to verify it? (Maybe I made some silly mistake here since I'm a Rust newbie, please point it out if possible)

xdlin commented 1 year ago

I think it has something to do with multithreading:

By setting

#[tokio::test(flavor = "multi_thread")]

I can reproduce this panic with test

If I set main to

#[tokio::main(flavor = "current_thread")]

It also works as expected.

Hopefully that could narrow down the issue a little bit

arendjr commented 1 year ago

Thank you so much! This points into a direction I was already having suspicions about. I have not yet fully confirmed this, but I highly suspect the issue is that callbacks from Tokio can come from any thread, whereas the code running inside WebAssembly assumes a single-threaded context. That assumption is violated if you call back into the sandbox from multiple threads simultaneously, which may happen if parallel requests are triggered inside the sandbox and the callbacks to those can come back from multiple threads at once. At least, that would explain why Tokio's "current_thread" flavor has no problems and why I hadn't run into this before without parallel requests.

I had always made the assumption that Wasmer would be responsible for synchronizing callbacks back into the sandbox, but apparently it doesn't do this. I suspect the migration to Wasmer 3 might resolve this, since it changes the Send + Sync properties of the runtime, which would probably force us to solve this properly. But I have to look into it a bit more to confirm that's really the case.

At least we have a work-around now, and I have some more pointers to attempt a real fix. Thanks again!

xdlin commented 1 year ago

Thanks a lot for your quick response again. I'm glad the info I gave could provide some help. I know you guys are busying migrating to Wasmer3 (that's a great news), I'm looking forward to see that happen in the near future~

Currently I can play with the work-around and continue my development, looking forward to hear your good news soon!

xdlin commented 9 months ago

I suspect the migration to Wasmer 3 might resolve this, since it changes the Send + Sync properties of the runtime

Hi @arendjr I tried to fix this issue with a local modification fp-bindgen on my own, and I did upgrade to Wamser 4 (based on the work Roy Jacobs roy.jacobs@gmail.com did on wasmer3 branch), with a quick and dirty hack, I made it work just like what it is for wasmer2:

we have the same expected behavior with single thread Tokio runtime
we have the same panic with multi-thread Tokio runtime

One issue with Wasmer4 is FunctionEnvMut is not Send, nor is its underlying Store, so it's hard to pass it to tokio::spawn which is how async host function is implemented, so I have to involve unsafe to archive that, after I doing more test and cleaning up the code, I'll submit a MR here later.

But that's not my key point here, my question is, even if we make it work with Wamser4, it seems like we still have to async function panic issue, what further action should I take, try to improve Wasmer4 related changes, or try another direction?

By saying 'another direction', I have a wild guess: by checking the file fp-bindgen-support/src/guest/async/task.rs, which says it's a modified version from https://github.com/rustwasm/wasm-bindgen/blob/master/crates/futures/src/task/singlethread.rs . Looking through wasm-bindgen repo, there is also a multi thread version task: https://github.com/rustwasm/wasm-bindgen/blob/main/crates/futures/src/task/multithread.rs, so will migrate to this multithread task a possible solution for this issue?

xdlin commented 9 months ago

One issue with Wasmer4 is FunctionEnvMut is not Send Just FYI: An issue on wasmer : https://github.com/wasmerio/wasmer/issues/3482 And a discussion on SO: https://stackoverflow.com/questions/75753403/wasmer-host-functions-accessing-memory

arendjr commented 9 months ago

I may need to spend some more time to look deeper into this issue, but from my cursory understanding, when you say:

One issue with Wasmer4 is FunctionEnvMut is not Send, nor is its underlying Store, so it's hard to pass it to tokio::spawn which is how async host function is implemented, so I have to involve unsafe to archive that, after I doing more test and cleaning up the code, I'll submit a MR here later.

I would suspect the use of unsafe here to be the reason why the crash can be reproduced with Wasmer4. I think in Wasmer2 it was a mistake for the Store to be Send, because it made us think it was safe to invoke functions from multiple threads, which was never the case. Now they’ve made the API more strict, which correctly reveals that we shouldn’t. We can work around it with unsafe, but that just puts the blame on us for triggering the panic.

What I think is the right approach here, is for us to implement some mechanism to run the WASM environment in a single thread (may be the one from which it is created, or a dedicated one) and use a channel to make sure callbacks are all processed by that same thread.

I haven’t yet looked at task/multithread.rs to know if it might do something like this for us.

xdlin commented 2 months ago

hi @arendjr I followed your advice and dis some resarch, and come up with a primitive fix: it does have performance issue, but at least I could get correct results in multi-thread tokio runtime without any panic.

Would you please help to review this change? And feel free to copy it partially if you some pieces are useful but unable to merge it due to whatever reason. If you believe that's the right direction, I could spend more efforts to improve it.

the PR: https://github.com/fiberplane/fp-bindgen/pull/210

arendjr commented 2 months ago

Sorry, I was on holiday. I no longer work at Fiberplane, so I can't really help with this anymore.

fiberplane / fp-bindgen

wasmer panic when await multiple async Host function with join #194