Vrtgs / thirtyfour

Selenium WebDriver client for Rust, for automated testing of websites

Spawn a task in `Drop` instead of `block_on` #248

Closed SabrinaJewson closed 3 weeks ago

SabrinaJewson commented 3 weeks ago

Related to #19. Currently, this library runs block_on in Drop. This is incorrect and a bug, as it can cause deadlocks.

Instead, the library should attempt to spawn a task so that quit() can run in the background. At least for WebDriver itself, the SessionHandle is in an Arc so this is possible.

But even better than that would be to refactor SessionHandle to always contain an internal Arc:

```rust
#[derive(Clone)]
pub struct SessionHandle(Arc<Inner>);

struct Inner {
    client: Arc<dyn HttpClient>,
    server_url: Url,
    // etc.
}
```

Store a SessionHandle instead of an Arc<SessionHandle> in WebDriver. In impl Drop for SessionHandle, it can spawn a task and not block_on.

I see you’re using futures::executor::block_on to block in Drop, and this is presumably to work around the panic you get in Tokio. But the panic exists for a reason: it’s an indicator of behaviour that is always incorrect. Working around it via futures::executor::block_on is like using unsafe to get around the limitation that borrowck won’t allow you to make two &muts to the same place. Granted, here it’s not insta-UB, but it’s always incorrect.

SabrinaJewson commented 3 weeks ago

I apologize for not reading through #7 – I understand that spawning a task isn’t always desirable, because it can cause the quitting function not to finish.

One thing you can do is spawn a `spawn_blocking` task:

```rust
let handle = tokio::runtime::Handle::current();
tokio::task::spawn_blocking(move || {
    handle.block_on(async { /* … */ })
});
```

This will prevent the runtime from _actually_ exiting until the spawned task has completed, as `impl Drop for Runtime` will wait for `spawn_blocking` tasks.

Edit: I was mistaken – because dropping a runtime immediately shuts down I/O resources, the future you’re blocking on will just cancel.

My remaining suggestion is to use block_in_place if the runtime is multi-threaded (which can be checked via runtime::Handle::runtime_flavor), and simply leak otherwise. I think that in general, a leak is preferable to a deadlock.

Vrtgs commented 3 weeks ago

No, I do not in fact use any workarounds: I wait in place, and I block in place if possible. I understand the confusion you may have had when seeing `block_on`, but it’s an internal API that I also use for tests, and that’s why I named it that 🙃. Also, the ONLY way this can cause deadlocks is if the HTTP client relies on a task spawned on the main runtime, which the default doesn’t, but I should mention that in the docs, you do make a point. From /src/support.rs:

```rust
use std::convert::Infallible;
use std::future::Future;
use std::panic::AssertUnwindSafe;
use std::sync::LazyLock;
use std::thread;

/// Helper to run the specified future and block the current thread waiting for the result.
/// Works even while in a tokio runtime.
pub fn block_on<F>(future: F) -> F::Output
where
    F: Future + Send,
    F::Output: Send,
{
    fn no_unwind<T>(f: impl FnOnce() -> T) -> T {
        let res = std::panic::catch_unwind(AssertUnwindSafe(f));

        res.unwrap_or_else(|_| {
            struct Abort;
            impl Drop for Abort {
                fn drop(&mut self) {
                    eprintln!("unrecoverable error reached, aborting...");
                    std::process::abort()
                }
            }

            let _abort_on_unwind = Abort;
            unreachable!("thirtyfour global runtime panicked")
        })
    }

    static GLOBAL_RT: LazyLock<tokio::runtime::Handle> = LazyLock::new(|| {
        no_unwind(|| {
            let rt = tokio::runtime::Builder::new_current_thread().enable_all().build().unwrap();
            let handle = rt.handle().clone();

            // drive the runtime
            // we do this so that all calls to GLOBAL_RT.block_on() work
            thread::spawn(move || -> ! {
                async fn forever() -> ! {
                    match std::future::pending::<Infallible>().await {}
                }

                no_unwind(move || rt.block_on(forever()))
            });
            handle
        })
    });

    macro_rules! block_global {
        ($future:expr) => {
            thread::scope(|scope| match scope.spawn(|| GLOBAL_RT.block_on($future)).join() {
                Ok(res) => res,
                Err(panic) => std::panic::resume_unwind(panic),
            })
        };
    }

    cfg_if::cfg_if! {
        if #[cfg(feature = "tokio-multi-threaded")] {
            use tokio::runtime::RuntimeFlavor;

            match tokio::runtime::Handle::try_current() {
                Ok(handle) if handle.runtime_flavor() == RuntimeFlavor::MultiThread => {
                    tokio::task::block_in_place(|| handle.block_on(future))
                }
                _ => block_global!(future),
            }
        } else {
            block_global!(future)
        }
    }
}
```
SabrinaJewson commented 3 weeks ago

I see, thanks for the clarification. But that is still blocking in Drop – if you’re goïng to be running a separate runtime anyway, maybe you could have a spawn_blocking task that runs the code in the global Tokio runtime? That way, the main runtime will be unable to exit until that blocking task completes, but since the global runtime won’t have shut down yet, I/O resources are still able to be used.

Though on a side note, I’m rather surprised if that actually works – I don’t believe that in general I/O resources from one tokio runtime can be used from another one like that. Doesn’t it panic?

Vrtgs commented 3 weeks ago

maybe you could have a spawn_blocking

so `block_on` has the signature

```rust
pub fn block_on<F: Future>(&self, future: F) -> F::Output
```

and `spawn_blocking` has

```rust
pub fn spawn_blocking<F, R>(f: F) -> JoinHandle<R>
where
    F: FnOnce() -> R + Send + 'static,
    R: Send + 'static,
```
note that `JoinHandle` has no way of blocking to wait for it synchronously; you can only await it. That being said, you can still work around this fact using a std oneshot channel, but the biggest deal is that it requires `F` to be `'static`

Though on a side note, I’m rather surprised if that actually works – I don’t believe that in general I/O resources from one tokio runtime can be used from another one like that. Doesn’t it panic?

this is kinda complicated and gets into internals, but the TLDR is it works: when you poll the future (after transfer), since you didn’t poll it before using a different runtime, the Waker that gets registered is the global runtime’s, and the future just uses the global runtime’s resources. No panic here

SabrinaJewson commented 3 weeks ago

note that `JoinHandle` has no way of blocking to wait for it synchronously; you can only await it

The point is to throw away the join handle, because blocking in Drop in a runtime without block_in_place is a bug. The use of the spawn_blocking task ensures the runtime does not actually exit until the closure is complete.

the ONLY way this can cause deadlocks is if the HTTP client relies on a spawned task on the main thread which the default doesnt

Is that true? The I/O resources used by the HTTP cliënt are still bound to the runtime, and if the runtime is blocked it can’t process events given by epoll, hence the HTTP connection cannot be driven forward. I was able to produce deadlocks with this code:

```rust
use thirtyfour::FirefoxCapabilities;
use thirtyfour::WebDriver;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    WebDriver::new("http://localhost:4444", FirefoxCapabilities::new())
        .await
        .unwrap();
}
```

this is kinda complicated and gets into internals, but the TLDR is it works: when you poll the future (after transfer), since you didn’t poll it before using a different runtime, the Waker that gets registered is the global runtime’s, and the future just uses the global runtime’s resources. No panic here

That’s not what I was asking. But I did some testing and the way it works is that while Tokio I/O resources cannot dynamically change the runtime they use internally, it’s perfectly okay to use the I/O resources bound to runtime A inside runtime B, so long as runtime A is not shut down.

Which is essentially to say: if this strategy works, you’re blocking the thread in an async context, which is wrong. That said, I’ve learnt something new about the situation, so let me update my suggestions:

First of all, in the multi-threaded case, block_in_place is fine.

In the single-threaded case, blocking at all in Drop is a bug. What can be done instead is to start a spawn_blocking task in the Drop implementation (and not wait for it to complete). This ensures that the outer runtime will not quit until the closure passed into spawn_blocking completes. However, the I/O resources may have shut down in this time; so what we need to do is creäte a new runtime inside the spawn_blocking call and creäte a new reqwest::Client in this code path.

The purpose of making a wholly new reqwest::Client is to ensure its I/O resources are bound to the new runtime, not the old runtime (that might be shutting down).

Vrtgs commented 3 weeks ago

That’s not what I was asking. But I did some testing and the way it works is that while tokio I/O resources cannot dynamically change the runtime they use internally, it’s perfectly okay to use the I/O resources bound to runtime A inside runtime B, so long as runtime A is not shut down.

Yeah, and again, if nothing is bound yet it works out; as I said, it’s complicated and really gets down to how the runtimes work

Which is essentially to say: if this strategy works, you’re blocking the thread in an async context, which is wrong.

In the single-threaded case, blocking at all in Drop is a bug.

this is just wrong, you can block any async executor any time, any day, in any case, because you do that whenever you do arithmetic, wait on a synchronous mutex, or do any operation that isn’t the future returning Poll::Pending. It’s just that in most cases you shouldn’t block for a long time, like when you do fs access or network calls, since it impacts performance a lot. But this is just a web driver automation crate; most users wouldn’t care if they blocked a little bit when using a web driver

What can be done instead is to start a spawn_blocking task in the Drop implementation (and not wait for it to complete).

this code runs fine and finishes, it doesn’t hang forever. Why? Because the blocking task has not been scheduled yet:

```rust
#[tokio::main]
async fn main() {
    tokio::task::spawn_blocking(move || {
        std::thread::park();
    });
}
```

and it’s really common to be dropping the WebDriver at the end of your program, so it’s hard to guarantee it got scheduled

and guess what, doing something like this still blocks the runtime for an unknown amount of time, as we don’t know when the blocking pool is going to be free:

```rust
#[tokio::main]
async fn main() {
    let (tx, rx) = std::sync::mpsc::sync_channel(0);
    tokio::task::spawn_blocking(move || {
        let _ = tx.send(());
        /* rest of the code */
    });

    rx.recv().expect("could not spawn task");
}
```

And this doesn’t even address the fact that the closure has to be `'static`

in short, AsyncDrop is really complicated, and just blocking on works perfectly fine. I don’t expect people to be making LOTS of these, and I really don’t expect a lot of people to not be closing them if they are making lots of web drivers. So as long as the HttpClient doesn’t await something running on the main runtime (and btw that includes things spawned using LocalSet, or join! in a multi-threaded runtime), which is a very uncommon thing to do in an http client and isn’t done by the default one, it works

the only other option is to duplicate a lot of code and use the ureq crate for http for a truly sync drop, or use a janky solution that requires 2 Arc indirections and a lot of refactoring to drop (no, you cannot use the Weak returned by Arc::new_cyclic; while the object is being dropped, upgrade will return None), and that I guess usually blocks for a tiny bit less, but for what sake?? we’d change a lot of code and it’d be impacted negatively

Vrtgs commented 3 weeks ago

Is that true? The I/O resources used by the HTTP cliënt are still bound to the runtime, and if the runtime is blocked it can’t process events given by epoll, hence the HTTP connection cannot be driven forward. I was able to produce deadlocks with this code:

```rust
use thirtyfour::FirefoxCapabilities;
use thirtyfour::WebDriver;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    WebDriver::new("http://localhost:4444", FirefoxCapabilities::new())
        .await
        .unwrap();
}
```

okay, sheesh, this is embarrassing, my bad bro 🫠 I didn’t know it would block. Yeah, this doesn’t work, never mind. I have genuinely no idea how this wasn’t caught in CI; tests don’t use quit most of the time