actix / actix-web

Actix Web is a powerful, pragmatic, and extremely fast web framework for Rust.
https://actix.rs
Apache License 2.0

Actix-web hangs forever if request processing takes a long time. #941

Closed moxian closed 2 years ago

moxian commented 5 years ago

The code (for actix-web-0.7, I didn't try reproducing on 1.0 yet):

use actix_web::{HttpRequest, HttpResponse};

use std::sync::{Arc, Mutex};

struct Unit;
struct SlowStuff {}

impl SlowStuff {
    pub fn new() -> SlowStuff {
        // some delay here seems crucial, since I couldn't reproduce without it
        // it does not need to be thread::sleep, but can be
        // e.g. reading and parsing a big file.
        std::thread::sleep(std::time::Duration::from_secs(2));
        SlowStuff {}
    }
}

struct State {
    unit: Arc<Mutex<Unit>>,
    // wrapping the slow stuff into Arc<Mutex<>> can be done but doesn't matter
    slow_stuff: SlowStuff,
}

fn index(req: &HttpRequest<State>) -> HttpResponse {
    println!("got request to: {:?}", req.uri());

    {
        // Taking the lock is moderately important.
        // I think I've reproduced this several times without taking the
        // lock, but it's very inconsistent.
        let unit: &Unit = &req.state().unit.lock().unwrap();
        // this delay is important. Couldn't reproduce without
        // it does not need to be thread::sleep, but can be
        // e.g. reading and parsing a big file.
        std::thread::sleep(std::time::Duration::from_millis(1000));
        println!("got unit: {:?}", unit as *const Unit);
    }

    let res = format!("hello");
    let resp = HttpResponse::Ok().body(res);
    println!("made a response");

    resp
}

fn main() {
    actix_web::server::new(move || {
        actix_web::App::with_state(State {
            unit: Arc::new(Mutex::new(Unit)),
            slow_stuff: SlowStuff::new(),
        })
        .resource("/", |r| r.f(index))
    })
    // worker count does not matter, but I assume 1 is easier to debug?
    .workers(1)
    .bind("127.0.0.1:8081")
    .unwrap()
    .run();
}

Go to http://localhost:8081/, see the page showing fine. If you refresh the page several times (it might need several refreshes in quick succession? unsure... I just tap F5 real quick), the page gets into an eternal "loading" state, and further attempts to access the endpoint (also from other browser tabs) never get a response.

Stacktrace as reported by ProcessExplorer in the frozen state (identical for both the main thread, and the two others):

ntoskrnl.exe!KiCpuId+0xaa
ntoskrnl.exe!ExAcquireSpinLockExclusiveAtDpcLevel+0x2c2
ntoskrnl.exe!KeWaitForSingleObject+0x1a3
ntoskrnl.exe!ExReleaseResourceForThreadLite+0x60
ntoskrnl.exe!KeInsertQueueApc+0x725
ntoskrnl.exe!ExAcquireSpinLockExclusiveAtDpcLevel+0x4cd
ntoskrnl.exe!KeRemoveQueueEx+0x323
ntoskrnl.exe!PsAssignImpersonationToken+0x13c
ntoskrnl.exe!RtlPinAtomInAtomTable+0x1338
ntoskrnl.exe!longjmp+0x5b93
ntdll.dll!ZwRemoveIoCompletionEx+0xa
KERNELBASE.dll!GetQueuedCompletionStatusEx+0xdf
kernel32.dll!GetQueuedCompletionStatusEx+0x19
actix-hang.exe!miow::iocp::CompletionPort::get_many+0x215

Windows 7, firefox browser.

P.s.: I cannot reproduce it on linux, so probably this is windows-specific..

fafhrd91 commented 5 years ago

std::thread::sleep blocks the entire thread, so no other request can be served by that thread. So eventually you block all threads.

you can not execute blocking synchronous operation on async thread
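This effect can be simulated with the standard library alone. The sketch below is purely illustrative (a channel-based stand-in for a single async worker, not actix internals; the function name `time_second_request` is made up for the demo): with one worker, a request that blocks the thread delays every request queued behind it. In actix-web the usual remedy is to offload blocking work, e.g. via `actix_web::web::block`, so the worker thread keeps serving other requests.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Simulate one worker thread (like `.workers(1)`): jobs run strictly
/// in order, so a blocking job delays everything queued behind it.
fn time_second_request(block_for: Duration) -> Duration {
    let (tx, rx) = mpsc::channel::<Box<dyn FnOnce() + Send>>();
    let worker = thread::spawn(move || {
        for job in rx {
            job();
        }
    });

    let start = Instant::now();
    let (done_tx, done_rx) = mpsc::channel();

    // "Request 1" blocks the worker, as thread::sleep does in the handler.
    tx.send(Box::new(move || thread::sleep(block_for))).unwrap();
    // "Request 2" is instant, but it cannot even start until request 1 finishes.
    tx.send(Box::new(move || done_tx.send(start.elapsed()).unwrap()))
        .unwrap();

    let elapsed = done_rx.recv().unwrap();
    drop(tx); // close the channel so the worker loop exits
    worker.join().unwrap();
    elapsed
}

fn main() {
    let elapsed = time_second_request(Duration::from_millis(200));
    // The trivial second request is observed only after the first one's sleep.
    println!("second request served after {:?}", elapsed);
}
```

The same queuing happens per worker in the server: once every worker thread is parked in a sleep (or a long file read), no connection bound to those workers makes progress.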

moxian commented 5 years ago

> you can not execute blocking synchronous operation on async thread

Would that mean reading a file to serve it from a handler is not allowed, as that is a synchronous blocking operation?

I don't find this explanation compelling.

Yes, std::thread::sleep blocks the entire thread, but it only does so until the sleep is over. Once the specified duration has passed, the thread should become unblocked and ready to serve further requests. In my case, however, the application hangs forever.

I've just restarted the binary, refreshed the page, refreshed it again, and then pressed F5 five times in quick succession. It's been at least a minute since then; the page is still "loading", and trying to open it in a new tab yields no results. The console has:

PS D:\work\trash\repros\tera-hang> cargo run
   Compiling tera-hang v0.1.0 (D:\work\trash\repros\tera-hang)
warning: field is never used: `slow_stuff`
  --> src\main.rs:21:5
   |
21 |     slow_stuff: SlowStuff,
   |     ^^^^^^^^^^^^^^^^^^^^^
   |
   = note: #[warn(dead_code)] on by default

    Finished dev [unoptimized + debuginfo] target(s) in 29.53s
     Running `target\debug\tera-hang.exe`
SystemTime { intervals: 132059921846500000 } got request to: /
got unit: 0x257a89
made a response
SystemTime { intervals: 132059921866920000 } got request to: /
got unit: 0x257a89
made a response
SystemTime { intervals: 132059921886330000 } got request to: /
got unit: 0x257a89
made a response

The application is not being stuck inside my handler, but somewhere in actix instead.

fafhrd91 commented 5 years ago

Could you port your example to 1.0, 0.7 is in maintenance mode

moxian commented 5 years ago

Here is what seems to be the 1.0 port:

use actix_web::{HttpRequest, HttpResponse};

use std::sync::{Arc, Mutex};

struct Unit;
struct SlowStuff {}

impl SlowStuff {
    pub fn new() -> SlowStuff {
        // some delay here seems crucial, since I couldn't reproduce without it
        // it does not need to be thread::sleep, but can be
        // e.g. reading and parsing a big file.
        std::thread::sleep(std::time::Duration::from_secs(2));
        SlowStuff {}
    }
}

struct State {
    unit: Arc<Mutex<Unit>>,
    // wrapping the slow stuff into Arc<Mutex<>> can be done but doesn't matter
    slow_stuff: SlowStuff,
}

fn index(req: HttpRequest, data: actix_web::web::Data<State>) -> HttpResponse {
    // let data = req.app_data::<State>().unwrap();
    println!(
        "{:?} got request to: {:?}",
        std::time::SystemTime::now(),
        req.uri()
    );

    {
        // Taking the lock is moderately important.
        // I think I've reproduced this several times without taking the
        // lock, but it's very inconsistent.
        let unit: &Unit = &data.unit.lock().unwrap();
        // this delay is important. Couldn't reproduce without
        // it does not need to be thread::sleep, but can be
        // e.g. reading and parsing a big file.
        std::thread::sleep(std::time::Duration::from_millis(1000));
        println!("got unit: {:?}", unit as *const Unit);
    }

    let res = format!("hello");
    let resp = HttpResponse::Ok().body(res);
    println!("made a response");

    resp
}

fn main() {
    actix_web::HttpServer::new(move || {
        actix_web::App::new()
            .data(State {
                unit: Arc::new(Mutex::new(Unit)),
                slow_stuff: SlowStuff::new(),
            })
            .service(actix_web::web::resource("/").to(index))
    })
    // worker count does not matter, but I assume 1 is easier to debug?
    .workers(1)
    .bind("127.0.0.1:8080")
    .unwrap()
    .run()
    .unwrap();
}

but I cannot reproduce the issue with that.

Without understanding what the root cause of the issue is in 0.7, I feel a little uneasy not knowing whether it was actually fixed in 1.0 or just hidden (and now needs a slightly tweaked example to resurface). In other words, I would be happy to see an investigation into what's going on, but not necessarily a fix.

That said, closing the issue now is probably also totally reasonable.

moxian commented 5 years ago

Hello, I ported my real code to actix 1.0, but unfortunately hit the same issue...

Here's the reproduction case; the only difference from the code above that was needed is an extra function that takes a reference to the mutex-guarded contents:

Code:

```rust
use actix_web::{HttpRequest, HttpResponse};

use std::sync::{Arc, Mutex};

struct Unit;
struct SlowStuff {}

impl SlowStuff {
    pub fn new() -> SlowStuff {
        // some delay here seems crucial, since I couldn't reproduce without it
        // it does not need to be thread::sleep, but can be
        // e.g. reading and parsing a big file.
        std::thread::sleep(std::time::Duration::from_secs(2));
        SlowStuff {}
    }
}

struct State {
    unit: Arc<Mutex<Unit>>,
    // wrapping the slow stuff into Arc<Mutex<>> can be done but doesn't matter
    slow_stuff: SlowStuff,
}

fn index(req: HttpRequest, data: actix_web::web::Data<State>) -> HttpResponse {
    // let data = req.app_data::<State>().unwrap();
    println!(
        "{:?} got request to: {:?}",
        std::time::SystemTime::now(),
        req.uri()
    );

    let res = {
        let unit: &Unit = &data.unit.lock().unwrap();
        // it is somehow important to pass the lock to a different function
        stuff(unit)
    };

    let resp = HttpResponse::Ok().body(res);
    println!("made a response");

    resp
}

fn stuff(_unit: &Unit) -> String {
    std::thread::sleep(std::time::Duration::from_millis(1000));
    // this can be a constant string, using time only for demonstration
    return format!("{:?}", std::time::SystemTime::now());
}

fn main() {
    actix_web::HttpServer::new(move || {
        actix_web::App::new()
            .data(State {
                unit: Arc::new(Mutex::new(Unit)),
                slow_stuff: SlowStuff::new(),
            })
            .service(actix_web::web::resource("/").to(index))
    })
    // worker count does not matter, but I assume 1 is easier to debug?
    .workers(1)
    .bind("127.0.0.1:8080")
    .unwrap()
    .run()
    .unwrap();
}
```

P.s.: A data point: If there are several HttpServers in the actix system, then only the one bound to this particular address gets unresponsive. The rest continue working fine.

moxian commented 5 years ago

I tried running the example under a logger and got the following:

    Finished dev [unoptimized + debuginfo] target(s) in 6.00s
     Running `target\debug\main1.exe`
2019-06-27 05:53:36,219 INFO  [actix_server::builder] Starting 1 workers
2019-06-27 05:53:36,220 INFO  [actix_server::builder] Starting server on 127.0.0.1:8080
2019-06-27 05:53:38,223 ERROR [actix_server::accept] Can not resume socket accept process: The parameter is incorrect. (os error 87)
SystemTime { intervals: 132060884182248000 } got request to: /
made a response
SystemTime { intervals: 132060884203728000 } got request to: /
made a response

So this is the same root cause as in https://github.com/actix/actix-web/issues/905

I couldn't reproduce it with Debug or a lower logging level so far, but I'll update the bug if I manage to.

P.s.: I mentioned that my go-to reproduction step here is to just press F5 several times in quick succession, but when I encountered this in real code, I was refreshing the page as one normally would, with no urgency, while the server was otherwise totally idle.

P.p.s: Managed to get Debug level:

log:

```
06:30:13 [ INFO] Starting 1 workers
06:30:13 [ INFO] Starting server on 127.0.0.1:8080
06:30:13 [DEBUG] (4) tokio_reactor::background: starting background reactor
06:30:13 [DEBUG] (1) tokio_reactor: adding I/O source: 0
06:30:13 [DEBUG] (1) tokio_reactor: adding I/O source: 4194305
06:30:13 [DEBUG] (1) tokio_reactor::registration: scheduling Read for: 0
06:30:13 [DEBUG] (1) tokio_reactor::registration: scheduling Read for: 1
06:30:15 [ERROR] Can not resume socket accept process: The parameter is incorrect. (os error 87)
06:30:15 [DEBUG] (2) tokio_reactor: adding I/O source: 0
06:30:15 [DEBUG] (2) tokio_reactor::registration: scheduling Read for: 0
06:30:15 [ INFO] got a request
06:30:16 [ INFO] made a response, returning
06:30:16 [DEBUG] (2) tokio_reactor: dropping I/O source: 0
```
moxian commented 5 years ago

Caught the thing with Trace logging on: listener-trace-1561618348.log. This is a real application, so a bunch of unrelated stuff might be going on. There are, unfortunately, two HTTP servers; the one affected is on port 8081.

Edit: another one: listener-trace-1561619274.log. This time there's an Info line when entering and exiting the handler. The socket gets lost way before getting there. I believe (but am surprisingly unsure) that the content did get served to the browser, and it's only the subsequent requests that got stalled.

fafhrd91 commented 5 years ago

I don't have access to windows, and I can not reproduce this on osx

moxian commented 5 years ago

Is there anything I can help here with? I don't understand actix enough to debug this on my own, but would be happy to add whatever diagnostics needed.

moxian commented 5 years ago

I have delved into this a bit. The callstack is the following:

- actix_server::accept::Accept::process_cmd https://github.com/actix/actix-net/blob/server-v0.5.1/actix-server/src/accept.rs#L303
- mio::poll::Poll::register https://github.com/tokio-rs/mio/blob/v0.6.19/src/poll.rs#L790
- mio::sys::windows::selector::ReadyBinding::register_socket https://github.com/tokio-rs/mio/blob/v0.6.19/src/sys/windows/selector.rs#L361
- mio::sys::windows::selector::Binding::register_socket https://github.com/tokio-rs/mio/blob/v0.6.19/src/sys/windows/selector.rs#L200
- miow::iocp::CompletionPort::add_socket https://github.com/alexcrichton/miow/blob/0.2.1/src/iocp.rs#L74
- miow::iocp::CompletionPort::_add https://github.com/alexcrichton/miow/blob/0.2.1/src/iocp.rs#L80
- winapi::CreateIoCompletionPort

This is not exposed in the docs for some reason, but there is this interesting comment inside mio::register:

        /*
         * Undefined behavior:
         * - Reusing a token with a different `Evented` without deregistering
         * (or closing) the original `Evented`.
         */

Following this all the way down to the CreateIoCompletionPort MSDN entry, we see it mention, in passing: "This completion key should be unique for each file handle, and it accompanies the file handle throughout the internal completion queuing process." And, evidently, violating this requirement is treated as passing an incorrect parameter, which is fair, I guess.

When I add a diagnostic print, I see that CreateIoCompletionPort is indeed called with the same non-null parameters multiple times. And this makes sense, since in the above trace logs you can see (the number in brackets is said to be the thread number):

06:52:28 [TRACE] (6) mio::sys::windows::selector: [C:\Users\moxian\.cargo\registry\src\github.com-1ecc6299db9ec823\mio-0.6.19\src\sys\windows\selector.rs:359] register Token(100) Readable
<...>
06:52:28 [TRACE] (12) mio::sys::windows::selector: [C:\Users\moxian\.cargo\registry\src\github.com-1ecc6299db9ec823\mio-0.6.19\src\sys\windows\selector.rs:359] register Token(100) Readable
<...>
06:52:30 [TRACE] (12) mio::sys::windows::selector: [C:\Users\moxian\.cargo\registry\src\github.com-1ecc6299db9ec823\mio-0.6.19\src\sys\windows\selector.rs:359] register Token(100) Readable
06:52:30 [ERROR] Can not resume socket accept process: The parameter is incorrect. (os error 87)
<...>

Why is the token the same? I'm not sure... I tried reading the actix-server code but didn't get far, unfortunately.
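To restate the invariant in code: a completion key (mio's Token) must stay associated with one handle until it is deregistered. The toy model below is illustrative only; this `CompletionPort` is a hypothetical stand-in, not mio's or Windows' actual implementation. It shows the failure mode visible in the trace logs: re-registering the same key for a different handle without deregistering first is rejected, which is what surfaces as os error 87.

```rust
use std::collections::HashMap;

/// Toy model of the IOCP rule that a completion key accompanies a single
/// file handle. Not real mio/Windows code; just the invariant.
struct CompletionPort {
    keys: HashMap<u64, u64>, // completion key -> handle it was registered with
}

impl CompletionPort {
    fn new() -> Self {
        CompletionPort { keys: HashMap::new() }
    }

    /// Mirrors CreateIoCompletionPort: reusing a key for a different handle
    /// without deregistering is rejected (ERROR_INVALID_PARAMETER, os error 87).
    fn add_handle(&mut self, key: u64, handle: u64) -> Result<(), &'static str> {
        if let Some(&existing) = self.keys.get(&key) {
            if existing != handle {
                return Err("The parameter is incorrect. (os error 87)");
            }
        }
        self.keys.insert(key, handle);
        Ok(())
    }

    fn remove_handle(&mut self, key: u64) {
        self.keys.remove(&key);
    }
}

fn main() {
    let mut port = CompletionPort::new();
    // Token(100) registered for the listener socket: fine.
    assert!(port.add_handle(100, 0xA).is_ok());
    // Same token re-registered for a different handle without deregistering:
    // this is the pattern in the "register Token(100) Readable" trace lines.
    assert!(port.add_handle(100, 0xB).is_err());
    // After a proper deregister, the token can be reused.
    port.remove_handle(100);
    assert!(port.add_handle(100, 0xB).is_ok());
    println!("ok");
}
```

Under this reading, the accept loop re-registers the listener under the same token after resuming, without the matching deregistration, and the kernel rejects it.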

Hope this helps!..

(This is probably not needed, but just in case: another Trace log with a tiny bit more logging (and slightly different line numbers as a result): listener-1561709109-trace.log )

fafhrd91 commented 5 years ago

@moxian thanks for investigation, that is helpful. i'll try to come up with some fixes

imor commented 4 years ago

I ran into this on Windows 10 with actix-web 2.0, but unfortunately I did not have debug logging enabled and now I can't replicate it. @moxian do you have any more information about this bug apart from what is mentioned in this issue? I'm happy to help troubleshoot in any way :)

moxian commented 4 years ago

No, sorry, I didn't use actix much since filing this report. This issue captures all I know about the problem.

pikitgb commented 4 years ago

I was able to see this issue using React useEffect without adding the [] dependency array to the effect. I have a Rust/Actix app; in my routes:

.route(
        "/api/v1/users{_:/?}",
        web::post().to(users_controller::index),
 )

All my user controller does is call reqwest (this is some sort of proxy HTTP call):

    .post("http://someotherwebservice:3333/api/legacy_users")
    .json(&request_body)
    .send()
    .await
    .expect("error")
    .json()
    .await
    .expect("error");

When I call this from the React client, it responds successfully with data the first time, but the second time it completely hangs Actix:

React.useEffect(() => {
       await fetch('http://localhost_actix_port/api/v1/users');
});

But when adding [] to the React.useEffect it doesn't block the backend. This is strange behavior, because basically any app can mount a DDoS attack on any backend that proxies HTTP requests in this way:

React.useEffect(() => {
       await fetch('http://localhost_actix_port/api/v1/users');
}, []); <- this avoids an infinite loop

actix-cors = "0.2.0"
actix-web = "2"
actix-service = "1.0.5"
actix-rt = "1"
reqwest = { version = "0.10", features = ["blocking", "json"] }

I can create an example if necessary with that part of the app.

pikitgb commented 4 years ago

After digging into this issue a little more, my issue is actually different: it seems I was sending a burst of requests to actix_web from the frontend (React). After a certain number of requests, say 50+ per second, actix was locking internally, and all subsequent requests also failed. I was able to resolve my issue by adding this dependency to Cargo.toml: actix-ratelimit = "0.2.1"

use std::time::{Duration};
use actix_ratelimit::{RateLimiter, MemoryStore, MemoryStoreActor};

add this middleware:


.wrap(
        RateLimiter::new(
        MemoryStoreActor::from(store.clone()).start())
            .with_interval(Duration::from_secs(1))
            .with_max_requests(5))

Idk if this helps someone else with the same issue.

iamdejan commented 3 years ago

> I was able to see this issue, using React useEffect without adding a return [] in the effect method [...] I can create an example if necessary with that part of the app.

Oh my God, same issue with me this month... I was desperate to find out the cause 😂

> After digging in this issue a little more, my issue is actually different [...] Idk if this helps someone else with the same issue.

let me try this solution.

iamdejan commented 3 years ago

> let me try this solution.

It seems this didn't work for me, because my actual problem was that I didn't pass a dependency array into the useEffect hooks, so I opted for a refactor instead.

robjtede commented 2 years ago

closing as stale. if there are more updates, please open a new issue in the context of newer actix web versions

coffeebe4code commented 1 year ago

Having the same issue.

actix-web = "4.3.1"
rand = "0.8"
tokio = { version = "1.26", features = ["time"] }

use actix_web::{get, App, HttpResponse, HttpServer, Responder};
use rand::{thread_rng, Rng};

#[get("/small")]
async fn small() -> impl Responder {
    let mut rng = thread_rng();
    let rand: f64 = rng.gen();
    if rand >= 0.6 {
        tokio::time::sleep(tokio::time::Duration::from_millis(100)).await;
    }
    HttpResponse::Ok().body(String::from("Hello world!"))
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| {
        App::new()
            .service(small)
    })
    .bind(("0.0.0.0", 8080))?
    .run()
    .await
}

I removed the random and sleep part, and it still hangs indefinitely.

coffeebe4code commented 1 year ago

I then removed the tokio dependency from Cargo.toml, downgraded to actix 3, and ran cargo clean. Here is the new Cargo.toml:

[package]
name = "actix-test"
version = "0.1.0"
edition = "2021"
publish = ["cargo"]

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
actix-web = "3"

and the code:

use actix_web::{get, App, HttpResponse, HttpServer, Responder};

#[get("/small")]
async fn small() -> impl Responder {
    HttpResponse::Ok().body(String::from("Hello world!"))
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| {
        App::new()
            .service(small)
    })
    .bind(("127.0.0.1", 8080))?
    .run()
    .await
}

First response returns fine. Second one gets econnreset.

I removed the request body which was application/json, sure enough, fixed the issue.

I should just be discarding the request body, as I don't specify that I want it in my GET endpoint. The total size of the payload is 170 KB. A possible default limit is 256 KB, so I reduced the size a bit so that two requests should fit without overflowing it. But two requests at 120 KB each still blew up on the second one. Even at 76 KB I had the issue. I went down to an absolutely tiny payload, and no longer got the issue.

What do I need to do, to get the framework to accept these payloads?

robjtede commented 1 year ago

If you send a payload, it should be read, or else the HTTP/1.1 connection will get stuck.

use actix_web::web::Bytes;

#[get("/small")]
async fn small(_body: Bytes) -> impl Responder {
  ...
}