FlashSystems / tokio-fastcgi

Async FastCGI handler library for Tokio.rs

[BUG?]: Payloads with size more than 512kb can't be written to the STDOUT #4

Closed · rustatian closed this issue 9 months ago

rustatian commented 9 months ago

Hey @FlashSystems 👋 Thanks for the library, good job 👍

I'm working on a PHP SAPI integration with a FastCGI transport and found some strange behavior when writing payloads with a body larger than 512 KiB.

Here is a simple example showing how to reproduce the issue:

let mut stdout = req.get_stdout();

// Write the response headers, followed by the empty line that ends them.
stdout.write(b"Content-Type: text/html\r\n").await.unwrap();
stdout.write(b"\r\n").await.unwrap();

// 800,000 bytes of '-' as the response body.
let buf = vec![b'-'; 800_000];

stdout.write(&buf).await.unwrap();
RequestResult::Complete(0)

This snippet needs to be pasted into the process callback (as in your simple example). But if you change 800_000 to 524_288 (512 KiB) or less, everything works without any problem.

In the nginx debug log, I see the following error:

2024/01/22 22:53:51 [error] 61119#0: *1 upstream sent unsupported FastCGI protocol version: 45 while reading upstream, client: 127.0.0.1

I'm not sure, but I guess this might be because the finish record is sometimes written before the actual payload chunks. What is super strange is that everything works fine with payloads of up to 512 KiB. (The reported "protocol version 45" may be a hint: 45 is the ASCII code for '-', so nginx seems to be reading body bytes where it expects a FastCGI record header.)

FlashSystems commented 9 months ago

I've tested your example code using Apache and could download a megabyte of dashes without any problems. I'll have to set up a test bed using Nginx; this may take a few days. Either it is a bug in Nginx or a compatibility issue between Nginx and tokio-fastcgi.

Could you please test whether setting fastcgi_buffering to off, or setting fastcgi_buffers to 128 8k, makes any difference? A sketch of where those directives go follows below.
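For reference, this is roughly where those directives would sit; the surrounding location block is only illustrative of a typical FastCGI proxy setup, not your actual configuration:

location / {
    fastcgi_pass  fastcgi_backend;
    fastcgi_keep_conn on;

    # Option 1: disable response buffering entirely.
    fastcgi_buffering off;

    # Option 2: instead, enlarge the buffers.
    #fastcgi_buffers 128 8k;
}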

rustatian commented 9 months ago

I've tried those options before, but there is still a bunch of error responses:

[screenshot: error responses]

Also, as you can see in the screenshot, I can get around ten incomplete responses, then one to three completed ones, and so on. And if you inspect the traffic with Wireshark/Tshark or a similar tool, you'll see a broken (or not fully delivered) response:

[screenshot: capture of the broken response]

I haven't tested with Apache, but I did try PHP-FPM + nginx (I'm not a PHP dev, I just wanted to compare), and with PHP-FPM there are zero errors. I even load tested it to see whether the errors show up later, but no, everything is fine there.

rustatian commented 9 months ago

I also wondered whether the error appears only on my Mac, so I switched to my Arch laptop, brought up the same environment from scratch, and still saw the errors.

FlashSystems commented 9 months ago

Hello @rustatian, I had some time to spare today and have set up an Nginx instance. I used the example with a payload size of 1,048,576 bytes and ran four curl instances in a loop, each fetching that 1 MiByte of dashes in parallel. Nothing. No errors; all output is delivered correctly.

I used this very bare-bones configuration for Nginx:

worker_processes  2;

events {
    worker_connections  1024;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    upstream fastcgi_backend {
        server localhost:8080;

        keepalive 8;
    }

    sendfile on;
    keepalive_timeout 10;

    server {
        listen       80;
        server_name  localhost;

        location / {
                fastcgi_pass  fastcgi_backend;
                fastcgi_keep_conn on;
        }
    }
}

My tests were run with nginx version 1.24.0.

I'm using this modified example.rs in case you want to try it out yourself:

use tokio::net::TcpListener;
use tokio_fastcgi::{Requests, RequestResult};

#[tokio::main]
async fn main() {
    let addr = "127.0.0.1:8080";
    let listener = TcpListener::bind(addr).await.unwrap();

    loop {
        let connection = listener.accept().await;
        // Accept new connections
        match connection {
            Err(err) => {
                println!("Establishing connection failed: {}", err);
                break;
            },
            Ok((mut stream, address)) => {
                println!("Connection from {}", address);

                // If the socket connection was established successfully spawn a new task to handle
                // the requests that the webserver will send us.
                tokio::spawn(async move {
                    // Create a new requests handler; it will collect the requests from the
                    // server and supply a streaming interface.
                    let mut requests = Requests::from_split_socket(stream.split(), 10, 10);

                    // Loop over the requests via the next method and process them.
                    while let Ok(Some(request)) = requests.next().await {
                        if let Err(err) = request.process(|request| async move {
                            println!("Request");
                            let mut stdout = request.get_stdout();

                            stdout.write(b"Content-Type: text/html\r\n").await.unwrap();
                            stdout.write(b"\r\n").await.unwrap();

                            let buf = vec![b'-'; 1024*1024];

                            stdout.write(&buf).await.unwrap();
                            RequestResult::Complete(0)
                        }).await {
                            // This is the error handler that is called if the process call returns an error.
                            println!("Processing request failed: {}", err);
                        }
                    }
                });
            }
        }
    }
}

Could you please post your Nginx version and configuration so I can more closely reproduce your setup?

rustatian commented 9 months ago

Thanks for spending time on this 👍 I tried exactly your configuration and still no luck. I also double-checked curl with curl <url> -o file.txt, and on every request I get a file with a different size and a curl error.

[screenshots: curl runs producing files of varying sizes with errors]

To be sure, I have everything compiled with the latest Rust version (1.75.0) and the latest nginx version: 1.25.3.

But after one restart, a strange thing happened: all my requests completed without errors. After another nginx restart, the errors returned 😢

rustatian commented 9 months ago

OK, I suspect I've found the issue: it's nginx on Mac... I retested on Arch and Ubuntu with the latest nginx version and... no errors. One more reason to wait for Asahi Linux 😃

rustatian commented 9 months ago

The only strange thing is that FPM works without problems with nginx on Mac...

rustatian commented 9 months ago

Hey @FlashSystems 👋 Over the past few days I ran some benchmarks/tests (amd64 platform). This is difficult to trace or explain, but I definitely see complete hangs during the tests. I use k6 with a standard configuration and different numbers of VUs (virtual users). For comparison, I tried to reproduce the behavior with the Go stdlib fcgi package, but couldn't, unfortunately. I also tried FPM with one static process: no hangs there either, 100% of requests processed successfully.

FlashSystems commented 9 months ago

Hello @rustatian, thank you for testing further. I have two questions:

  1. Is this on Linux or on Mac? I have no access to a Mac, and it would be great if I could reproduce the problem on Linux.
  2. Could you try adding a .flush().await.unwrap(); call to the stdout stream and test again? A sketch of that change follows below.
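For clarity, a minimal sketch of what that change could look like inside the process callback, assuming the stream's flush, like its write, is async:

let mut stdout = request.get_stdout();

stdout.write(b"Content-Type: text/html\r\n").await.unwrap();
stdout.write(b"\r\n").await.unwrap();

let buf = vec![b'-'; 800_000];
stdout.write(&buf).await.unwrap();

// Assumption: flush() is async on this stream, mirroring write(); it should
// force any buffered records out before the request is completed.
stdout.flush().await.unwrap();

RequestResult::Complete(0)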
rustatian commented 9 months ago

  1. Linux, Arch (amd64).
  2. Sure, will do.

FlashSystems commented 9 months ago

Hello @rustatian, I was now able to reproduce the problem. It seems to be a race condition somewhere, because it only happens under high throughput and when a lot of data is returned. I'll investigate.

FlashSystems commented 9 months ago

I think I found the problem: under high load, writes no longer complete atomically. My code did not handle that case correctly, and the FastCGI protocol got desynchronized because fewer bytes were written than stated in the message header. All subsequent writes on that connection then failed.

Commit 1b3d0d2b fixes this behaviour by retrying the partial write with the unwritten half of the buffer. In my tests I could handle 500 parallel connections downloading 2 MiByte of dashes without errors. Above that limit, other problems (like running out of file handles) started to appear, but the FastCGI server was rock solid ;)

@rustatian: Could you please retest with the HEAD of the master branch? I hope this solves your problem (even on Mac).
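For illustration, the shape of such a fix is the classic "write all" loop: if the underlying socket accepts only part of the buffer, retry with the unwritten remainder instead of assuming the write completed. This is a generic sketch of the pattern, not the actual code from commit 1b3d0d2b:

use std::io;
use tokio::io::{AsyncWrite, AsyncWriteExt};

// Keep writing until the whole buffer is on the wire, advancing past
// whatever each partial write managed to send.
async fn write_fully<W: AsyncWrite + Unpin>(writer: &mut W, mut buf: &[u8]) -> io::Result<()> {
    while !buf.is_empty() {
        match writer.write(buf).await? {
            // A write of zero bytes means the peer will accept no more data.
            0 => return Err(io::ErrorKind::WriteZero.into()),
            // Retry with the unwritten remainder of the buffer.
            n => buf = &buf[n..],
        }
    }
    Ok(())
}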

rustatian commented 9 months ago

Cool, thanks @FlashSystems 👍 I'll retry my tests today and will let you know the results ⚡

rustatian commented 9 months ago

@FlashSystems I did a quick test: no hangs or errors; everything looks pretty good. I still need to test our SAPI against the master branch, but I'm in the middle of a big refactoring, so I need a couple of days to finish it and run a final test.

rustatian commented 9 months ago

Hey @FlashSystems 👋 I can't reproduce this behavior anymore. I loaded my server like never before, and everything looks good now 👍

FlashSystems commented 9 months ago

Fixed with release 1.2.0.