caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0

caddyserver-reverseproxy + ffmpeg streaming causes invalid files/connection state #4922

Closed: kuhnchris closed this issue 2 years ago

kuhnchris commented 2 years ago

Hello,

First of all: great project, I love the Caddy server and what you guys are doing, keep up the good work! :+1:

Now for my problem:

I am using Caddy as a reverse proxy for a Python application that receives ffmpeg's streaming data and processes it (which is out of scope for this issue). As long as the client closes its connection to Caddy cleanly, we have no issues at all.

We are now trying low-latency streaming, which requires running ffmpeg with the parameter -http_persistent 0, and this causes problems with Caddy: with -http_persistent 0 the connection simply drops without a clean "exit".

It looks like, when Caddy sits in between as a proxy, every dropped connection has roughly a 50:50 chance of causing an issue on the Python side (OSErrors).

If we remove Caddy from the middle, it works fine (streaming directly to the Python server). This is not specific to uWSGI/Python either; we see the same issue with plain old nginx using WebDAV to PUT the files via ffmpeg.

I have created a PoC project, containing a minimum viable client for this issue: https://github.com/kuhnchris/caddyserver-ffmpeg-ll-hls-poc

In the default configuration (docker compose up -d && cd ffmpeg && sh ll_hls.sh), Caddy sometimes works without any hiccups and then randomly throws this on the Python (and nginx) side:

poc-caddy-1  | {"level":"debug","ts":1659176760.2804174,"logger":"http.handlers.reverse_proxy","msg":"upstream roundtrip","upstream":"uwsgi:8002","duration":0.001713273,"request":{"remote_ip":"172.18.0.1","remote_port":"51190","proto":"HTTP/1.1","method":"PUT","host":"localhost:8888","uri":"/example.m3u8","headers":{"User-Agent":["Lavf/58.76.100"],"X-Forwarded-Proto":["http"],"X-Forwarded-Host":["localhost:8888"],"Icy-Metadata":["1"],"Accept":["*/*"],"X-Forwarded-For":["172.18.0.1"],"X-Server-Name":["test"]}},"error":"context canceled"}
poc-caddy-1  | {"level":"debug","ts":1659176760.2806652,"logger":"http.handlers.reverse_proxy","msg":"upstream roundtrip","upstream":"uwsgi:8002","duration":0.00282887,"request":{"remote_ip":"172.18.0.1","remote_port":"51188","proto":"HTTP/1.1","method":"PUT","host":"localhost:8888","uri":"/example190.m4s","headers":{"X-Server-Name":["test"],"Accept":["*/*"],"X-Forwarded-Proto":["http"],"X-Forwarded-Host":["localhost:8888"],"Icy-Metadata":["1"],"X-Forwarded-For":["172.18.0.1"],"User-Agent":["Lavf/58.76.100"]}},"error":"context canceled"}

i.e. "context canceled". But only sometimes it feels like the reverse_proxy target is not receiving everything, so it feels like this is a race condition.

poc-uwsgi-1  | [pid: 9|app: 0|req: 61/379] 172.18.0.3 () {38 vars in 498 bytes} [Sat Jul 30 10:25:56 2022] PUT /example188.m4s => generated 0 bytes in 1 msecs (HTTP/1.1 500) 0 headers in 0 bytes (0 switches on core 4)
poc-uwsgi-1  | Traceback (most recent call last):
poc-uwsgi-1  |   File "/root/./test_handler.py", line 12, in dispatch
poc-uwsgi-1  |     while _ := uwsgi.chunked_read(30):
poc-uwsgi-1  | OSError: unable to receive chunked part
poc-uwsgi-1  | [pid: 8|app: 0|req: 320/380] 172.18.0.3 () {38 vars in 494 bytes} [Sat Jul 30 10:25:58 2022] PUT /example.m3u8 => generated 0 bytes in 0 msecs (HTTP/1.1 500) 0 headers in 0 bytes (0 switches on core 4)
poc-uwsgi-1  | [pid: 8|app: 0|req: 320/381] 172.18.0.3 () {38 vars in 498 bytes} [Sat Jul 30 10:25:58 2022] PUT /example189.m4s => generated 2 bytes in 1 msecs (HTTP/1.1 200) 2 headers in 64 bytes (1 switches on core 2)
poc-uwsgi-1  | Traceback (most recent call last):
poc-uwsgi-1  |   File "/root/./test_handler.py", line 12, in dispatch
poc-uwsgi-1  |     while _ := uwsgi.chunked_read(30):
poc-uwsgi-1  | OSError: unable to receive chunked part

On the nginx side it looks as follows:

poc-nginx-1  | 172.18.0.3 - - [30/Jul/2022:10:32:44 +0000] "PUT /example.m3u8 HTTP/1.1" 204 0 "-" "Lavf/58.76.100"
poc-nginx-1  | 172.18.0.3 - - [30/Jul/2022:10:32:44 +0000] "PUT /example392.m4s HTTP/1.1" 201 0 "-" "Lavf/58.76.100"
poc-nginx-1  | 172.18.0.3 - - [30/Jul/2022:10:32:46 +0000] "PUT /example393.m4s HTTP/1.1" 201 0 "-" "Lavf/58.76.100"
poc-nginx-1  | 172.18.0.3 - - [30/Jul/2022:10:32:46 +0000] "PUT /example.m3u8 HTTP/1.1" 204 0 "-" "Lavf/58.76.100"
poc-nginx-1  | 172.18.0.3 - - [30/Jul/2022:10:32:48 +0000] "PUT /example.m3u8 HTTP/1.1" 400 0 "-" "Lavf/58.76.100"
poc-nginx-1  | 172.18.0.3 - - [30/Jul/2022:10:32:48 +0000] "PUT /example394.m4s HTTP/1.1" 400 0 "-" "Lavf/58.76.100"
poc-nginx-1  | 172.18.0.3 - - [30/Jul/2022:10:32:50 +0000] "PUT /example.m3u8 HTTP/1.1" 400 0 "-" "Lavf/58.76.100"
poc-nginx-1  | 172.18.0.3 - - [30/Jul/2022:10:32:50 +0000] "PUT /example395.m4s HTTP/1.1" 201 0 "-" "Lavf/58.76.100"
...
poc-nginx-1  | 172.18.0.3 - - [30/Jul/2022:10:33:30 +0000] "PUT /example.m3u8 HTTP/1.1" 204 0 "-" "Lavf/58.76.100"
poc-nginx-1  | 172.18.0.3 - - [30/Jul/2022:10:33:32 +0000] "PUT /example.m3u8 HTTP/1.1" 204 0 "-" "Lavf/58.76.100"
poc-nginx-1  | 172.18.0.3 - - [30/Jul/2022:10:33:32 +0000] "PUT /example416.m4s HTTP/1.1" 201 0 "-" "Lavf/58.76.100"
poc-nginx-1  | 172.18.0.3 - - [30/Jul/2022:10:33:34 +0000] "PUT /example.m3u8 HTTP/1.1" 400 0 "-" "Lavf/58.76.100"

the "400" correspond to the broken issue, the 204s and 201s are OK. This happens pretty randomly. I tried looking around for the aformentioned "context closed" and found 2 spots in the reverse_proxy.go, but I didn't really see any way to work around this issue.

Does anyone have an idea how we could get this to work without Caddy randomly causing exceptions/errors when the (client) connection is closed?

mholt commented 2 years ago

Thanks for opening an issue! We'll look into this.

It's not immediately clear to me what is going on, so I'll need your help to understand it better.

Ideally, we need to be able to reproduce the bug in the most minimal way possible. This allows us to write regression tests to verify the fix is working. If we can't reproduce it, then you'll have to test our changes for us until it's fixed -- and then we can't add test cases, either.

I've attached a template below that will help make this easier and faster! This will require some effort on your part -- please understand that we will be dedicating time to fix the bug you are reporting if you can just help us understand it and reproduce it easily.

This template will ask for some information you've already provided; that's OK, just fill it out the best you can. :+1: I've also included some helpful tips below the template. Feel free to let me know if you have any questions!

Thank you again for your report, we look forward to resolving it!

Template

## 1. Environment

### 1a. Operating system and version

```
paste here
```

### 1b. Caddy version (run `caddy version` or paste commit SHA)

```
paste here
```

### 1c. Go version (if building Caddy from source; run `go version`)

```
paste here
```

## 2. Description

### 2a. What happens (briefly explain what is wrong)

### 2b. Why it's a bug (if it's not obvious)

### 2c. Log output

```
paste terminal output or logs here
```

### 2d. Workaround(s)

### 2e. Relevant links

## 3. Tutorial (minimal steps to reproduce the bug)

Instructions -- please heed them, otherwise we cannot help you (help us help you!)

  1. Environment: Please fill out your OS and Caddy versions, even if you don't think they are relevant. (They are always relevant.) If you built Caddy from source, provide the commit SHA and specify your exact Go version.

  2. Description: Describe at a high level what the bug is. What happens? Why is it a bug? Not all bugs are obvious, so convince readers that it's actually a bug.

    • 2c) Log output: Paste terminal output and/or complete logs in a code block. DO NOT REDACT INFORMATION except for credentials.
    • 2d) Workaround: What are you doing to work around the problem in the meantime? This can help others who encounter the same problem, until we implement a fix.
    • 2e) Relevant links: Please link to any related issues, pull requests, docs, and/or discussion. This can add crucial context to your report.
  3. Tutorial: What are the minimum required specific steps someone needs to take in order to experience the same bug? Your goal here is to make sure that anyone else can have the same experience with the bug as you do. You are writing a tutorial, so make sure to carry it out yourself before posting it. Please:

    • Start with an empty config. Add only the lines/parameters that are absolutely required to reproduce the bug.
    • Do not run Caddy inside containers.
    • Run Caddy manually in your terminal; do not use systemd or other init systems.
    • If making HTTP requests, avoid web browsers. Use a simpler HTTP client instead, like curl.
    • Do not redact any information from your config (except credentials). Domain names are public knowledge and often necessary for quick resolution of an issue!
    • Note that ignoring this advice may result in delays, or even in your issue being closed. 😞 Only actionable issues are kept open, and if there is not enough information or clarity to reproduce the bug, then the report is not actionable.

Example of a tutorial:

Create a config file: ``` { ... } ``` Open terminal and run Caddy: ``` $ caddy ... ``` Make an HTTP request: ``` $ curl ... ``` Notice that the result is ___ but it should be ___.
francislavoie commented 2 years ago

Did you try playing around with these buffering options? https://caddyserver.com/docs/caddyfile/directives/reverse_proxy#streaming

You might want to try setting the flush interval to -1.
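
For reference, a minimal Caddyfile sketch of that suggestion could look like the following (the listener address and upstream are placeholders matching the PoC logs, not a confirmed working config):

```
:8888 {
	reverse_proxy uwsgi:8002 {
		# -1 flushes the proxied response to the client immediately (no buffering)
		flush_interval -1
	}
}
```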

kuhnchris commented 2 years ago

Did you try playing around with these buffering options? https://caddyserver.com/docs/caddyfile/directives/reverse_proxy#streaming

You might want to try setting the flush interval to -1.

Yes, I played around with flush_interval and buffer_responses; the other two seem worth a try as well. I'll experiment and let you know, thanks for bringing this up!

kuhnchris commented 2 years ago

Hello @mholt - thanks for replying. I tried to fill in your template as well as possible; please let me know if I should dig into the Docker Hub images for the version requirements, or if I missed anything you need! FYI: I did see that you said "don't run Caddy in a container", but due to the nature of this setup we do require it to run in a container. I could try to run it outside, but that would require you to either set up a Python environment (for uWSGI) or nginx for a minimal test case.

Thanks and all the best, Chris

Template

1. Environment

Docker

1a. Operating system and version

Whatever the official caddy:latest is based on

1b. Caddy version (run caddy version or paste commit SHA)

Official caddy:latest docker hub image. At the time of writing: https://hub.docker.com/layers/caddy/library/caddy/latest/images/sha256-2728a7a5cc9045d475a134f9230565677fe26deb5306060a0ab766d8449f6ba4?context=explore

1c. Go version (if building Caddy from source; run go version)

Please refer to the official caddy:latest Docker Hub image.

2. Description

2a. What happens (briefly explain what is wrong)

Upon a hard client disconnect (without the appropriate notifications), Caddy's reverse_proxy forwards the client content unreliably: sometimes entirely (OK), sometimes broken (error: context canceled), causing the upstream application behind reverse_proxy to receive errors.

2b. Why it's a bug (if it's not obvious)

Inconsistency: upstream applications error out. In this particular use case, low-latency streaming cuts off because of missing segments (due to the canceled context) or a missing update of the m3u8 playlist.

2c. Log output

poc-caddy-1  | {"level":"debug","ts":1659176760.2804174,"logger":"http.handlers.reverse_proxy","msg":"upstream roundtrip","upstream":"uwsgi:8002","duration":0.001713273,"request":{"remote_ip":"172.18.0.1","remote_port":"51190","proto":"HTTP/1.1","method":"PUT","host":"localhost:8888","uri":"/example.m3u8","headers":{"User-Agent":["Lavf/58.76.100"],"X-Forwarded-Proto":["http"],"X-Forwarded-Host":["localhost:8888"],"Icy-Metadata":["1"],"Accept":["*/*"],"X-Forwarded-For":["172.18.0.1"],"X-Server-Name":["test"]}},"error":"context canceled"}
poc-caddy-1  | {"level":"debug","ts":1659176760.2806652,"logger":"http.handlers.reverse_proxy","msg":"upstream roundtrip","upstream":"uwsgi:8002","duration":0.00282887,"request":{"remote_ip":"172.18.0.1","remote_port":"51188","proto":"HTTP/1.1","method":"PUT","host":"localhost:8888","uri":"/example190.m4s","headers":{"X-Server-Name":["test"],"Accept":["*/*"],"X-Forwarded-Proto":["http"],"X-Forwarded-Host":["localhost:8888"],"Icy-Metadata":["1"],"X-Forwarded-For":["172.18.0.1"],"User-Agent":["Lavf/58.76.100"]}},"error":"context canceled"}

2d. Workaround(s)

none available

2e. Relevant links

PoC: https://github.com/kuhnchris/caddyserver-ffmpeg-ll-hls-poc

3. Tutorial (minimal steps to reproduce the bug)

mholt commented 2 years ago

@kuhnchris Awesome, thanks for setting this up. I'll take a look as soon as I get in the office after the weekend!

Meanwhile, you and anyone are welcome to investigate further.

kuhnchris commented 2 years ago

You're welcome, thanks for taking a peek. I'm not really a Go programmer, so please excuse my lack of knowledge here, but I followed the error pretty much down to this point: reverseproxy.go -> res, err := h.Transport.RoundTrip(...) (line 740ff). (screenshot)

Sadly, it appears the round trip itself is doing something, so I cannot progress further, as I have no idea where RoundTrip is defined (it seems to be an internal Go/HTTP object?). All I do know is that the result, in case of the round-trip error, is empty, so I couldn't even do an h.bufferedBody or anything along those lines. I did, however, notice this note within the RoundTrip interface definition (screenshot). Again, since I have no idea where the RoundTrip object is defined within Caddy, I cannot follow it further, so that's the maximum extent I can help out here, sorry.

Thanks again for checking and all the best, Chris

francislavoie commented 2 years ago

In your case, you're using the HTTP transport, so it goes here:

https://github.com/caddyserver/caddy/blob/db1aa5b5bc174e5a5df39a277f737b304e1e2350/modules/caddyhttp/reverseproxy/httptransport.go#L299

Which basically just calls down to the Go stdlib to do the work.
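
Schematically, and only as a simplified illustration (not the actual module code), the delegation looks like this: the transport module wraps a stdlib *http.Transport, so the round trip itself, including the context-cancellation handling, happens inside net/http.

```
package proxysketch

import "net/http"

// Simplified sketch of the delegation described above: the proxy's
// transport hands the request to a standard library *http.Transport,
// which is the code that aborts with context.Canceled when the
// client's request context is canceled.
type httpTransportSketch struct {
	inner *http.Transport
}

func (t *httpTransportSketch) RoundTrip(req *http.Request) (*http.Response, error) {
	return t.inner.RoundTrip(req)
}
```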

kuhnchris commented 2 years ago

Hmm, so would it be a viable option to create a sort of mini app that keeps sending data to an endpoint (uwsgi) with the standard Go HTTP library, to see whether a manual kill of the app causes the same behavior? I doubt we can debug the default HTTP library...
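
A rough sketch of such a test client in Go might look like the following (the address, path, and body are made-up placeholders, not taken from the PoC repo): it writes a chunked PUT over a raw TCP connection and closes it without reading the response, similar to what ffmpeg does with -http_persistent 0.

```
// Hypothetical reproducer sketch: send a chunked PUT through the proxy
// and close the connection without reading the response.
package main

import (
	"fmt"
	"net"
)

func main() {
	// Assumed proxy address; adjust to the local setup.
	conn, err := net.Dial("tcp", "localhost:8888")
	if err != nil {
		panic(err)
	}
	// Hand-written HTTP/1.1 request with a chunked body.
	fmt.Fprint(conn, "PUT /test.m3u8 HTTP/1.1\r\n")
	fmt.Fprint(conn, "Host: localhost:8888\r\n")
	fmt.Fprint(conn, "Transfer-Encoding: chunked\r\n\r\n")
	fmt.Fprint(conn, "5\r\nhello\r\n") // one 5-byte chunk
	fmt.Fprint(conn, "0\r\n\r\n")      // terminating chunk
	// Close immediately instead of reading the response,
	// mimicking a client that does not care about the answer.
	conn.Close()
}
```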

kuhnchris commented 2 years ago

OK, so I simplified all of this a lot further, so we have the following setup now:

You can go to the repository, check it out, start Caddy in caddy/, and then run send_app.py in ffmpeg_alternative a couple of times; it should trigger the error. Technically you may need to create a "test.m3u8" in /tmp, but it should cause the error even without it.

2022/08/09 20:10:00.270 DEBUG   http.handlers.reverse_proxy upstream roundtrip  {"upstream": "localhost:8881", "duration": 0.098810054, "request": {"remote_ip": "127.0.0.1", "remote_port": "36744", "proto": "HTTP/1.1", "method": "PUT", "host": "uwsgi", "uri": "/test.m3u8", "headers": {"X-Server-Name": ["test"], "X-Forwarded-For": ["127.0.0.1"], "User-Agent": [""], "X-Forwarded-Proto": ["http"], "X-Forwarded-Host": ["uwsgi"]}}, "error": "context canceled"}
2022/08/09 20:10:00.270 INFO    http.log.access.default handled request {"request": {"remote_ip": "127.0.0.1", "remote_port": "36744", "proto": "HTTP/1.1", "method": "PUT", "host": "uwsgi", "uri": "/test.m3u8", "headers": {"X-Server-Name": ["wusgi"], "Connection": ["close"]}}, "user_id": "", "duration": 0.098967826, "size": 0, "status": 0, "resp_headers": {"Server": ["Caddy"]}}

I'm kind of at a loss here; it seems like if I cannot solve this I may have to look for an alternative, unless someone else has an idea how we could at least avoid this issue...

Thanks again for looking and have a good one!

mholt commented 2 years ago

Great, thank you for the simpler reproducer! I need to find time to work on this, as I'm currently deep in the mud on another complex issue. I'll return to this, though; if someone debugs it sooner, that would definitely be faster!

mholt commented 2 years ago

@kuhnchris I was able to repro the error with the simpler instructions, will now investigate.

mholt commented 2 years ago

@kuhnchris Oh, this is simple: the Python script is closing the socket before the response is read:

```
...
# Comment those 2 lines to reproduce the problem
# response = sock.recv(4096)
# print(response.decode())

sock.close()
```

It's that recv() call that's important. Since the client cancels the request before the response is received, it results in "context canceled".

AFAICT there's no bug here; that's normal and expected if the client cancels the request early.
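
As a standalone illustration of that diagnosis (generic Go, not Caddy code; the upstream URL and the timing are placeholders), canceling a request's context while the transport is still waiting on the backend makes the standard library report context.Canceled, which is exactly the error the proxy logs:

```
package main

import (
	"context"
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	req, err := http.NewRequestWithContext(ctx, http.MethodPut,
		"http://localhost:8002/example.m3u8", // placeholder upstream
		strings.NewReader("segment data"))
	if err != nil {
		panic(err)
	}
	// Simulate the client going away before the response arrives.
	go func() {
		time.Sleep(10 * time.Millisecond)
		cancel()
	}()
	// Against a backend that is slow to respond, this returns an error
	// wrapping context.Canceled.
	_, err = http.DefaultClient.Do(req)
	fmt.Println(err)
}
```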

kuhnchris commented 2 years ago

Well, for low-latency streaming, that's the approach ffmpeg actually takes, so we could change it in our example script, but I doubt we can get ffmpeg to change this. There is a parameter that closes the connection on their side, but this halts the entire process. The interesting thing is that it works if we do not use Caddy as middleware; the context break simply happens whenever the connection is just "closed" (in this case because we do not care about the answer; the encoder just sends the data in, and whatever happens with it is the receiver's call).

Hence the confusion about why this happens with the middleware in between. If that is expected behavior for Caddy, then I'll have to route the encoder data directly to the server; I just would have preferred to have everything go via Caddy.

mholt commented 2 years ago

@kuhnchris

Well, for low-latency streaming, that's the approach ffmpeg actually takes

ffmpeg closes the connection without waiting for or reading a response? I do find that very odd -- why don't they need a response?

We could probably make it configurable, but I'm having a hard time understanding why a proxy should keep a connection open even though the client isn't there anymore. Isn't that the definition of a resource leak? What do we do with the response, etc?

Sorry if it comes across like I'm pushing against this. I just want to understand it very thoroughly before making a change.

kuhnchris commented 2 years ago

No problem, I'll try to explain: low-latency streaming basically means we try to push the data to the server as fast as technically possible. The response isn't really anything we care about; if the server breaks down during transmission of the stream, there are more serious problems going on than an issue while transmitting. That's why the response is pretty much irrelevant. The only thing relevant for the encoder is to take data in as fast as possible (from cameras, devices, etc.), transform it to the target encoding, and push it out of memory so the next frame (pack) can be handled.

Hence: (video) --> (ffmpeg encoder) --> (caddy) --> (server)

The server's job is to take the data and distribute it to our clients and target systems, so on that end it handles all the proper HTTP requests; it just cannot handle the data coming from Caddy, because the request, although already sent (and the client long gone), never gets passed on to the server.

Technically, IMHO, it's like filling a cache: the process that fills the cache simply drops, and for some reason Caddy refuses to give us the last of the data the cache filler sent, because there is no one left to tell that the data was transmitted.

I know this is a really extreme edge case, so I understand if you say it's too specific for a general-purpose reverse_proxy server; just giving it a try. ;-)

If it's still unclear, let me know and I'll try to come up with another explanation for you.

mholt commented 2 years ago

I see. So maybe if flush_interval is -1, we should always finish writing to the backend even if the client closes the connection.

kuhnchris commented 2 years ago

Yes, that would make sense. If there is no flush_interval, or a negative one, it would make sense that there is no buffering and the data is flushed straight through to the (upstream) server. Reporting the error is fine for me, as long as the data ends up on the server side. :-)

mholt commented 2 years ago

@kuhnchris This is a little tricky, though: the standard library's http.Transport is the one returning the context.Canceled error; in other words, it's not our code that streams back the response. I will see if I can make it work, but no promises.

mholt commented 2 years ago

I haven't been able to actually prove this, but based on some experiments with replacing the request's context with one that doesn't get canceled when the client disconnects, I believe the context.Canceled error originates here: https://cs.opensource.google/go/go/+/refs/tags/go1.19:src/net/http/transport.go;l=2665-2668;drc=c81dfdd47aa44bb1da0f60e36742cc0103da4837

mholt commented 2 years ago

@kuhnchris Could you please try the commit pushed to branch ctxcanceled? https://github.com/caddyserver/caddy/tree/ctxcanceled - 3ffc391

I was able to get the error to go away using the send_app.py reproducer, but I doubt it actually fixes the problem. Still, I'd be interested in your test result.

kuhnchris commented 2 years ago

@kuhnchris Could you please try the commit pushed to branch ctxcanceled? https://github.com/caddyserver/caddy/tree/ctxcanceled - 3ffc391

I was able to get the error to go away using the send_app.py reproducer, but I doubt it actually fixes the problem. Still, I'd be interested in your test result.

Hello,

Thanks, I'll grab the branch and check it as soon as I get to my workstation. Thanks for taking a look; I will report back.

kuhnchris commented 2 years ago

OK, I checked it; sadly, the error still pops up with the "real" deal (i.e. the Docker and ffmpeg combination), so I assume the fix may only cover that particular method of closing the connection, sorry. :(

kuhnchris commented 2 years ago

Sorry, I have to take that back, something is fishy: I just tried it again with the mini uwsgi example and added output to the files, and it seems to output everything fine, so something is going on. I'm having a co-worker investigate this and will get back to you real quick, sorry about the confusion!

I'll ping this thread again if something comes up.

mholt commented 2 years ago

so something is going on,

As in, something is working correctly? :smile: Anyway, I hope that's the case! Make sure you're running the right binary built from the right commit! (caddy version can help with that)

Let me know!!

kuhnchris commented 2 years ago

OK, the test that had the issue didn't have flush_interval -1 in the config; that's why it looked like the pre-fix behavior. With this fix we actually managed to get sub-5s low-latency streaming, so it does fix the issue in question and solves it for our usage. :-)

mholt commented 2 years ago

Hurray! OK, I think this fix may be a viable solution then (once I clean it up slightly). We replace the request's context with one that doesn't get canceled when the client closes the connection, just before we pass it to the standard library.
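
Conceptually, the trick can be sketched like this (an illustration of the idea only, not the actual commit; the type and function names here are made up, and newer Go versions offer context.WithoutCancel for the same purpose):

```
package proxysketch

import (
	"context"
	"net/http"
	"time"
)

// ignoreClientGoneContext keeps the parent context's values but never
// reports cancellation or a deadline, so the stdlib transport keeps
// streaming to the backend even after the client connection is gone.
type ignoreClientGoneContext struct {
	context.Context
}

func (ignoreClientGoneContext) Deadline() (time.Time, bool) { return time.Time{}, false }
func (ignoreClientGoneContext) Done() <-chan struct{}       { return nil }
func (ignoreClientGoneContext) Err() error                  { return nil }

// detachClientCancellation would be applied to the outgoing request
// just before it is handed to the transport (assumption: only when
// flush_interval is -1, as discussed above).
func detachClientCancellation(req *http.Request) *http.Request {
	return req.WithContext(ignoreClientGoneContext{req.Context()})
}
```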

mholt commented 1 month ago

FYI @kuhnchris, this behavior might change if I merge 8c9e87d into mainline. The reason is that I think my implementation of the patch is incorrect/buggy, since it overloads flush_interval to do two different things.

Ideally though, we'd have a way to just close the connection to the backend when the upload is complete if the client has disconnected, but that'll require a little more work/testing.