Thanks for opening an issue! We'll look into this.
It's not immediately clear to me what is going on, so I'll need your help to understand it better.
Ideally, we need to be able to reproduce the bug in the most minimal way possible. This allows us to write regression tests to verify the fix is working. If we can't reproduce it, then you'll have to test our changes for us until it's fixed -- and then we can't add test cases, either.
I've attached a template below that will help make this easier and faster! This will require some effort on your part -- please understand that we will be dedicating time to fix the bug you are reporting if you can just help us understand it and reproduce it easily.
This template will ask for some information you've already provided; that's OK, just fill it out the best you can. :+1: I've also included some helpful tips below the template. Feel free to let me know if you have any questions!
Thank you again for your report, we look forward to resolving it!
## 1. Environment
### 1a. Operating system and version
```
paste here
```
### 1b. Caddy version (run `caddy version` or paste commit SHA)
```
paste here
```
### 1c. Go version (if building Caddy from source; run `go version`)
```
paste here
```
## 2. Description
### 2a. What happens (briefly explain what is wrong)
### 2b. Why it's a bug (if it's not obvious)
### 2c. Log output
```
paste terminal output or logs here
```
### 2d. Workaround(s)
### 2e. Relevant links
## 3. Tutorial (minimal steps to reproduce the bug)
Environment: Please fill out your OS and Caddy versions, even if you don't think they are relevant. (They are always relevant.) If you built Caddy from source, provide the commit SHA and specify your exact Go version.
Description: Describe at a high level what the bug is. What happens? Why is it a bug? Not all bugs are obvious, so convince readers that it's actually a bug.
Tutorial: What are the minimum required specific steps someone needs to take in order to experience the same bug? Your goal here is to make sure that anyone else can have the same experience with the bug as you do. You are writing a tutorial, so make sure to carry it out yourself before posting it. Please:
- use `curl` (rather than a browser) to make requests where possible

Example of a tutorial:

Create a config file:
```
{ ... }
```

Open terminal and run Caddy:
```
$ caddy ...
```

Make an HTTP request:
```
$ curl ...
```

Notice that the result is ___ but it should be ___.
Did you try playing around with these buffering options? https://caddyserver.com/docs/caddyfile/directives/reverse_proxy#streaming
You might want to try setting the flush interval to -1.
Yes, I played around with flush_interval and buffer_responses, and the other two seem worth a try - I'll experiment and let you know. Thanks for bringing this up!
Hello @mholt - thanks for replying. I tried to fill in your template as well as possible; please let me know if I should investigate the images from Docker Hub for the version requirements, or if I missed anything you require! FYI: I did see that you said "don't run caddy in a container" - due to the nature of this we do require it to run in a container, though. I could try it outside, but that would require you to either set up a Python environment (for uWSGI) or nginx for a minimal test case.
Thanks and all the best, Chris
### 1a. Operating system and version
Docker - whatever the official `caddy:latest` image is based on.
### 1b. Caddy version (run `caddy version` or paste commit SHA)
The official `caddy:latest` Docker Hub image. At the time of writing: https://hub.docker.com/layers/caddy/library/caddy/latest/images/sha256-2728a7a5cc9045d475a134f9230565677fe26deb5306060a0ab766d8449f6ba4?context=explore
### 1c. Go version (if building Caddy from source; run `go version`)
Please refer to the official `caddy:latest` Docker Hub image.
### 2a. What happens
Upon a hard client disconnect (without the appropriate notifications), caddy (via `reverse_proxy`) forwards the client content unreliably: sometimes entirely (OK), sometimes broken (error: `context canceled`), causing the upstream application behind `reverse_proxy` to receive errors.
### 2b. Why it's a bug
Inconsistency; upstream applications error out. In this particular use case, low-latency streaming cuts off because of missing segments (due to `context canceled`) or a missing update of the m3u8 playlist.
### 2c. Log output
```
poc-caddy-1 | {"level":"debug","ts":1659176760.2804174,"logger":"http.handlers.reverse_proxy","msg":"upstream roundtrip","upstream":"uwsgi:8002","duration":0.001713273,"request":{"remote_ip":"172.18.0.1","remote_port":"51190","proto":"HTTP/1.1","method":"PUT","host":"localhost:8888","uri":"/example.m3u8","headers":{"User-Agent":["Lavf/58.76.100"],"X-Forwarded-Proto":["http"],"X-Forwarded-Host":["localhost:8888"],"Icy-Metadata":["1"],"Accept":["*/*"],"X-Forwarded-For":["172.18.0.1"],"X-Server-Name":["test"]}},"error":"context canceled"}
poc-caddy-1 | {"level":"debug","ts":1659176760.2806652,"logger":"http.handlers.reverse_proxy","msg":"upstream roundtrip","upstream":"uwsgi:8002","duration":0.00282887,"request":{"remote_ip":"172.18.0.1","remote_port":"51188","proto":"HTTP/1.1","method":"PUT","host":"localhost:8888","uri":"/example190.m4s","headers":{"X-Server-Name":["test"],"Accept":["*/*"],"X-Forwarded-Proto":["http"],"X-Forwarded-Host":["localhost:8888"],"Icy-Metadata":["1"],"X-Forwarded-For":["172.18.0.1"],"User-Agent":["Lavf/58.76.100"]}},"error":"context canceled"}
```
### 2d. Workaround(s)
None available.
### 2e. Relevant links
PoC: https://github.com/kuhnchris/caddyserver-ffmpeg-ll-hls-poc
## 3. Tutorial (minimal steps to reproduce the bug)
Make sure you have `ffmpeg` and `docker` installed!

1. `docker compose build` to create the required `uwsgi` upstream application
2. `docker compose up -d` to create the caddy and uwsgi (and nginx) containers
3. `cd ffmpeg; sh ll_hls.sh` to create a low-latency HLS stream from ffmpeg sent to caddy
4. `docker compose logs -f` to follow the container logs
@kuhnchris Awesome, thanks for setting this up. I'll take a look as soon as I get in the office after the weekend!
Meanwhile, you and anyone else are welcome to investigate further.
You're welcome, and thanks for taking a peek. I'm not really a Go programmer, so please excuse my lack of knowledge here, but I followed that error pretty much down to this point: (reverseproxy.go -> `res, err := h.Transport.RoundTrip(...)`, line 740ff)
Sadly, it appears that the round trip is where something goes wrong, so I cannot progress further, as I have no idea where RoundTrip is defined (it seems to be an internal Go / http object?) - all I do know is that the result, in case of the roundtrip error, is empty - so I couldn't even do a `h.bufferedBody` or anything along those lines. I did, however, notice this note within the RoundTrip interface definition:
Again, as I have no idea where the RoundTrip object is defined within caddy, I cannot follow it, so that's the maximum extent to which I can help out here, sorry.
Thanks again for checking and all the best, Chris
In your case, you're using the HTTP transport, so it goes here:
Which basically just calls down to the Go stdlib to do the work.
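For illustration, here is a minimal Go sketch (not Caddy's actual code; the type and names below are made up) of what "calling down to the stdlib" means: the proxy's RoundTrip wraps a `net/http` transport, so errors like `context.Canceled` surface from the standard library rather than from Caddy's own code.

```go
package main

import (
	"fmt"
	"net/http"
)

// loggingTransport is a stand-in for a proxy's HTTP transport:
// it delegates the actual work to the stdlib transport.
type loggingTransport struct {
	inner http.RoundTripper
}

func (t *loggingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.inner.RoundTrip(req) // context.Canceled would surface here
	if err != nil {
		fmt.Println("upstream roundtrip:", err)
	}
	return resp, err
}

func main() {
	client := &http.Client{Transport: &loggingTransport{inner: http.DefaultTransport}}
	resp, err := client.Get("http://localhost:8881/") // placeholder upstream
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```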
Hmm, so would it be a viable option to create a sort of mini app that keeps sending data to an endpoint (uwsgi) with the standard Go http lib, to see if a manual kill of the app would cause the same behavior? Since I doubt we can debug the default http lib...
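A rough sketch of that mini-app idea in Go (the URL and timings are placeholders for this PoC setup, not anything taken from the repo): it starts a PUT with a streaming body through the proxy and cancels the request mid-upload, imitating a client that disappears without reading the response.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// Stream segment-like chunks into the request body, then vanish.
	pr, pw := io.Pipe()
	go func() {
		for i := 0; i < 5; i++ {
			fmt.Fprintf(pw, "chunk %d\n", i)
			time.Sleep(50 * time.Millisecond)
		}
		cancel() // hard "disconnect": abandon the request mid-flight
		pw.CloseWithError(context.Canceled)
	}()

	req, err := http.NewRequestWithContext(ctx, http.MethodPut, "http://localhost:8888/example.m3u8", pr)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("client side:", err) // expect an error wrapping context.Canceled
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```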
OK, I simplified all of this far more, so we have the following setup now:
You can go to the repository, check it out, start caddy in caddy/, and then run the send_app.py in ffmpeg_alternative a couple of times, and it should trigger the error. Technically you may need to create a "test.m3u8" in /tmp, but it should cause the error even without it.
```
2022/08/09 20:10:00.270 DEBUG http.handlers.reverse_proxy upstream roundtrip {"upstream": "localhost:8881", "duration": 0.098810054, "request": {"remote_ip": "127.0.0.1", "remote_port": "36744", "proto": "HTTP/1.1", "method": "PUT", "host": "uwsgi", "uri": "/test.m3u8", "headers": {"X-Server-Name": ["test"], "X-Forwarded-For": ["127.0.0.1"], "User-Agent": [""], "X-Forwarded-Proto": ["http"], "X-Forwarded-Host": ["uwsgi"]}}, "error": "context canceled"}
2022/08/09 20:10:00.270 INFO http.log.access.default handled request {"request": {"remote_ip": "127.0.0.1", "remote_port": "36744", "proto": "HTTP/1.1", "method": "PUT", "host": "uwsgi", "uri": "/test.m3u8", "headers": {"X-Server-Name": ["wusgi"], "Connection": ["close"]}}, "user_id": "", "duration": 0.098967826, "size": 0, "status": 0, "resp_headers": {"Server": ["Caddy"]}}
```
I'm kind of at a loss here; it seems like if I cannot solve this, I may have to look for an alternative, unless somebody else has an idea how we could at least avoid this issue...
Thanks again for looking and have a good one!
Great, thank you for the simpler reproducer! I need to find time to work on this as I'm currently deep in the mud on another complex issue. I'll return to this, though - if someone debugs it sooner, that would definitely be faster!
@kuhnchris I was able to repro the error with the simpler instructions, will now investigate.
@kuhnchris Oh, this is simple: the Python script is closing the socket before the response is read:
```
...
# Comment those 2 lines to reproduce the problem
# response = sock.recv(4096)
# print(response.decode())
sock.close()
```
It's that `recv()` call that's important. Since the client cancels the request before the response is received, it results in "context canceled".
AFAICT there's no bug here; that's normal and expected if the client cancels the request early.
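For anyone following along, here is a small self-contained Go illustration of that mechanism (port and timeout are arbitrary): in `net/http`, a request's context is canceled as soon as the client connection goes away, so anything downstream that reuses that context - like a proxied roundtrip - sees `context.Canceled`.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-r.Context().Done():
			// Fires if the client hangs up before we write a response.
			fmt.Println("client gone:", r.Context().Err()) // context canceled
		case <-time.After(2 * time.Second):
			fmt.Fprintln(w, "finished normally")
		}
	})
	// e.g. `curl localhost:8080` and press Ctrl-C within 2 seconds
	http.ListenAndServe(":8080", nil)
}
```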
Well, for low-latency streaming, that's the approach ffmpeg actually takes, so we could change it in our example script, but I doubt we can get ffmpeg to change this. There is a parameter that closes the connection on their side, but this halts the entire process. The interesting thing is that it works if we do not use caddy as a middleware - the context break simply happens whenever the connection is just "closed" (in this case because we do not care about the answer; the encoder just sends the data in, and whatever happens with it is the receiver's call).
Hence the confusion about why this happens with the middleware in between. If that is expected behavior for caddy, then I'll have to reroute the encoder data directly to the server; I just would have preferred to have everything go via Caddy.
@kuhnchris

> Well, for low-latency streaming, that's the approach ffmpeg actually takes
ffmpeg closes the connection without waiting for or reading a response? I do find that very odd -- why don't they need a response?
We could probably make it configurable, but I'm having a hard time understanding why a proxy should keep a connection open even though the client isn't there anymore. Isn't that the definition of a resource leak? What do we do with the response, etc?
Sorry if it comes across like I'm pushing against this. I just want to understand it very thoroughly before making a change.
No problem, I'll try to explain: low-latency streaming basically means we try to push the data to the server as fast as technically possible. The response isn't really anything we care about; if the server breaks down during the transmission of the stream, there are more serious problems going on than having an issue while transmitting. That's why the response is pretty much irrelevant. The only thing relevant for the encoder is to take data in as fast as possible (from cameras, devices, etc.), transform it to the target encoding, and push it out of memory for the next frame(pack) to be handled.
Hence: (video) --> (ffmpeg encoder) --> (caddy) --> (server)
The server actually has the function of taking the data and distributing it to our client and target systems, so on that end it's handling all the proper HTTP requests; it's just that it cannot handle the data coming from caddy because the request - although it was sent and the client is long gone - is not being passed on to the server.
Technically, IMHO, it's like getting data into a cache, and the process that fills the cache simply drops - and for some reason, caddy refuses to give us the last of the data the cache-filler sent because there is no one left to tell that its data was transmitted.
I know this is a real edge case, so I do understand if you say it's too specific for a general-purpose reverse_proxy server; just giving it a try. ;-)
If it's still unclear, let me know and I'll try to come up with another explanation for you.
I see. So maybe if flush_interval is -1, we should always finish writing to the backend even if the client closes the connection.
Yes, that would make sense. If there is no flush_interval, or a negative one, it would make sense that there is no buffering and the data is flushed directly through to the upstream server - reporting the error is fine for me, as long as the data ends up on the server side. :-)
@kuhnchris This is a little tricky, though: the standard library's `http.Transport` is returning the `context.Canceled` error - in other words, it's not our code that streams back the response. I will see if I can make it work, but no promises.
I haven't been able to actually prove this but based on some experiments with replacing the request's context with one that doesn't get canceled when the client disconnects, I believe the context.Canceled error is originating from here: https://cs.opensource.google/go/go/+/refs/tags/go1.19:src/net/http/transport.go;l=2665-2668;drc=c81dfdd47aa44bb1da0f60e36742cc0103da4837
@kuhnchris Could you please try the commit pushed to branch `ctxcanceled`? https://github.com/caddyserver/caddy/tree/ctxcanceled - 3ffc391
I was able to get the error to go away using the send_app.py reproducer, but I doubt it actually fixes the problem. Still, I'd be interested in your test result.
> @kuhnchris Could you please try the commit pushed to branch `ctxcanceled`? https://github.com/caddyserver/caddy/tree/ctxcanceled - 3ffc391
Hello,
Thanks, I'll try to grab the branch and check it as soon as I get to my workstation - thanks for taking a look - will report back.
OK, checked it; sadly, the error still pops up on the "real" deal (i.e. the docker and ffmpeg combination), so I assume it may just be that particular method of closing the connection that worked, sorry. :(
Sorry, I have to take that back, something is fishy: I just tried it again with the mini-uwsgi example and added output to the files, and it seems to output everything fine, so something is going on. I am having a co-worker investigate this and will get back to you real quick, sorry about the confusion!
I'll ping this thread again if something comes up.
> so something is going on,

As in, something is working correctly? :smile: Anyway, I hope that's the case! Make sure you're running the right binary built from the right commit! (`caddy version` can help with that.)
Let me know!!
OK, the test that had the issue didn't have `flush_interval -1` in the config; that's why it looked like the pre-fix behavior. With this fix we actually managed to get sub-5s low-latency streaming, so it did fix the issue in question and solves it for our usage. :-)
Hurray! Ok, I think this fix may be a viable solution then (once I clean it up slightly). We replace the request's context with one that doesn't get canceled when the client closes the connection just before we pass it to the standard library.
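For readers landing here later, here is a hedged sketch of that idea; the wrapper type and helper below are hypothetical and not necessarily what the `ctxcanceled` commit actually does. The gist: keep the request context's values but drop its cancelation before handing the request to the stdlib transport, and only do so when flushing is disabled (flush_interval -1).

```go
package proxysketch

import (
	"context"
	"net/http"
	"time"
)

// ignoreCancelContext keeps the parent's values but reports no deadline and
// never signals Done, so a client disconnect no longer aborts the upstream write.
type ignoreCancelContext struct{ context.Context }

func (ignoreCancelContext) Deadline() (time.Time, bool) { return time.Time{}, false }
func (ignoreCancelContext) Done() <-chan struct{}       { return nil }
func (ignoreCancelContext) Err() error                  { return nil }

// detachIfStreaming is a hypothetical helper: when flush_interval is -1,
// detach the proxied request from the client's lifetime before passing it
// to the standard library transport.
func detachIfStreaming(r *http.Request, flushInterval time.Duration) *http.Request {
	if flushInterval == -1 {
		return r.WithContext(ignoreCancelContext{r.Context()})
	}
	return r
}
```

Newer Go releases (1.21+) expose this same pattern as `context.WithoutCancel`; a hand-rolled wrapper like the above works on older toolchains.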
FYI @kuhnchris, this behavior might change if I merge 8c9e87d into mainline. The reason is that I think my implementation of the patch is incorrect/buggy, since it overloads `flush_interval` to do two different things.
Ideally though, we'd have a way to just close the connection to the backend when the upload is complete if the client has disconnected, but that'll require a little more work/testing.
Hello,
first of all, great project, love the caddy server - love what you guys are doing, keep up the good work! :+1:
Now for my problem:
I am using the caddy server as a reverse proxy for a Python application that receives ffmpeg's streaming data and processes it (which is out of scope for this issue). Whenever the use case has the client closing its connection to caddy cleanly, we do not have any issues at all.
Now, we are trying some low-latency streaming, which requires running ffmpeg with the parameter `-http_persistent 0`, and this causes problems with caddy. Using ffmpeg's `http_persistent 0` causes the connection to simply drop out without a clean "exit". It looks like, when we are using caddy as the in-between proxy, whenever the connection drops out there's a 50:50 chance that it causes an issue on the Python side (OSErrors).
If we remove caddy from in between, it works fine (streaming directly to the Python server). This is not only related to uWSGI/Python; we have the same issue if you just use plain old nginx with WebDAV to PUT the files via `ffmpeg`. I have created a PoC project containing a minimum viable client for this issue: https://github.com/kuhnchris/caddyserver-ffmpeg-ll-hls-poc
In the default configuration (`docker compose up -d && cd ffmpeg && sh ll_hls.sh`) we see that caddy sometimes works without any hiccups, and then randomly throws this on the Python (and nginx) side: i.e. "context canceled". But only sometimes - it feels like the reverse_proxy target is not receiving everything, so it feels like this is a race condition.
On the nginx side it looks as follows:
The "400"s correspond to the broken requests; the 204s and 201s are OK. This happens pretty randomly. I tried looking around for the aforementioned "context canceled" and found two spots in reverse_proxy.go, but I didn't really see any way to work around this issue.
Does anyone have any idea how we could get this to work without caddy randomly causing exceptions/errors when the (client) connection is closed?